Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
> What do the benchmarks say with CONFIG_PARAVIRT on native hardware
> compared to !CONFIG_PARAVIRT. e.g. does lmbench suffer? 

Barely.  There's a slight hit for not using patching, and patching is
almost identical to native performance.  The most noticeable difference
is in the null syscall microbenchmark, but once you get to complex
things the difference is in the noise.

Processor, Processes - times in microseconds - smaller is better
--
Host OS  Mhz null null  open slct sig  sig  fork exec sh  
 call  I/O stat clos TCP  inst hndl proc proc proc
- -           
non-paravirt
ezr   Linux 2.6.21- 1000 0.25 0.52 31.6 34.7 10.3 1.03 5.31 726. 1565 4520
ezr   Linux 2.6.21- 1000 0.25 0.52 31.8 34.7 12.6 1.03 5.41 725. 1564 4585
ezr   Linux 2.6.21- 1000 0.25 0.55 31.7 34.5 11.8 1.02 5.47 720. 1595 4518

paravirt, no patching
ezr   Linux 2.6.21- 1000 0.28 0.55 31.3 34.3 10.0 1.05 5.56 747. 1621 4675
ezr   Linux 2.6.21- 1000 0.28 0.56 31.5 34.3 12.9 1.05 5.66 755. 1629 4684
ezr   Linux 2.6.21- 1000 0.28 0.55 31.8 34.5 12.5 1.05 5.45 747. 1622 4695

paravirt, patching
ezr   Linux 2.6.21- 1000 0.25 0.53 31.8 34.4 10.1 1.04 5.44 730. 1583 4600
ezr   Linux 2.6.21- 1000 0.26 0.55 32.1 35.2 13.3 1.03 5.48 748. 1589 4606
ezr   Linux 2.6.21- 1000 0.26 0.54 32.0 34.9 14.1 1.04 5.43 752. 1606 4647


J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Andi Kleen
On Wednesday 04 April 2007 11:25:57 Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > What do the benchmarks say with CONFIG_PARAVIRT on native hardware
> > compared to !CONFIG_PARAVIRT. e.g. does lmbench suffer? 
> 
> Barely.  There's a slight hit for not using patching, and patching is
> almost identical to native performance.  The most noticeable difference
> is in the null syscall microbenchmark, but once you get to complex
> things the difference is in the noise.

Why is there a difference for null syscall? I had assumed we patched all the 
fast path cases relevant there. Do you have an idea where it comes from?

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: A set of "standard" virtual devices?

2007-04-04 Thread Arnd Bergmann
On Wednesday 04 April 2007, H. Peter Anvin wrote:
> Configuration space access is platform-dependent.  It's only defined to 
> work in a specific way on x86 platforms.
> 
> "Interrupt swizzling" is really totally independent of PCI.  ALL PCI 
> really provides is up to four interrupts per device (not counting 
> MSI/MSI-X) and an 8-bit writable field which the platform can choose to 
> use to hold interrupt information.  That's all.  The rest is all 
> platform information.
> 
> PCI enumeration is hardly complex.  Most of the stuff that doesn't apply 
> to you you can generally ignore, as is done by other busses like 
> HyperTransport when they emulate PCI.

You still don't get my point: On a platform that doesn't have interrupt
numbers, and where most of the fields in the config space don't correspond
do anything that is already there, you really don't want to invent
a set of new hcalls that implement emulation, to get something as
simple as a pipe.

wc drivers/pci/*.[ch] include/asm-i386/{pci,io}.h lib/iomap*.c \
arch/i386/pci/*.c kernel/irq/*.c
17015  59037 463967 total

Even if you only need half of that code in reality, reimplementing
all that in both the kernel and in the hypervisor is an enourmous
effort. We've seen that before on the ps3, which initially faked
a virtual PCI bus just for the USB controller, but doing something
like that requires adding abstraction layers, to decide whether to
implement e.g. an inb as a hypercall or as a memory read.

> That being said, on platforms which are PCI-centric, such as x86, this 
> of course makes it a lot easier to produce virtual devices which work 
> across hypervisors, since the device model, of *any* operating system is 
> set up to handle them.

Yes, as I said there are two separate problems. I really think that
a standardized virtual driver interface should be modeled after
kernel <-> user interfaces, not hardware <-> kernel interfaces.

Once we know what operations we want (e.g. read, write and SIGIO,
or some other set of primitives), it will be good to provide a
virtual PCI device that can be used as one transport mechanism
below it. Using PCI device IDs to tell what functionality is
provided by the device would provide a reasonable method for
autoprobing.

Arnd <><

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
> Why is there a difference for null syscall? I had assumed we patched all the 
> fast path cases relevant there. Do you have an idea where it comes from?

Sure.  There's indirect calls for things like sti/cli/iret.  It goes
back to native speed when you patch the real instructions inline.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: A set of "standard" virtual devices?

2007-04-04 Thread H. Peter Anvin
Arnd Bergmann wrote:
> 
>> That being said, on platforms which are PCI-centric, such as x86, this 
>> of course makes it a lot easier to produce virtual devices which work 
>> across hypervisors, since the device model, of *any* operating system is 
>> set up to handle them.
> 
> Yes, as I said there are two separate problems. I really think that
> a standardized virtual driver interface should be modeled after
> kernel <-> user interfaces, not hardware <-> kernel interfaces.
> 
> Once we know what operations we want (e.g. read, write and SIGIO,
> or some other set of primitives), it will be good to provide a
> virtual PCI device that can be used as one transport mechanism
> below it. Using PCI device IDs to tell what functionality is
> provided by the device would provide a reasonable method for
> autoprobing.
> 

That seems like a reasonable approach.  I *do* care about 
hardware-equivalent interfaces, because they, too, keep getting 
reinvented, but it seems reasonable to approach it in a layered fashion 
like you describe.

-hpa

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Andi Kleen
On Wednesday 04 April 2007 17:45:44 Jeremy Fitzhardinge wrote:
> Andi Kleen wrote:
> > Why is there a difference for null syscall? I had assumed we patched all 
> > the 
> > fast path cases relevant there. Do you have an idea where it comes from?
> 
> Sure.  There's indirect calls for things like sti/cli/iret.  It goes
> back to native speed when you patch the real instructions inline.

I was talking about the patched case. It seemed to be a little slower
too, but in theory it shouldn't have been, no? 

-Andi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] Unified lguest launcher

2007-04-04 Thread Glauber de Oliveira Costa

This is a new version of the unified lguest launcher that applies to
the current tree. According to rusty's suggestion, I'm bothering less
to be able to load 32 bit kernels on 64-bit machines: changing the
launcher for such case would be the easy part! In the absence of
further objections, I'll commit it.

Signed-off-by: Glauber de Oliveira Costa <[EMAIL PROTECTED]>

--
Glauber de Oliveira Costa.
"Free as in Freedom"

"The less confident you are, the more serious you have to act."
Index: linux-2.6.20/Documentation/lguest/Makefile
===
--- linux-2.6.20.orig/Documentation/lguest/Makefile
+++ linux-2.6.20/Documentation/lguest/Makefile
@@ -1,12 +1,13 @@
 # This creates the demonstration utility "lguest" which runs a Linux guest.
 
-# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
-# Some shells (dash - ubunu) can't handle numbers that big so we cheat.
+#we could uname -i, but it seems to return unknown on a bunch locations
+ARCH:=$(shell uname -m | sed s/i[3456]86/i386/)
+
 include ../../.config
-LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x0800)
+include $(ARCH)/defines
 
 CFLAGS:=-Wall -Wmissing-declarations -Wmissing-prototypes -O3 \
-	-static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds
+	-static -DLGUEST_GUEST_TOP="$(LGUEST_GUEST_TOP)" -Wl,-T,lguest.lds -I$(ARCH)
 LDLIBS:=-lz
 
 all: lguest.lds lguest
Index: linux-2.6.20/Documentation/lguest/i386/defines
===
--- /dev/null
+++ linux-2.6.20/Documentation/lguest/i386/defines
@@ -0,0 +1,4 @@
+# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
+# Some shells (dash - ubunu) can't handle numbers that big so we cheat.
+include ../../.config
+LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x0800)
Index: linux-2.6.20/Documentation/lguest/i386/lguest_defs.h
===
--- /dev/null
+++ linux-2.6.20/Documentation/lguest/i386/lguest_defs.h
@@ -0,0 +1,9 @@
+#ifndef _LGUEST_DEFS_H_
+#define _LGUEST_DEFS_H_
+
+/* LGUEST_TOP_ADDRESS comes from the Makefile */
+#define RESERVE_TOP_ADDRESS LGUEST_GUEST_TOP - 1024*1024
+
+#include "../../../include/linux/lguest_launcher.h"
+
+#endif
Index: linux-2.6.20/Documentation/lguest/lguest.c
===
--- linux-2.6.20.orig/Documentation/lguest/lguest.c
+++ linux-2.6.20/Documentation/lguest/lguest.c
@@ -33,7 +33,7 @@
 typedef uint32_t u32;
 typedef uint16_t u16;
 typedef uint8_t u8;
-#include "../../include/linux/lguest_launcher.h"
+#include 
 
 #define PAGE_PRESENT 0x7 	/* Present, RW, Execute */
 #define NET_PEERNUM 1
@@ -64,7 +64,7 @@ struct device
 
 	/* Watch DMA to this key if handle_input non-NULL. */
 	unsigned long watch_key;
-	u32 (*handle_output)(int fd, const struct iovec *iov,
+	unsigned long (*handle_output)(int fd, const struct iovec *iov,
 			 unsigned int num, struct device *me);
 
 	/* Device-specific data. */
@@ -94,20 +94,29 @@ static void *map_zeroed_pages(unsigned l
 }
 
 /* Returns the entry point */
-static unsigned long map_elf(int elf_fd, const Elf32_Ehdr *ehdr,
+
+static unsigned long map_elf(int elf_fd, const void *hdr, 
 			 unsigned long *page_offset)
 {
-	void *addr;
+#ifndef __x86_64__
+	const Elf32_Ehdr *ehdr = hdr;
 	Elf32_Phdr phdr[ehdr->e_phnum];
+#else
+	const Elf64_Ehdr *ehdr = hdr;
+	Elf64_Phdr phdr[ehdr->e_phnum];
+#endif
+	void *addr;
 	unsigned int i;
 
 	/* Sanity checks. */
 	if (ehdr->e_type != ET_EXEC
-	|| ehdr->e_machine != EM_386
-	|| ehdr->e_phentsize != sizeof(Elf32_Phdr)
-	|| ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(Elf32_Phdr))
+	|| ((ehdr->e_machine != EM_386) &&
+		(ehdr->e_machine != EM_X86_64))
+	|| ehdr->e_phentsize != sizeof(phdr[0])
+	|| ehdr->e_phnum < 1 || ehdr->e_phnum > 65536U/sizeof(phdr[0]))
 		errx(1, "Malformed elf header");
 
+
 	if (lseek(elf_fd, ehdr->e_phoff, SEEK_SET) < 0)
 		err(1, "Seeking to program headers");
 	if (read(elf_fd, phdr, sizeof(phdr)) != sizeof(phdr))
@@ -120,13 +129,17 @@ static unsigned long map_elf(int elf_fd,
 		if (phdr[i].p_type != PT_LOAD)
 			continue;
 
-		verbose("Section %i: size %i addr %p\n",
-			i, phdr[i].p_memsz, (void *)phdr[i].p_paddr);
+		verbose("Section %i: size %lu addr %p\n",
+		i, (unsigned long)phdr[i].p_memsz, (void *)phdr[i].p_paddr);
 
 		/* We expect linear address space. */
 		if (!*page_offset)
 			*page_offset = phdr[i].p_vaddr - phdr[i].p_paddr;
-		else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
+		else if ((*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
+#ifdef __x86_64__
+			 && (phdr[i].p_vaddr != VSYSCALL_START)
+#endif
+			)
 			errx(1, "Page offset of section %i different", i);
 
 		/* We map everything private, writable. */
@@ -210,15 +223,18 @@ static unsigned long load_bzimage(int fd
 	errx(1, "Could not find kernel in bzImage");
 }
 
-static unsigned lon

Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
> I was talking about the patched case. It seemed to be a little slower
> too, but in theory it shouldn't have been, no? 

Oh, the .25 vs .26?  That's definitely noise; on other runs I see no
difference:

paravirt, patched, busy machine
ezr   Linux 2.6.21- 1000 0.25 0.52 31.3 34.4 12.7 1.02 5.44 743. 1578 4582
ezr   Linux 2.6.21- 1000 0.25 0.53 31.4 34.2 12.1 1.02 5.27 736. 1594 4806
ezr   Linux 2.6.21- 1000 0.25 0.52 31.5 34.4 14.3 1.02 5.49 736. 1587 4601

(I didn't use this result because I did it on a fairly busy machine, but
the best-case numbers are actually somewhat better than the run on a
completely quiet machine.)


J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/6] Re-enable VDSO by default with PARAVIRT

2007-04-04 Thread Andi Kleen
On Wednesday 04 April 2007 03:06:56 Jeremy Fitzhardinge wrote:
> Everyone wants VDSO to be enabled by default.  COMPAT_VDSO still needs
> a fix, but with luck that will turn up soon.

Hmm, that would break at least my test system until Jan's patch is applied.

-Andi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/6] Re-enable VDSO by default with PARAVIRT

2007-04-04 Thread Jeremy Fitzhardinge
Andi Kleen wrote:
> Hmm, that would break at least my test system until Jan's patch is applied.
>   

Boot with vdso=0?

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt
This is taken from the work I did on lguest64.

When killing a guest, we read the guest stack to do a nice back trace of
the guest and send it via printk to the host.

So instead of just getting an error message from the lguest launcher of:

lguest: bad read address 537012178 len 1

I also get in my dmesg:

called from  [] show_trace_log_lvl+0x1a/0x2f
 [] show_trace+0x12/0x14
 [] dump_stack+0x16/0x18
 [] lguest_dump_lg_regs+0x22/0x13c [lg]
 [] lgread+0x59/0x90 [lg]
 [] run_guest+0x26b/0x406 [lg]
 [] read+0x73/0x7d [lg]
 [] vfs_read+0xad/0x161
 [] sys_read+0x3d/0x61
 [] syscall_call+0x7/0xb
 ===
[] lgread+0x59/0x90 [lg]
Printing LG 0 regs cr3: 021eb000
EIP: 0061:  []
ESP: 0069:c236fe3c  EFLAGS: 00010202
EAX: 0004 EBX: e001fb20 ECX: 0008 EDX: 03f2
ESI: e001ee00 EDI: e001fb60 EBP: c236fea0
 CR2: 1278000  lguest_data->cr2: 80011380
errcode: 0   trapnum: d
Stack Dump:
 [] trace_hardirqs_on+0x125/0x149
 [] wait_for_completion+0x90/0x98
 [] __mutex_unlock_slowpath+0x129/0x13e
 [] unlock_cpu_hotplug+0x62/0x64
 [] sys_init_module+0x14e3/0x162c
 [] do_sync_read+0xc2/0xff
 [] restore_nocheck+0x12/0x15
 [] syscall_call+0x7/0xb


TODO:

  - Clean up a little (still has stuff from lguest64 in it).
  - Perhaps make a config option or runtime switch to turn it off.
  - Send to the launcher the dump instead of printk.
  - make modules work too.

Also I need to change the %u of the bad read print to a %x, because
seeing 0x200227d2 is better than seeing 537012178 for addresses.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/drivers/lguest/Makefile
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/Makefile
+++ linux-2.6.21-rc5-mm2/drivers/lguest/Makefile
@@ -4,4 +4,4 @@ obj-$(CONFIG_LGUEST_GUEST) += lguest.o l
 # Host requires the other files, which can be a module.
 obj-$(CONFIG_LGUEST)   += lg.o
 lg-objs := core.o hypercalls.o page_tables.o interrupts_and_traps.o \
-   segments.o io.o lguest_user.o hypervisor.o
+   segments.o io.o lguest_user.o hypervisor.o lguest_debug.o
Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -210,6 +210,28 @@ int lguest_address_ok(const struct lgues
 }
 
 /* Just like get_user, but don't let guest access lguest binary. */
+u8 lgread_u8(struct lguest *lg, u32 addr)
+{
+   u8 val = 0;
+
+   /* Don't let them access lguest binary */
+   if (!lguest_address_ok(lg, addr)
+   || get_user(val, (u32 __user *)addr) != 0)
+   kill_guest(lg, "bad read address %u", addr);
+   return val;
+}
+
+u16 lgread_u16(struct lguest *lg, u32 addr)
+{
+   u16 val = 0;
+
+   /* Don't let them access lguest binary */
+   if (!lguest_address_ok(lg, addr)
+   || get_user(val, (u32 __user *)addr) != 0)
+   kill_guest(lg, "bad read address %u", addr);
+   return val;
+}
+
 u32 lgread_u32(struct lguest *lg, u32 addr)
 {
u32 val = 0;
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h
+++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
@@ -176,6 +176,8 @@ extern struct mutex lguest_lock;
 /* core.c: */
 /* Entry points in hypervisor */
 const unsigned long *__lguest_default_idt_entries(void);
+u8 lgread_u8(struct lguest *lg, u32 addr);
+u16 lgread_u16(struct lguest *lg, u32 addr);
 u32 lgread_u32(struct lguest *lg, u32 addr);
 void lgwrite_u32(struct lguest *lg, u32 val, u32 addr);
 void lgread(struct lguest *lg, void *buf, u32 addr, unsigned bytes);
@@ -238,6 +240,7 @@ int hypercall(struct lguest *info, struc
 #define kill_guest(lg, fmt...) \
 do {   \
if (!(lg)->dead) {  \
+   lguest_dump_lg_regs(lg);\
(lg)->dead = kasprintf(GFP_ATOMIC, fmt);\
if (!(lg)->dead)\
(lg)->dead = (void *)1; \
@@ -248,5 +251,11 @@ static inline unsigned long guest_pa(str
 {
return vaddr - lg->page_offset;
 }
+
+/* lguest_debug.c */
+void lguest_print_address(struct lguest *lg, unsigned long address);
+void lguest_dump_trace(struct lguest *lg, struct lguest_regs *regs);
+void lguest_dump_lg_regs(struct lguest *lg);
+
 #endif /* __ASSEMBLY__ */
 #endif /* _LGUEST_H */
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lguest.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/l

[PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Steven Rostedt
[Bug that was found by my previous patch]

This patch allows things like modules, which don't have a direct
__pa(EIP) mapping to do emulated instructions.

Sure, the emulated instruction probably should be a paravirt_op, but
this patch lets you at least boot a kernel that has modules needing
emulated instructions.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -160,11 +160,14 @@ static int emulate_insn(struct lguest *l
 {
u8 insn;
unsigned int insnlen = 0, in = 0, shift = 0;
-   unsigned long physaddr = guest_pa(lg, lg->regs->eip);
+   unsigned long physaddr = lguest_find_guest_paddr(lg, lg->regs->eip);
 
-   /* This only works for addresses in linear mapping... */
-   if (lg->regs->eip < lg->page_offset)
+   /* FIXME: Handle physaddr's that crosses pages (modules are in VM) */
+
+   /* did we actually find the physaddr? */
+   if (physaddr == (unsigned long)-1UL)
return 0;
+
lgread(lg, &insn, physaddr, 1);
 
/* Operand size prefix means it's actually for ax. */
Index: linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/lg.h
+++ linux-2.6.21-rc5-mm2/drivers/lguest/lg.h
@@ -218,6 +218,7 @@ void guest_set_pte(struct lguest *lg, un
 void map_hypervisor_in_guest(struct lguest *lg, struct lguest_pages *pages);
 int demand_page(struct lguest *info, unsigned long cr2, int write);
 void pin_page(struct lguest *lg, unsigned long vaddr);
+unsigned long lguest_find_guest_paddr(struct lguest *lg, unsigned long vaddr);
 
 /* lguest_user.c: */
 int lguest_device_init(void);
Index: linux-2.6.21-rc5-mm2/drivers/lguest/page_tables.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/page_tables.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/page_tables.c
@@ -105,6 +105,25 @@ static spte_t gpte_to_spte(struct lguest
return spte;
 }
 
+unsigned long lguest_find_guest_paddr(struct lguest *lg, unsigned long vaddr)
+{
+   gpgd_t gpgd;
+   gpte_t gpte;
+   unsigned long gpte_ptr;
+
+   gpgd = mkgpgd(lgread_u32(lg, gpgd_addr(lg, vaddr)));
+   if (!(gpgd.flags & _PAGE_PRESENT))
+   return -1;
+
+   gpte_ptr = gpte_addr(lg, gpgd, vaddr);
+   gpte = mkgpte(lgread_u32(lg, gpte_ptr));
+
+   if (!(gpte.flags & _PAGE_PRESENT))
+   return -1;
+
+   return (gpte.pfn << PAGE_SHIFT) | (vaddr & (PAGE_SIZE-1));
+}
+
 /* FIXME: We hold reference to pages, which prevents them from being
swapped.  It'd be nice to have a callback when Linux wants to swap out. */
 


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt
Currently the lguest32 error messages from bad reads and writes prints a
decimal integer for addresses. This is pretty annoying. So this patch
changes those to be hex outputs.

This is applied on top of my debug patch.

Signed-off-by: Steven Rostedt <[EMAIL PROTECTED]>

Index: linux-2.6.21-rc5-mm2/drivers/lguest/core.c
===
--- linux-2.6.21-rc5-mm2.orig/drivers/lguest/core.c
+++ linux-2.6.21-rc5-mm2/drivers/lguest/core.c
@@ -220,7 +220,7 @@ u8 lgread_u8(struct lguest *lg, u32 addr
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad read address %u", addr);
+   kill_guest(lg, "bad read address %x", addr);
return val;
 }
 
@@ -231,7 +231,7 @@ u16 lgread_u16(struct lguest *lg, u32 ad
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad read address %u", addr);
+   kill_guest(lg, "bad read address %x", addr);
return val;
 }
 
@@ -242,7 +242,7 @@ u32 lgread_u32(struct lguest *lg, u32 ad
/* Don't let them access lguest binary */
if (!lguest_address_ok(lg, addr)
|| get_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad read address %u", addr);
+   kill_guest(lg, "bad read address %x", addr);
return val;
 }
 
@@ -250,7 +250,7 @@ void lgwrite_u32(struct lguest *lg, u32 
 {
if (!lguest_address_ok(lg, addr)
|| put_user(val, (u32 __user *)addr) != 0)
-   kill_guest(lg, "bad write address %u", addr);
+   kill_guest(lg, "bad write address %x", addr);
 }
 
 void lgread(struct lguest *lg, void *b, u32 addr, unsigned bytes)
@@ -259,7 +259,7 @@ void lgread(struct lguest *lg, void *b, 
|| copy_from_user(b, (void __user *)addr, bytes) != 0) {
/* copy_from_user should do this, but as we rely on it... */
memset(b, 0, bytes);
-   kill_guest(lg, "bad read address %u len %u", addr, bytes);
+   kill_guest(lg, "bad read address %x len %u", addr, bytes);
}
 }
 
@@ -268,7 +268,7 @@ void lgwrite(struct lguest *lg, u32 addr
if (addr + bytes < addr
|| !lguest_address_ok(lg, addr+bytes)
|| copy_to_user((void __user *)addr, b, bytes) != 0)
-   kill_guest(lg, "bad write address %u len %u", addr, bytes);
+   kill_guest(lg, "bad write address %x len %u", addr, bytes);
 }
 
 static void set_ts(unsigned int guest_ts)


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 17:56 +0200, Andi Kleen wrote:
> On Wednesday 04 April 2007 17:45:44 Jeremy Fitzhardinge wrote:
> > Andi Kleen wrote:
> > > Why is there a difference for null syscall? I had assumed we patched all 
> > > the 
> > > fast path cases relevant there. Do you have an idea where it comes from?
> > 
> > Sure.  There's indirect calls for things like sti/cli/iret.  It goes
> > back to native speed when you patch the real instructions inline.
> 
> I was talking about the patched case. It seemed to be a little slower
> too, but in theory it shouldn't have been, no? 

You'll still have the damage inflicted on gcc's optimizer, though.

Rusty.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 04/17] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Rusty Russell wrote:
> You'll still have the damage inflicted on gcc's optimizer, though.

Well, I could remove the clobbers for PVOP_CALL[0-2] and add the
appropriate push/pops, and put similar push/pop wrappers around all the
called functions.  But it doesn't make it any prettier.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 09/20] rename struct paravirt_patch to paravirt_patch_site for clarity

2007-04-04 Thread Jeremy Fitzhardinge
Rename struct paravirt_patch to paravirt_patch_site, so that it
clearly refers to a callsite, and not the patch which may be applied
to that callsite.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/alternative.c |9 -
 arch/i386/kernel/vmi.c |4 
 include/asm-i386/alternative.h |8 +---
 include/asm-i386/paravirt.h|5 -
 4 files changed, 13 insertions(+), 13 deletions(-)

===
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -335,9 +335,10 @@ void alternatives_smp_switch(int smp)
 #endif
 
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
-{
-   struct paravirt_patch *p;
+void apply_paravirt(struct paravirt_patch_site *start,
+   struct paravirt_patch_site *end)
+{
+   struct paravirt_patch_site *p;
 
if (noreplace_paravirt)
return;
@@ -355,8 +356,6 @@ void apply_paravirt(struct paravirt_patc
/* Sync to be conservative, in case we patched following instructions */
sync_core();
 }
-extern struct paravirt_patch __parainstructions[],
-   __parainstructions_end[];
 #endif /* CONFIG_PARAVIRT */
 
 void __init alternative_instructions(void)
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -70,10 +70,6 @@ static struct {
void (*set_initial_ap_state)(int, int);
void (*halt)(void);
 } vmi_ops;
-
-/* XXX move this to alternative.h */
-extern struct paravirt_patch __parainstructions[],
-   __parainstructions_end[];
 
 /*
  * VMI patching routines.
===
--- a/include/asm-i386/alternative.h
+++ b/include/asm-i386/alternative.h
@@ -114,12 +114,14 @@ static inline void alternatives_smp_swit
 #define LOCK_PREFIX ""
 #endif
 
-struct paravirt_patch;
+struct paravirt_patch_site;
 #ifdef CONFIG_PARAVIRT
-void apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end);
+void apply_paravirt(struct paravirt_patch_site *start,
+   struct paravirt_patch_site *end);
 #else
 static inline void
-apply_paravirt(struct paravirt_patch *start, struct paravirt_patch *end)
+apply_paravirt(struct paravirt_patch_site *start,
+  struct paravirt_patch_site *end)
 {}
 #define __parainstructions NULL
 #define __parainstructions_end NULL
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -502,12 +502,15 @@ void _paravirt_nop(void);
 #define paravirt_nop   ((void *)_paravirt_nop)
 
 /* These all sit in the .parainstructions section to tell us what to patch. */
-struct paravirt_patch {
+struct paravirt_patch_site {
u8 *instr;  /* original instructions */
u8 instrtype;   /* type of this instruction */
u8 len; /* length of original instruction */
u16 clobbers;   /* what registers you may clobber */
 };
+
+extern struct paravirt_patch_site __parainstructions[],
+   __parainstructions_end[];
 
 #define paravirt_alt(insn_string, typenum, clobber)\
"771:\n\t" insn_string "\n" "772:\n"\

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 08/20] add hooks to intercept mm creation and destruction

2007-04-04 Thread Jeremy Fitzhardinge
Add hooks to allow a paravirt implementation to track the lifetime of
an mm.  Paravirtualization requires three hooks, but only two are
needed in common code.  They are:

arch_dup_mmap, which is called when a new mmap is created at fork

arch_exit_mmap, which is called when the last process reference to an
  mm is dropped, which typically happens on exit and exec.

The third hook is activate_mm, which is called from the arch-specific
activate_mm() macro/function, and so doesn't need stub versions for
other architectures.  It's called when an mm is first used.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: linux-arch@vger.kernel.org
Cc: James Bottomley <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |4 
 include/asm-alpha/mmu_context.h |1 +
 include/asm-arm/mmu_context.h   |1 +
 include/asm-arm26/mmu_context.h |2 ++
 include/asm-avr32/mmu_context.h |1 +
 include/asm-cris/mmu_context.h  |2 ++
 include/asm-frv/mmu_context.h   |1 +
 include/asm-generic/mm_hooks.h  |   18 ++
 include/asm-h8300/mmu_context.h |1 +
 include/asm-i386/mmu_context.h  |   17 +++--
 include/asm-i386/paravirt.h |   23 +++
 include/asm-ia64/mmu_context.h  |1 +
 include/asm-m32r/mmu_context.h  |1 +
 include/asm-m68k/mmu_context.h  |1 +
 include/asm-m68knommu/mmu_context.h |1 +
 include/asm-mips/mmu_context.h  |1 +
 include/asm-parisc/mmu_context.h|1 +
 include/asm-powerpc/mmu_context.h   |1 +
 include/asm-ppc/mmu_context.h   |1 +
 include/asm-s390/mmu_context.h  |2 ++
 include/asm-sh/mmu_context.h|1 +
 include/asm-sh64/mmu_context.h  |2 +-
 include/asm-sparc/mmu_context.h |2 ++
 include/asm-sparc64/mmu_context.h   |1 +
 include/asm-um/mmu_context.h|2 ++
 include/asm-v850/mmu_context.h  |2 ++
 include/asm-x86_64/mmu_context.h|1 +
 include/asm-xtensa/mmu_context.h|1 +
 kernel/fork.c   |2 ++
 mm/mmap.c   |4 
 30 files changed, 96 insertions(+), 3 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -520,6 +520,10 @@ struct paravirt_ops paravirt_ops = {
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
 
+   .dup_mmap = paravirt_nop,
+   .exit_mmap = paravirt_nop,
+   .activate_mm = paravirt_nop,
+
.startup_ipi_hook = paravirt_nop,
 };
 
===
--- a/include/asm-alpha/mmu_context.h
+++ b/include/asm-alpha/mmu_context.h
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Force a context reload. This is needed when we change the page
===
--- a/include/asm-arm/mmu_context.h
+++ b/include/asm-arm/mmu_context.h
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void __check_kvm_seq(struct mm_struct *mm);
 
===
--- a/include/asm-arm26/mmu_context.h
+++ b/include/asm-arm26/mmu_context.h
@@ -12,6 +12,8 @@
  */
 #ifndef __ASM_ARM_MMU_CONTEXT_H
 #define __ASM_ARM_MMU_CONTEXT_H
+
+#include 
 
 #define init_new_context(tsk,mm)   0
 #define destroy_context(mm)do { } while(0)
===
--- a/include/asm-avr32/mmu_context.h
+++ b/include/asm-avr32/mmu_context.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * The MMU "context" consists of two things:
===
--- a/include/asm-cris/mmu_context.h
+++ b/include/asm-cris/mmu_context.h
@@ -1,5 +1,7 @@
 #ifndef __CRIS_MMU_CONTEXT_H
 #define __CRIS_MMU_CONTEXT_H
+
+#include 
 
 extern int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
 extern void get_mmu_context(struct mm_struct *mm);
===
--- a/include/asm-frv/mmu_context.h
+++ b/include/asm-frv/mmu_context.h
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
*tsk)
 {
===
--- /dev/null
+++ b/include/asm-generic/mm_hooks.h
@@ -0,0 +1,18 @@
+/*
+ * Define generic no-op hooks for arch_dup_mmap and arch_exit_mmap, to
+ * be included in asm-FOO/mmu_context.h for any arch FOO which doesn't
+ * need to hook these.
+ */
+#ifndef _ASM_GENERIC_MM_HOOKS_H
+#define _ASM_GENERIC_MM_HOOKS_H
+
+static inline void arch_dup_mmap(struct mm_struct *oldmm,
+struct mm_struct *mm)
+{
+}
+
+static inline void arch_exit_

[patch 18/20] clean up tsc-based sched_clock

2007-04-04 Thread Jeremy Fitzhardinge
Three cleanups:
 - change "instable" -> "unstable"
 - its better to use get_cpu_var for getting this cpu's variables
 - change cycles_2_ns to do the full computation rather than just the
   tsc->ns scaling.  Its a simpler interface, and it makes the function
   more generally useful.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 arch/i386/kernel/sched-clock.c |   35 +--
 1 file changed, 21 insertions(+), 14 deletions(-)

===
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -39,17 +39,23 @@
 
 struct sc_data {
unsigned int cyc2ns_scale;
-   unsigned char instable;
+   unsigned char unstable;
unsigned long long last_tsc;
unsigned long long ns_base;
 };
 
 static DEFINE_PER_CPU(struct sc_data, sc_data);
 
-static inline unsigned long long cycles_2_ns(int cpu, unsigned long long cyc)
+static inline unsigned long long cycles_2_ns(unsigned long long cyc)
 {
-   struct sc_data *sc = &per_cpu(sc_data, cpu);
-   return (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+   const struct sc_data *sc = &__get_cpu_var(sc_data);
+   unsigned long long ns;
+
+   cyc -= sc->last_tsc;
+   ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
+   ns += sc->ns_base;
+
+   return ns;
 }
 
 /*
@@ -62,18 +68,19 @@ static inline unsigned long long cycles_
  */
 unsigned long long sched_clock(void)
 {
-   int cpu = get_cpu();
-   struct sc_data *sc = &per_cpu(sc_data, cpu);
unsigned long long r;
+   const struct sc_data *sc = &get_cpu_var(sc_data);
 
-   if (sc->instable) {
+   if (sc->unstable) {
/* TBD find a cheaper fallback timer than this */
r = ktime_to_ns(ktime_get());
} else {
get_scheduled_cycles(r);
-   r = ((u64)sc->ns_base) + cycles_2_ns(cpu, r - sc->last_tsc);
+   r = cycles_2_ns(r);
}
-   put_cpu();
+
+   put_cpu_var(sc_data);
+
return r;
 }
 
@@ -81,7 +88,7 @@ static void resync_sc_freq(struct sc_dat
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
 {
if (!cpu_has_tsc) {
-   sc->instable = 1;
+   sc->unstable = 1;
return;
}
/* RED-PEN protect with seqlock? I hope that's not needed
@@ -90,7 +97,7 @@ static void resync_sc_freq(struct sc_dat
sc->ns_base = ktime_to_ns(ktime_get());
get_scheduled_cycles(sc->last_tsc);
sc->cyc2ns_scale = (100 << CYC2NS_SCALE_FACTOR) / newfreq;
-   sc->instable = 0;
+   sc->unstable = 0;
 }
 
 static void call_r_s_f(void *arg)
@@ -119,9 +126,9 @@ static int sc_freq_event(struct notifier
switch (event) {
case CPUFREQ_RESUMECHANGE:  /* needed? */
case CPUFREQ_PRECHANGE:
-   /* Mark TSC as instable until cpu frequency change is done
+   /* Mark TSC as unstable until cpu frequency change is done
   because we don't know when exactly it will change */
-   sc->instable = 1;
+   sc->unstable = 1;
break;
case CPUFREQ_SUSPENDCHANGE:
case CPUFREQ_POSTCHANGE:
@@ -163,7 +170,7 @@ static __init int init_sched_clock(void)
int i;
struct cpufreq_freqs f = { .cpu = get_cpu(), .new = 0 };
for_each_possible_cpu (i)
-   per_cpu(sc_data, i).instable = 1;
+   per_cpu(sc_data, i).unstable = 1;
WARN_ON(num_online_cpus() > 1);
call_r_s_f(&f);
put_cpu();

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 16/20] revert map_pt_hook.

2007-04-04 Thread Jeremy Fitzhardinge
Back out the map_pt_hook to clear the way for kmap_atomic_pte.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |2 --
 arch/i386/kernel/vmi.c  |2 ++
 include/asm-i386/paravirt.h |7 ---
 include/asm-i386/pgtable.h  |   23 ---
 4 files changed, 6 insertions(+), 28 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -303,8 +303,6 @@ struct paravirt_ops paravirt_ops = {
.flush_tlb_single = native_flush_tlb_single,
.flush_tlb_others = native_flush_tlb_others,
 
-   .map_pt_hook = paravirt_nop,
-
.alloc_pt = paravirt_nop,
.alloc_pd = paravirt_nop,
.alloc_pd_clone = paravirt_nop,
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -819,8 +819,10 @@ static inline int __init activate_vmi(vo
paravirt_ops.release_pt = vmi_release_pt;
paravirt_ops.release_pd = vmi_release_pd;
}
+#if 0
para_wrap(map_pt_hook, vmi_map_pt_hook, set_linear_mapping,
  SetLinearMapping);
+#endif
 
/*
 * These MUST always be patched.  Don't support indirect jumps
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -128,8 +128,6 @@ struct paravirt_ops
void (*flush_tlb_single)(unsigned long addr);
void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
 unsigned long va);
-
-   void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
void (*alloc_pt)(u32 pfn);
void (*alloc_pd)(u32 pfn);
@@ -740,11 +738,6 @@ static inline void flush_tlb_others(cpum
PVOP_VCALL3(flush_tlb_others, &cpumask, mm, va);
 }
 
-static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
-{
-   PVOP_VCALL3(map_pt_hook, type, va, pfn);
-}
-
 static inline void paravirt_alloc_pt(unsigned pfn)
 {
PVOP_VCALL1(alloc_pt, pfn);
===
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -272,7 +272,6 @@ static inline void vmalloc_sync_all(void
  */
 #define pte_update(mm, addr, ptep) do { } while (0)
 #define pte_update_defer(mm, addr, ptep)   do { } while (0)
-#define paravirt_map_pt_hook(slot, va, pfn)do { } while (0)
 
 #define raw_ptep_get_and_clear(xp) native_ptep_get_and_clear(xp)
 #endif
@@ -481,24 +480,10 @@ extern pte_t *lookup_address(unsigned lo
 #endif
 
 #if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address)   \
-({ \
-   pte_t *__ptep;  \
-   unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;   \
-   __ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE0);\
-   paravirt_map_pt_hook(KM_PTE0,__ptep, pfn);  \
-   __ptep = __ptep + pte_index(address);   \
-   __ptep; \
-})
-#define pte_offset_map_nested(dir, address)\
-({ \
-   pte_t *__ptep;  \
-   unsigned pfn = pmd_val(*(dir)) >> PAGE_SHIFT;   \
-   __ptep = (pte_t *)kmap_atomic(pfn_to_page(pfn),KM_PTE1);\
-   paravirt_map_pt_hook(KM_PTE1,__ptep, pfn);  \
-   __ptep = __ptep + pte_index(address);   \
-   __ptep; \
-})
+#define pte_offset_map(dir, address) \
+   ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
+#define pte_offset_map_nested(dir, address) \
+   ((pte_t *)kmap_atomic(pmd_page(*(dir)),KM_PTE1) + pte_index(address))
 #define pte_unmap(pte) kunmap_atomic(pte, KM_PTE0)
 #define pte_unmap_nested(pte) kunmap_atomic(pte, KM_PTE1)
 #else

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 06/20] Allocate a fixmap slot

2007-04-04 Thread Jeremy Fitzhardinge
Allocate a fixmap slot for use by a paravirt_ops implementation.  This
is intended for early-boot bootstrap mappings.  Once the zones and
allocator have been set up, it would be better to use get_vm_area() to
allocate some virtual space.

Xen uses this to map the hypervisor's shared info page, which doesn't
have a pseudo-physical page number, and therefore can't be mapped
ordinarily.  It is needed early because it contains the vcpu state,
including the interrupt mask.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 include/asm-i386/fixmap.h |3 +++
 1 file changed, 3 insertions(+)

===
--- a/include/asm-i386/fixmap.h
+++ b/include/asm-i386/fixmap.h
@@ -86,6 +86,9 @@ enum fixed_addresses {
 #ifdef CONFIG_PCI_MMCONFIG
FIX_PCIE_MCFG,
 #endif
+#ifdef CONFIG_PARAVIRT
+   FIX_PARAVIRT_BOOTMAP,
+#endif
__end_of_permanent_fixed_addresses,
/* temporary boot-time mappings, used before ioremap() is functional */
 #define NR_FIX_BTMAPS  16

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 15/20] add flush_tlb_others paravirt_op

2007-04-04 Thread Jeremy Fitzhardinge
This patch adds a pv_op for flush_tlb_others.  Linux running on native
hardware uses cross-CPU IPIs to flush the TLB on any CPU which may
have a particular mm's pagetable entries cached in its TLB.  This is
inefficient in a paravirtualized environment, since the hypervisor
knows which real CPUs actually contain cached mappings, which may be a
small subset of a guest's VCPUs.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |1 +
 arch/i386/kernel/smp.c  |   15 ---
 include/asm-i386/paravirt.h |9 +
 include/asm-i386/tlbflush.h |   19 +--
 4 files changed, 35 insertions(+), 9 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -301,6 +301,7 @@ struct paravirt_ops paravirt_ops = {
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
+   .flush_tlb_others = native_flush_tlb_others,
 
.map_pt_hook = paravirt_nop,
 
===
--- a/arch/i386/kernel/smp.c
+++ b/arch/i386/kernel/smp.c
@@ -256,7 +256,6 @@ static struct mm_struct * flush_mm;
 static struct mm_struct * flush_mm;
 static unsigned long flush_va;
 static DEFINE_SPINLOCK(tlbstate_lock);
-#define FLUSH_ALL  0x
 
 /*
  * We cannot call mmdrop() because we are in interrupt context, 
@@ -338,7 +337,7 @@ fastcall void smp_invalidate_interrupt(s
 
if (flush_mm == per_cpu(cpu_tlbstate, cpu).active_mm) {
if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK) {
-   if (flush_va == FLUSH_ALL)
+   if (flush_va == TLB_FLUSH_ALL)
local_flush_tlb();
else
__flush_tlb_one(flush_va);
@@ -353,9 +352,11 @@ out:
put_cpu_no_resched();
 }
 
-static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
-   unsigned long va)
-{
+void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
+unsigned long va)
+{
+   cpumask_t cpumask = *cpumaskp;
+
/*
 * A couple of (to be removed) sanity checks:
 *
@@ -417,7 +418,7 @@ void flush_tlb_current_task(void)
 
local_flush_tlb();
if (!cpus_empty(cpu_mask))
-   flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
+   flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
preempt_enable();
 }
 
@@ -436,7 +437,7 @@ void flush_tlb_mm (struct mm_struct * mm
leave_mm(smp_processor_id());
}
if (!cpus_empty(cpu_mask))
-   flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
+   flush_tlb_others(cpu_mask, mm, TLB_FLUSH_ALL);
 
preempt_enable();
 }
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -15,6 +15,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
+#include 
 
 struct thread_struct;
 struct Xgt_desc_struct;
@@ -125,6 +126,8 @@ struct paravirt_ops
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(unsigned long addr);
+   void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
+unsigned long va);
 
void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
@@ -731,6 +734,12 @@ static inline void __flush_tlb_single(un
PVOP_VCALL1(flush_tlb_single, addr);
 }
 
+static inline void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
+   unsigned long va)
+{
+   PVOP_VCALL3(flush_tlb_others, &cpumask, mm, va);
+}
+
 static inline void paravirt_map_pt_hook(int type, pte_t *va, u32 pfn)
 {
PVOP_VCALL3(map_pt_hook, type, va, pfn);
===
--- a/include/asm-i386/tlbflush.h
+++ b/include/asm-i386/tlbflush.h
@@ -79,10 +79,14 @@
  *  - flush_tlb_range(vma, start, end) flushes a range of pages
  *  - flush_tlb_kernel_range(start, end) flushes a range of kernel pages
  *  - flush_tlb_pgtables(mm, start, end) flushes a range of page tables
+ *  - flush_tlb_others(cpumask, mm, va) flushes a TLBs on other cpus
  *
  * ..but the i386 has somewhat limited tlb flushing capabilities,
  * and page-granular flushes are available only on i486 and up.
  */
+
+#define TLB_FLUSH_ALL  0x
+
 
 #ifndef CONFIG_SMP
 
@@ -110,7 +114,12 @@ static inline void flush_tlb_range(struc
__flush_tlb();
 }
 
-#else
+static inline void native_flush_tlb_others(const cpumask_t *cpumask,
+  struct mm_struct *mm, unsigned long 
va)
+{
+}
+
+#else  /* SMP */
 
 #include 
 
@@

[patch 17/20] add kmap_atomic_pte for mapping highpte pages

2007-04-04 Thread Jeremy Fitzhardinge
Xen and VMI both have special requirements when mapping a highmem pte
page into the kernel address space.  These can be dealt with by adding
a new kmap_atomic_pte() function for mapping highptes, and hooking it
into the paravirt_ops infrastructure.

Xen specifically wants to map the pte page RO, so this patch exposes a
helper function, kmap_atomic_prot, which maps the page with the
specified page protections.

This also adds a kmap_flush_unused() function to clear out the cached
kmap mappings.  Xen needs this to clear out any potential stray RW
mappings of pages which will become part of a pagetable.

[ Zach - vmi.c will need some attention after this patch.  It wasn't
  immediately obvious to me what needs to be done. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |7 +++
 arch/i386/mm/highmem.c  |9 +++--
 include/asm-i386/highmem.h  |   11 +++
 include/asm-i386/paravirt.h |   13 -
 include/asm-i386/pgtable.h  |4 ++--
 include/linux/highmem.h |6 ++
 mm/highmem.c|9 +
 7 files changed, 54 insertions(+), 5 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -318,6 +319,12 @@ struct paravirt_ops paravirt_ops = {
 
.ptep_get_and_clear = native_ptep_get_and_clear,
 
+#ifdef CONFIG_HIGHPTE
+   .kmap_atomic_pte = native_kmap_atomic_pte,
+#else
+   .kmap_atomic_pte = paravirt_nop,
+#endif
+
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
===
--- a/arch/i386/mm/highmem.c
+++ b/arch/i386/mm/highmem.c
@@ -26,7 +26,7 @@ void kunmap(struct page *page)
  * However when holding an atomic kmap is is not legal to sleep, so atomic
  * kmaps are appropriate for short, tight code paths only.
  */
-void *kmap_atomic(struct page *page, enum km_type type)
+void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot)
 {
enum fixed_addresses idx;
unsigned long vaddr;
@@ -41,9 +41,14 @@ void *kmap_atomic(struct page *page, enu
return page_address(page);
 
vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
-   set_pte(kmap_pte-idx, mk_pte(page, kmap_prot));
+   set_pte(kmap_pte-idx, mk_pte(page, prot));
 
return (void*) vaddr;
+}
+
+void *kmap_atomic(struct page *page, enum km_type type)
+{
+   return kmap_atomic_prot(page, type, kmap_prot);
 }
 
 void kunmap_atomic(void *kvaddr, enum km_type type)
===
--- a/include/asm-i386/highmem.h
+++ b/include/asm-i386/highmem.h
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /* declarations for highmem.c */
 extern unsigned long highstart_pfn, highend_pfn;
@@ -67,10 +68,20 @@ extern void FASTCALL(kunmap_high(struct 
 
 void *kmap(struct page *page);
 void kunmap(struct page *page);
+void *kmap_atomic_prot(struct page *page, enum km_type type, pgprot_t prot);
 void *kmap_atomic(struct page *page, enum km_type type);
 void kunmap_atomic(void *kvaddr, enum km_type type);
 void *kmap_atomic_pfn(unsigned long pfn, enum km_type type);
 struct page *kmap_atomic_to_page(void *ptr);
+
+static inline void *native_kmap_atomic_pte(struct page *page, enum km_type 
type)
+{
+   return kmap_atomic(page, type);
+}
+
+#ifndef CONFIG_PARAVIRT
+#define kmap_atomic_pte(page, type)kmap_atomic(page, type)
+#endif
 
 #define flush_cache_kmaps()do { } while (0)
 
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -16,7 +16,9 @@
 #ifndef __ASSEMBLY__
 #include 
 #include 
-
+#include 
+
+struct page;
 struct thread_struct;
 struct Xgt_desc_struct;
 struct tss_struct;
@@ -143,6 +145,8 @@ struct paravirt_ops
void (*pte_update_defer)(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep);
 
pte_t (*ptep_get_and_clear)(pte_t *ptep);
+
+   void *(*kmap_atomic_pte)(struct page *page, enum km_type type);
 
 #ifdef CONFIG_X86_PAE
void (*set_pte_atomic)(pte_t *ptep, pte_t pteval);
@@ -768,6 +772,13 @@ static inline void paravirt_release_pd(u
PVOP_VCALL1(release_pd, pfn);
 }
 
+static inline void *kmap_atomic_pte(struct page *page, enum km_type type)
+{
+   unsigned long ret;
+   ret = PVOP_CALL2(unsigned long, kmap_atomic_pte, page, type);
+   return (void *)ret;
+}
+
 static inline void pte_update(struct mm_struct *mm, unsigned long addr,
  pte_t *ptep)
 {
===
--- a/include/asm-i386/pgtable.h
+++ b/include/asm-i386/pgtable.h
@@ -481,9 +481,9 @

[patch 01/20] update MAINTAINERS

2007-04-04 Thread Jeremy Fitzhardinge
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Chris Wright <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
---
 MAINTAINERS |   22 ++
 1 file changed, 22 insertions(+)

===
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2583,6 +2583,19 @@ T:   cvs cvs.parisc-linux.org:/var/cvs/lin
 T: cvs cvs.parisc-linux.org:/var/cvs/linux-2.6
 S: Maintained
 
+PARAVIRT_OPS INTERFACE
+P: Jeremy Fitzhardinge
+M: [EMAIL PROTECTED]
+P: Chris Wright
+M: [EMAIL PROTECTED]
+P: Zachary Amsden
+M: [EMAIL PROTECTED]
+P: Rusty Russell
+M: [EMAIL PROTECTED]
+L: virtualization@lists.osdl.org
+L: linux-kernel@vger.kernel.org
+S: Supported
+
 PC87360 HARDWARE MONITORING DRIVER
 P: Jim Cromie
 M: [EMAIL PROTECTED]
@@ -3780,6 +3793,15 @@ L:   linux-x25@vger.kernel.org
 L: linux-x25@vger.kernel.org
 S: Maintained
 
+XEN HYPERVISOR INTERFACE
+P: Jeremy Fitzhardinge
+M: [EMAIL PROTECTED]
+P: Chris Wright
+M: [EMAIL PROTECTED]
+L: virtualization@lists.osdl.org
+L: [EMAIL PROTECTED]
+S: Supported
+
 XFS FILESYSTEM
 P: Silicon Graphics Inc
 P: Tim Shimmin, David Chatterton

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 03/20] use paravirt_nop to consistently mark no-op operations

2007-04-04 Thread Jeremy Fitzhardinge
Add a _paravirt_nop function for use as a stub for no-op operations,
and paravirt_nop #defined void * version to make using it easier
(since all its uses are as a void *).

This is useful to allow the patcher to automatically identify noop
operations so it can simply nop out the callsite.


Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
[mingo] but only as a cleanup of the current open-coded (void *) casts.
My problem with this is that it loses the types. Not that there is much
to check for, but still, this adds some assumptions about how function
calls look like

---
 arch/i386/kernel/paravirt.c |   26 +-
 include/asm-i386/paravirt.h |3 +++
 2 files changed, 16 insertions(+), 13 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -35,7 +35,7 @@
 #include 
 
 /* nop stub */
-static void native_nop(void)
+void _paravirt_nop(void)
 {
 }
 
@@ -490,7 +490,7 @@ struct paravirt_ops paravirt_ops = {
 
.patch = native_patch,
.banner = default_banner,
-   .arch_setup = native_nop,
+   .arch_setup = paravirt_nop,
.memory_setup = machine_specific_memory_setup,
.get_wallclock = native_get_wallclock,
.set_wallclock = native_set_wallclock,
@@ -546,25 +546,25 @@ struct paravirt_ops paravirt_ops = {
.setup_boot_clock = setup_boot_APIC_clock,
.setup_secondary_clock = setup_secondary_APIC_clock,
 #endif
-   .set_lazy_mode = (void *)native_nop,
+   .set_lazy_mode = paravirt_nop,
 
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
 
-   .map_pt_hook = (void *)native_nop,
-
-   .alloc_pt = (void *)native_nop,
-   .alloc_pd = (void *)native_nop,
-   .alloc_pd_clone = (void *)native_nop,
-   .release_pt = (void *)native_nop,
-   .release_pd = (void *)native_nop,
+   .map_pt_hook = paravirt_nop,
+
+   .alloc_pt = paravirt_nop,
+   .alloc_pd = paravirt_nop,
+   .alloc_pd_clone = paravirt_nop,
+   .release_pt = paravirt_nop,
+   .release_pd = paravirt_nop,
 
.set_pte = native_set_pte,
.set_pte_at = native_set_pte_at,
.set_pmd = native_set_pmd,
-   .pte_update = (void *)native_nop,
-   .pte_update_defer = (void *)native_nop,
+   .pte_update = paravirt_nop,
+   .pte_update_defer = paravirt_nop,
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
@@ -576,7 +576,7 @@ struct paravirt_ops paravirt_ops = {
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
 
-   .startup_ipi_hook = (void *)native_nop,
+   .startup_ipi_hook = paravirt_nop,
 };
 
 /*
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -430,6 +430,9 @@ static inline void pmd_clear(pmd_t *pmdp
 #define arch_enter_lazy_mmu_mode() 
paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_MMU)
 #define arch_leave_lazy_mmu_mode() 
paravirt_ops.set_lazy_mode(PARAVIRT_LAZY_NONE)
 
+void _paravirt_nop(void);
+#define paravirt_nop   ((void *)_paravirt_nop)
+
 /* These all sit in the .parainstructions section to tell us what to patch. */
 struct paravirt_patch {
u8 *instr;  /* original instructions */

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 19/20] Add a sched_clock paravirt_op

2007-04-04 Thread Jeremy Fitzhardinge
The tsc-based get_scheduled_cycles interface is not a good match for
Xen's runstate accounting, which reports everything in nanoseconds.

This patch replaces this interface with a sched_clock interface, which
matches both Xen and VMI's requirements.

In order to do this, we:
   1. replace get_scheduled_cycles with sched_clock
   2. hoist cycles_2_ns into a common header
   3. update vmi accordingly

One thing to note: because sched_clock is implemented as a weak
function in kernel/sched.c, we must define a real function in order to
override this weak binding.  This means the usual paravirt_ops
technique of using an inline function won't work in this case.


Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Dan Hecht <[EMAIL PROTECTED]>
Cc: john stultz <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c|2 -
 arch/i386/kernel/sched-clock.c |   39 ++
 arch/i386/kernel/vmi.c |2 -
 arch/i386/kernel/vmitime.c |6 ++---
 include/asm-i386/paravirt.h|7 --
 include/asm-i386/timer.h   |   45 +++-
 include/asm-i386/vmi_time.h|2 -
 7 files changed, 71 insertions(+), 32 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -269,7 +269,7 @@ struct paravirt_ops paravirt_ops = {
.write_msr = native_write_msr_safe,
.read_tsc = native_read_tsc,
.read_pmc = native_read_pmc,
-   .get_scheduled_cycles = native_read_tsc,
+   .sched_clock = native_sched_clock,
.get_cpu_khz = native_calculate_cpu_khz,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
===
--- a/arch/i386/kernel/sched-clock.c
+++ b/arch/i386/kernel/sched-clock.c
@@ -37,26 +37,7 @@
 
 #define CYC2NS_SCALE_FACTOR 10 /* 2^10, carefully chosen */
 
-struct sc_data {
-   unsigned int cyc2ns_scale;
-   unsigned char unstable;
-   unsigned long long last_tsc;
-   unsigned long long ns_base;
-};
-
-static DEFINE_PER_CPU(struct sc_data, sc_data);
-
-static inline unsigned long long cycles_2_ns(unsigned long long cyc)
-{
-   const struct sc_data *sc = &__get_cpu_var(sc_data);
-   unsigned long long ns;
-
-   cyc -= sc->last_tsc;
-   ns = (cyc * sc->cyc2ns_scale) >> CYC2NS_SCALE_FACTOR;
-   ns += sc->ns_base;
-
-   return ns;
-}
+DEFINE_PER_CPU(struct sc_data, sc_data);
 
 /*
  * Scheduler clock - returns current time in nanosec units.
@@ -66,7 +47,7 @@ static inline unsigned long long cycles_
  * [1] no attempt to stop CPU instruction reordering, which can hit
  * in a 100 instruction window or so.
  */
-unsigned long long sched_clock(void)
+unsigned long long native_sched_clock(void)
 {
unsigned long long r;
const struct sc_data *sc = &get_cpu_var(sc_data);
@@ -75,7 +56,7 @@ unsigned long long sched_clock(void)
/* TBD find a cheaper fallback timer than this */
r = ktime_to_ns(ktime_get());
} else {
-   get_scheduled_cycles(r);
+   rdtscll(r);
r = cycles_2_ns(r);
}
 
@@ -83,6 +64,18 @@ unsigned long long sched_clock(void)
 
return r;
 }
+
+/* We need to define a real function for sched_clock, to override the
+   weak default version */
+#ifdef CONFIG_PARAVIRT
+unsigned long long sched_clock(void)
+{
+   return paravirt_sched_clock();
+}
+#else
+unsigned long long sched_clock(void)
+   __attribute__((alias("native_sched_clock")));
+#endif
 
 /* Resync with new CPU frequency */
 static void resync_sc_freq(struct sc_data *sc, unsigned int newfreq)
@@ -95,7 +88,7 @@ static void resync_sc_freq(struct sc_dat
   because sched_clock callers should be able to tolerate small
   errors. */
sc->ns_base = ktime_to_ns(ktime_get());
-   get_scheduled_cycles(sc->last_tsc);
+   rdtscll(sc->last_tsc);
sc->cyc2ns_scale = (100 << CYC2NS_SCALE_FACTOR) / newfreq;
sc->unstable = 0;
 }
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -866,7 +866,7 @@ static inline int __init activate_vmi(vo
paravirt_ops.setup_boot_clock = vmi_timer_setup_boot_alarm;
paravirt_ops.setup_secondary_clock = 
vmi_timer_setup_secondary_alarm;
 #endif
-   paravirt_ops.get_scheduled_cycles = vmi_get_sched_cycles;
+   paravirt_ops.sched_clock = vmi_sched_clock;
paravirt_ops.get_cpu_khz = vmi_cpu_khz;
 
/* We have true wallclock functions; disable CMOS clock sync */
===
--- a/arch/i386/kernel/vmitime.c
+++ b/arch/i386/kernel/vmitime.c
@@ -163,9 +163,9 @@ int vmi_set_wallclock(unsigned long now)
  

[patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Jeremy Fitzhardinge
Normally when running in PAE mode, the 4th PMD maps the kernel address
space, which can be shared among all processes (since they all need
the same kernel mappings).

Xen, however, does not allow guests to have the kernel pmd shared
between page tables, so parameterize pgtable.c to allow both modes of
operation.

There are several side-effects of this.  One is that vmalloc will
update the kernel address space mappings, and those updates need to be
propagated into all processes if the kernel mappings are not
intrinsically shared.  In the non-PAE case, this is done by
maintaining a pgd_list of all processes; this list is used when all
process pagetables must be updated.  pgd_list is threaded via
otherwise unused entries in the page structure for the pgd, which
means that the pgd must be page-sized for this to work.

Normally the PAE pgd is only 4x64 byte entries large, but Xen requires
the PAE pgd to page aligned anyway, so this patch forces the pgd to be
page aligned+sized when the kernel pmd is unshared, to accomodate both
these requirements.

Also, since there may be several distinct kernel pmds (if the
user/kernel split is below 3G), there's no point in allocating them
from a slab cache; they're just allocated with get_free_page and
initialized appropriately.  (Of course the could be cached if there is
just a single kernel pmd - which is the default with a 3G user/kernel
split - but it doesn't seem worthwhile to add yet another case into
this code).

[ Many thanks to wli for review comments. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Signed-off-by: William Lee Irwin III <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/i386/kernel/paravirt.c|1 
 arch/i386/mm/fault.c   |6 +-
 arch/i386/mm/init.c|   18 +-
 arch/i386/mm/pageattr.c|2 
 arch/i386/mm/pgtable.c |   84 ++--
 include/asm-i386/paravirt.h|1 
 include/asm-i386/pgtable-2level-defs.h |2 
 include/asm-i386/pgtable-2level.h  |2 
 include/asm-i386/pgtable-3level-defs.h |6 ++
 include/asm-i386/pgtable-3level.h  |2 
 include/asm-i386/pgtable.h |7 ++
 11 files changed, 105 insertions(+), 26 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -604,6 +604,7 @@ struct paravirt_ops paravirt_ops = {
.name = "bare hardware",
.paravirt_enabled = 0,
.kernel_rpl = 0,
+   .shared_kernel_pmd = 1, /* Only used when CONFIG_X86_PAE is set */
 
.patch = native_patch,
.banner = default_banner,
===
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -588,8 +588,7 @@ do_sigbus:
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
 }
 
-#ifndef CONFIG_X86_PAE
-void vmalloc_sync_all(void)
+void _vmalloc_sync_all(void)
 {
/*
 * Note that races in the updates of insync and start aren't
@@ -600,6 +599,8 @@ void vmalloc_sync_all(void)
static DECLARE_BITMAP(insync, PTRS_PER_PGD);
static unsigned long start = TASK_SIZE;
unsigned long address;
+
+   BUG_ON(SHARED_KERNEL_PMD);
 
BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
@@ -623,4 +624,3 @@ void vmalloc_sync_all(void)
start = address + PGDIR_SIZE;
}
 }
-#endif
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -715,6 +715,8 @@ struct kmem_cache *pmd_cache;
 
 void __init pgtable_cache_init(void)
 {
+   size_t pgd_size = PTRS_PER_PGD*sizeof(pgd_t);
+
if (PTRS_PER_PMD > 1) {
pmd_cache = kmem_cache_create("pmd",
PTRS_PER_PMD*sizeof(pmd_t),
@@ -724,13 +726,23 @@ void __init pgtable_cache_init(void)
NULL);
if (!pmd_cache)
panic("pgtable_cache_init(): cannot create pmd cache");
+
+   if (!SHARED_KERNEL_PMD) {
+   /* If we're in PAE mode and have a non-shared
+  kernel pmd, then the pgd size must be a
+  page size.  This is because the pgd_list
+  links through the page structure, so there
+  can only be one pgd per page for this to
+  work. */
+   pgd_size = PAGE_SIZE;
+   }
}
pgd_cache = kmem_cache_create("pgd",
-   PTRS_PER_PGD*sizeof(pgd_t),
-   PTRS_PER_PGD*sizeof(pgd_t),
+  

[patch 13/20] Document asm-i386/paravirt.h

2007-04-04 Thread Jeremy Fitzhardinge
Clean things up, and broadly document:
 - the paravirt_ops functions themselves
 - the patching mechanism

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
 
---
 include/asm-i386/paravirt.h |  140 +--
 1 file changed, 123 insertions(+), 17 deletions(-)

===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -21,6 +21,14 @@ struct tss_struct;
 struct tss_struct;
 struct mm_struct;
 struct desc_struct;
+
+/* Lazy mode for batching updates / context switch */
+enum paravirt_lazy_mode {
+   PARAVIRT_LAZY_NONE = 0,
+   PARAVIRT_LAZY_MMU = 1,
+   PARAVIRT_LAZY_CPU = 2,
+};
+
 struct paravirt_ops
 {
unsigned int kernel_rpl;
@@ -37,22 +45,33 @@ struct paravirt_ops
 */
unsigned (*patch)(u8 type, u16 clobber, void *firstinsn, unsigned len);
 
+   /* Basic arch-specific setup */
void (*arch_setup)(void);
char *(*memory_setup)(void);
void (*init_IRQ)(void);
-
+   void (*time_init)(void);
+
+   /*
+* Called before/after init_mm pagetable setup. setup_start
+* may reset %cr3, and may pre-install parts of the pagetable;
+* pagetable setup is expected to preserve any existing
+* mapping.
+*/
void (*pagetable_setup_start)(pgd_t *pgd_base);
void (*pagetable_setup_done)(pgd_t *pgd_base);
 
+   /* Print a banner to identify the environment */
void (*banner)(void);
 
+   /* Set and set time of day */
unsigned long (*get_wallclock)(void);
int (*set_wallclock)(unsigned long);
-   void (*time_init)(void);
-
+
+   /* cpuid emulation, mostly so that caps bits can be disabled */
void (*cpuid)(unsigned int *eax, unsigned int *ebx,
  unsigned int *ecx, unsigned int *edx);
 
+   /* hooks for various privileged instructions */
unsigned long (*get_debugreg)(int regno);
void (*set_debugreg)(int regno, unsigned long value);
 
@@ -71,15 +90,23 @@ struct paravirt_ops
unsigned long (*read_cr4)(void);
void (*write_cr4)(unsigned long);
 
+   /*
+* Get/set interrupt state.  save_fl and restore_fl are only
+* expected to use X86_EFLAGS_IF; all other bits
+* returned from save_fl are undefined, and may be ignored by
+* restore_fl.
+*/
unsigned long (*save_fl)(void);
void (*restore_fl)(unsigned long);
void (*irq_disable)(void);
void (*irq_enable)(void);
void (*safe_halt)(void);
void (*halt)(void);
+
void (*wbinvd)(void);
 
-   /* err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
+   /* MSR, PMC and TSR operations.
+  err = 0/-EFAULT.  wrmsr returns 0/-EFAULT. */
u64 (*read_msr)(unsigned int msr, int *err);
int (*write_msr)(unsigned int msr, u64 val);
 
@@ -88,6 +115,7 @@ struct paravirt_ops
u64 (*get_scheduled_cycles)(void);
unsigned long (*get_cpu_khz)(void);
 
+   /* Segment descriptor handling */
void (*load_tr_desc)(void);
void (*load_gdt)(const struct Xgt_desc_struct *);
void (*load_idt)(const struct Xgt_desc_struct *);
@@ -105,9 +133,12 @@ struct paravirt_ops
void (*load_esp0)(struct tss_struct *tss, struct thread_struct *t);
 
void (*set_iopl_mask)(unsigned mask);
-
void (*io_delay)(void);
 
+   /*
+* Hooks for intercepting the creation/use/destruction of an
+* mm_struct.
+*/
void (*activate_mm)(struct mm_struct *prev,
struct mm_struct *next);
void (*dup_mmap)(struct mm_struct *oldmm,
@@ -115,30 +146,43 @@ struct paravirt_ops
void (*exit_mmap)(struct mm_struct *mm);
 
 #ifdef CONFIG_X86_LOCAL_APIC
+   /*
+* Direct APIC operations, principally for VMI.  Ideally
+* these shouldn't be in this interface.
+*/
void (*apic_write)(unsigned long reg, unsigned long v);
void (*apic_write_atomic)(unsigned long reg, unsigned long v);
unsigned long (*apic_read)(unsigned long reg);
void (*setup_boot_clock)(void);
void (*setup_secondary_clock)(void);
+
+   void (*startup_ipi_hook)(int phys_apicid,
+unsigned long start_eip,
+unsigned long start_esp);
 #endif
 
+   /* TLB operations */
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_single)(unsigned long addr);
 
void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
+   /* Hooks for allocating/releasing pagetable pages */
void (*alloc_pt)(u32 pfn);
void (*alloc_pd)(u32 pfn);
void (*alloc_pd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count);
void (*release_pt)(u32 pfn);
void (*release_pd)(u32 pfn);
 
+ 

[patch 12/20] Consistently wrap paravirt ops callsites to make them patchable

2007-04-04 Thread Jeremy Fitzhardinge
Wrap a set of interesting paravirt_ops calls in a wrapper which makes
the callsites available for patching.  Unfortunately this is pretty
ugly because there's no way to get gcc to generate a function call,
but also wrap just the callsite itself with the necessary labels.

This patch supports functions with 0-4 arguments, and either void or
returning a value.  64-bit arguments must be split into a pair of
32-bit arguments (lower word first).  Small structures are returned in
registers.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>

---
 include/asm-i386/paravirt.h |  715 ++-
 1 file changed, 569 insertions(+), 146 deletions(-)

===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -124,7 +124,7 @@ struct paravirt_ops
 
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
-   void (*flush_tlb_single)(u32 addr);
+   void (*flush_tlb_single)(unsigned long addr);
 
void (*map_pt_hook)(int type, pte_t *va, u32 pfn);
 
@@ -188,7 +188,7 @@ extern struct paravirt_ops paravirt_ops;
 #define paravirt_clobber(clobber)  \
[paravirt_clobber] "i" (clobber)
 
-#define PARAVIRT_CALL  "call *paravirt_ops+%c[paravirt_typenum]*4;"
+#define PARAVIRT_CALL  "call *(paravirt_ops+%c[paravirt_typenum]*4);"
 
 #define _paravirt_alt(insn_string, type, clobber)  \
"771:\n\t" insn_string "\n" "772:\n"\
@@ -199,26 +199,234 @@ extern struct paravirt_ops paravirt_ops;
"  .short " clobber "\n"\
".popsection\n"
 
-#define paravirt_alt(insn_string)  \
+#define paravirt_alt(insn_string)  \
_paravirt_alt(insn_string, "%c[paravirt_typenum]", 
"%c[paravirt_clobber]")
 
-#define paravirt_enabled() (paravirt_ops.paravirt_enabled)
+#define PVOP_CALL0(__rettype, __op)\
+   ({  \
+   __rettype __ret;\
+   if (sizeof(__rettype) > sizeof(unsigned long)) {\
+   unsigned long long __tmp;   \
+   unsigned long __ecx;\
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=A" (__tmp), "=c" (__ecx)   \
+: paravirt_type(__op), \
+  paravirt_clobber(CLBR_ANY)   \
+: "memory", "cc"); \
+   __ret = (__rettype)__tmp;   \
+   } else {\
+   unsigned long __tmp, __edx, __ecx;  \
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=a" (__tmp), "=d" (__edx),  \
+  "=c" (__ecx) \
+: paravirt_type(__op), \
+  paravirt_clobber(CLBR_ANY)   \
+: "memory", "cc"); \
+   __ret = (__rettype)__tmp;   \
+   }   \
+   __ret;  \
+   })
+#define PVOP_VCALL0(__op)  \
+   ({  \
+   unsigned long __eax, __edx, __ecx;  \
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=a" (__eax), "=d" (__edx), "=c" (__ecx) \
+: paravirt_type(__op), \
+  paravirt_clobber(CLBR_ANY)   \
+: "memory", "cc"); \
+   })
+
+#define PVOP_CALL1(__rettype, __op, arg1)  \
+   ({  \
+   __rettype __ret;\
+   if (sizeof(__rettype) > sizeof(unsigned long)) {\
+   unsigned long long __tmp;   \
+   unsigned long __ecx;\
+   asm volatile(paravirt_alt(PARAVIRT_CALL)\
+: "=A" (__tmp), "=c" (__ecx)   \
+   

[patch 02/20] Remove CONFIG_DEBUG_PARAVIRT

2007-04-04 Thread Jeremy Fitzhardinge
Remove CONFIG_DEBUG_PARAVIRT.  When inlining code, this option
attempts to trash registers in the patch-site's "clobber" field, on
the grounds that this should find bugs with incorrect clobbers.
Unfortunately, the clobber field really means "registers modified by
this patch site", which includes return values.

Because of this, this option has outlived its usefulness, so remove
it.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>

---
 arch/i386/Kconfig.debug|   10 --
 arch/i386/kernel/alternative.c |   14 +-
 2 files changed, 1 insertion(+), 23 deletions(-)

===
--- a/arch/i386/Kconfig.debug
+++ b/arch/i386/Kconfig.debug
@@ -85,14 +85,4 @@ config DOUBLEFAULT
   option saves about 4k and might cause you much additional grey
   hair.
 
-config DEBUG_PARAVIRT
-   bool "Enable some paravirtualization debugging"
-   default n
-   depends on PARAVIRT && DEBUG_KERNEL
-   help
- Currently deliberately clobbers regs which are allowed to be
- clobbered in inlined paravirt hooks, even in native mode.
- If turning this off solves a problem, then DISABLE_INTERRUPTS() or
- ENABLE_INTERRUPTS() is lying about what registers can be clobbered.
-
 endmenu
===
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -359,19 +359,7 @@ void apply_paravirt(struct paravirt_patc
 
used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
  p->len);
-#ifdef CONFIG_DEBUG_PARAVIRT
-   {
-   int i;
-   /* Deliberately clobber regs using "not %reg" to find bugs. */
-   for (i = 0; i < 3; i++) {
-   if (p->len - used >= 2 && (p->clobbers & (1 << i))) {
-   memcpy(p->instr + used, "\xf7\xd0", 2);
-   p->instr[used+1] |= i;
-   used += 2;
-   }
-   }
-   }
-#endif
+
/* Pad the rest with nops */
nop_out(p->instr + used, p->len - used);
}

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 11/20] Fix patch site clobbers to include return register

2007-04-04 Thread Jeremy Fitzhardinge
Fix a few clobbers to include the return register.  The clobbers set
is the set of all registers modified (or may be modified) by the code
snippet, regardless of whether it was deliberate or accidental.

Also, make sure that callsites which are used in contexts which don't
allow clobbers actually save and restore all clobberable registers.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/entry.S|2 +-
 include/asm-i386/paravirt.h |   18 ++
 2 files changed, 11 insertions(+), 9 deletions(-)

===
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -342,7 +342,7 @@ 1:  movl (%ebp),%ebp
jae syscall_badsys
call *sys_call_table(,%eax,4)
movl %eax,PT_EAX(%esp)
-   DISABLE_INTERRUPTS(CLBR_ECX|CLBR_EDX)
+   DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
testw $_TIF_ALLWORK_MASK, %cx
===
--- a/include/asm-i386/paravirt.h
+++ b/include/asm-i386/paravirt.h
@@ -532,7 +532,7 @@ static inline unsigned long __raw_local_
  "popl %%edx; popl %%ecx")
 : "=a"(f)
 : paravirt_type(save_fl),
-  paravirt_clobber(CLBR_NONE)
+  paravirt_clobber(CLBR_EAX)
 : "memory", "cc");
return f;
 }
@@ -617,27 +617,29 @@ 772:; \
.popsection
 
 #define INTERRUPT_RETURN   \
-   PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_ANY,  \
+   PARA_SITE(PARA_PATCH(PARAVIRT_iret), CLBR_NONE, \
  jmp *%cs:paravirt_ops+PARAVIRT_iret)
 
 #define DISABLE_INTERRUPTS(clobbers)   \
PARA_SITE(PARA_PATCH(PARAVIRT_irq_disable), clobbers,   \
- pushl %ecx; pushl %edx;   \
+ pushl %eax; pushl %ecx; pushl %edx;   \
  call *%cs:paravirt_ops+PARAVIRT_irq_disable;  \
- popl %edx; popl %ecx) \
+ popl %edx; popl %ecx; popl %eax)  \
 
 #define ENABLE_INTERRUPTS(clobbers)\
PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable), clobbers,\
- pushl %ecx; pushl %edx;   \
+ pushl %eax; pushl %ecx; pushl %edx;   \
  call *%cs:paravirt_ops+PARAVIRT_irq_enable;   \
- popl %edx; popl %ecx)
+ popl %edx; popl %ecx; popl %eax)
 
 #define ENABLE_INTERRUPTS_SYSEXIT  \
-   PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_ANY,\
+   PARA_SITE(PARA_PATCH(PARAVIRT_irq_enable_sysexit), CLBR_NONE,   \
  jmp *%cs:paravirt_ops+PARAVIRT_irq_enable_sysexit)
 
 #define GET_CR0_INTO_EAX   \
-   call *paravirt_ops+PARAVIRT_read_cr0
+   push %ecx; push %edx;   \
+   call *paravirt_ops+PARAVIRT_read_cr0;   \
+   pop %edx; pop %ecx
 
 #endif /* __ASSEMBLY__ */
 #endif /* CONFIG_PARAVIRT */

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 00/20] paravirt_ops updates

2007-04-04 Thread Jeremy Fitzhardinge
Hi Andi,

Here's a repost of the paravirt_ops update series I posted the other day.
Since then, I found a few potential bugs with patching clobbering,
cleaned up and documented paravirt.h and the patching machinery.

Overview:

add-MAINTAINERS.patch
obvious

remove-CONFIG_DEBUG_PARAVIRT.patch
No longer meaningful or needed.

paravirt-nop.patch
Clean up nop paravirt_ops functions, mainly to allow the patching
machinery to easily identify them.

paravirt-pte-accessors.patch
Accessors to allow pv_ops to control the content of pagetable entries.

paravirt-memory-init.patch
Hook into initial pagetable creation.

paravirt-fixmap.patch
Create a fixmap for early paravirt_ops mappings.

shared-kernel-pmd.patch
Make the choice of whether the kernel pmd is shared between
processes or not a runtime selectable flag.

mm-lifetime-hooks.patch
Hooks to allow the creation, use and destruction of an mm_struct
to be followed.

paravirt-patch-rename-paravirt_patch.patch
Rename a structure to make its use a bit more clear.

paravirt-use-offset-site-ids.patch
Use the offsetof each function pointer in paravirt_ops as the
basis of its patching identifier.

paravirt-fix-clobbers.patch
Fix up various register/use clobber problems.  This may be 2.6.21
material, but I don't think it will materially affect VMI.

paravirt-patchable-call-wrappers.patch
Wrap each paravirt_ops call to allow the callsites to be runtime
patched.

paravirt-document-paravirt_ops.patch
Document the paravirt_ops structure itself, the patching
mechanism, and other cleanups.

paravirt-patch-machinery.patch
General patch machinery for use by pv_ops backends to implment
patching.

paravirt-flush_tlb_others.patch
Add a hook for cross-cpu tlb flushing.

revert-map_pt_hook.patch
Back out the map_pt_hook change.

paravirt-kmap_atomic_pte.patch
Replace map_pt_hook with kmap_atomic_pte.

cleanup-tsc-sched-clock.patch
Clean up the tsc-based sched_clock.  (I think you already
have this.)

paravirt-sched-clock.patch
Add a hook for sched_clock, so that paravirt_ops backends can
report unstolen time for use as the scheduler clock.

apply-to-page-range.patch
Apply a function to a range of pagetable entries.

Thanks,
J

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 14/20] add common patching machinery

2007-04-04 Thread Jeremy Fitzhardinge
Implement the actual patching machinery.  paravirt_patch_default()
contains the logic to automatically patch a callsite based on a few
simple rules:

 - if the paravirt_op function is paravirt_nop, then patch nops
 - if the paravirt_op function is a jmp target, then jmp to it
 - if the paravirt_op function is callable and doesn't clobber too much
for the callsite, call it directly

paravirt_patch_default is suitable as a default implementation of
paravirt_ops.patch, will remove most of the expensive indirect calls
in favour of either a direct call or a pile of nops.

Backends may implement their own patcher, however.  There are several
helper functions to help with this:

paravirt_patch_nop  nop out a callsite
paravirt_patch_ignore   leave the callsite as-is
paravirt_patch_call patch a call if the caller and callee
have compatible clobbers
paravirt_patch_jmp  patch in a jmp
paravirt_patch_insnspatch some literal instructions over
the callsite, if they fit

This patch also implements more direct patches for the native case, so
that when running on native hardware many common operations are
implemented inline.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: Anthony Liguori <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>
---
 arch/i386/kernel/alternative.c |5 -
 arch/i386/kernel/paravirt.c|  164 
 include/asm-i386/paravirt.h|   12 ++
 3 files changed, 149 insertions(+), 32 deletions(-)

===
--- a/arch/i386/kernel/alternative.c
+++ b/arch/i386/kernel/alternative.c
@@ -349,11 +349,14 @@ void apply_paravirt(struct paravirt_patc
used = paravirt_ops.patch(p->instrtype, p->clobbers, p->instr,
  p->len);
 
+   BUG_ON(used > p->len);
+
/* Pad the rest with nops */
nop_out(p->instr + used, p->len - used);
}
 
-   /* Sync to be conservative, in case we patched following instructions */
+   /* Sync to be conservative, in case we patched following
+  instructions */
sync_core();
 }
 #endif /* CONFIG_PARAVIRT */
===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -54,40 +54,142 @@ char *memory_setup(void)
 #define DEF_NATIVE(name, code) \
extern const char start_##name[], end_##name[]; \
asm("start_" #name ": " code "; end_" #name ":")
-DEF_NATIVE(cli, "cli");
-DEF_NATIVE(sti, "sti");
-DEF_NATIVE(popf, "push %eax; popf");
-DEF_NATIVE(pushf, "pushf; pop %eax");
+
+DEF_NATIVE(irq_disable, "cli");
+DEF_NATIVE(irq_enable, "sti");
+DEF_NATIVE(restore_fl, "push %eax; popf");
+DEF_NATIVE(save_fl, "pushf; pop %eax");
 DEF_NATIVE(iret, "iret");
-DEF_NATIVE(sti_sysexit, "sti; sysexit");
-
-static const struct native_insns
-{
-   const char *start, *end;
-} native_insns[] = {
-   [PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
-   [PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
-   [PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
-   [PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
-   [PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
-   [PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, 
end_sti_sysexit },
-};
+DEF_NATIVE(irq_enable_sysexit, "sti; sysexit");
+DEF_NATIVE(read_cr2, "mov %cr2, %eax");
+DEF_NATIVE(write_cr3, "mov %eax, %cr3");
+DEF_NATIVE(read_cr3, "mov %cr3, %eax");
+DEF_NATIVE(clts, "clts");
+DEF_NATIVE(read_tsc, "rdtsc");
+
+DEF_NATIVE(ud2a, "ud2a");
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
-   unsigned int insn_len;
-
-   /* Don't touch it if we don't have a replacement */
-   if (type >= ARRAY_SIZE(native_insns) || !native_insns[type].start)
-   return len;
-
-   insn_len = native_insns[type].end - native_insns[type].start;
-
-   /* Similarly if we can't fit replacement. */
-   if (len < insn_len)
-   return len;
-
-   memcpy(insns, native_insns[type].start, insn_len);
+   const unsigned char *start, *end;
+   unsigned ret;
+
+   switch(type) {
+#define SITE(x)case PARAVIRT_PATCH(x): start = start_##x; end = 
end_##x; goto patch_site
+   SITE(irq_disable);
+   SITE(irq_enable);
+   SITE(restore_fl);
+   SITE(save_fl);
+   SITE(iret);
+   SITE(irq_enable_sysexit);
+   SITE(read_cr2);
+   SITE(read_cr3);
+   SITE(write_cr3);
+   SITE(clts);
+   SITE(read_tsc);
+#undef SITE
+
+   patch_site:
+   ret = paravirt_patch_insns(insns, len, start

[patch 05/20] Hooks to set up initial pagetable

2007-04-04 Thread Jeremy Fitzhardinge
This patch introduces paravirt_ops hooks to control how the kernel's
initial pagetable is set up.

In the case of a native boot, the very early bootstrap code creates a
simple non-PAE pagetable to map the kernel and physical memory.  When
the VM subsystem is initialized, it creates a proper pagetable which
respects the PAE mode, large pages, etc.

When booting under a hypervisor, there are many possibilities for what
paging environment the hypervisor establishes for the guest kernel, so
the constructon of the kernel's pagetable depends on the hypervisor.

In the case of Xen, the hypervisor boots the kernel with a fully
constructed pagetable, which is already using PAE if necessary.  Also,
Xen requires particular care when constructing pagetables to make sure
all pagetables are always mapped read-only.

In order to make this easier, kernel's initial pagetable construction
has been changed to only allocate and initialize a pagetable page if
there's no page already present in the pagetable.  This allows the Xen
paravirt backend to make a copy of the hypervisor-provided pagetable,
allowing the kernel to establish any more mappings it needs while
keeping the existing ones.

A slightly subtle point which is worth highlighting here is that Xen
requires all kernel mappings to share the same pte_t pages between all
pagetables, so that updating a kernel page's mapping in one pagetable
is reflected in all other pagetables.  This makes it possible to
allocate a page and attach it to a pagetable without having to
explicitly enumerate that page's mapping in all pagetables.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: William Irwin <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |3 
 arch/i386/mm/init.c |  158 +--
 include/asm-i386/paravirt.h |   17 
 include/asm-i386/pgtable.h  |   16 
 4 files changed, 142 insertions(+), 52 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -476,6 +476,9 @@ struct paravirt_ops paravirt_ops = {
 #endif
.set_lazy_mode = paravirt_nop,
 
+   .pagetable_setup_start = native_pagetable_setup_start,
+   .pagetable_setup_done = native_pagetable_setup_done,
+
.flush_tlb_user = native_flush_tlb,
.flush_tlb_kernel = native_flush_tlb_global,
.flush_tlb_single = native_flush_tlb_single,
===
--- a/arch/i386/mm/init.c
+++ b/arch/i386/mm/init.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 unsigned int __VMALLOC_RESERVE = 128 << 20;
 
@@ -62,6 +63,7 @@ static pmd_t * __init one_md_table_init(

 #ifdef CONFIG_X86_PAE
pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
+
paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
pud = pud_offset(pgd, 0);
@@ -83,12 +85,10 @@ static pte_t * __init one_page_table_ini
 {
if (pmd_none(*pmd)) {
pte_t *page_table = (pte_t *) 
alloc_bootmem_low_pages(PAGE_SIZE);
+
paravirt_alloc_pt(__pa(page_table) >> PAGE_SHIFT);
set_pmd(pmd, __pmd(__pa(page_table) | _PAGE_TABLE));
-   if (page_table != pte_offset_kernel(pmd, 0))
-   BUG();  
-
-   return page_table;
+   BUG_ON(page_table != pte_offset_kernel(pmd, 0));
}

return pte_offset_kernel(pmd, 0);
@@ -119,7 +119,7 @@ static void __init page_table_range_init
pgd = pgd_base + pgd_idx;
 
for ( ; (pgd_idx < PTRS_PER_PGD) && (vaddr != end); pgd++, pgd_idx++) {
-   if (pgd_none(*pgd)) 
+   if (!(pgd_val(*pgd) & _PAGE_PRESENT))
one_md_table_init(pgd);
pud = pud_offset(pgd, vaddr);
pmd = pmd_offset(pud, vaddr);
@@ -158,7 +158,11 @@ static void __init kernel_physical_mappi
pfn = 0;
 
for (; pgd_idx < PTRS_PER_PGD; pgd++, pgd_idx++) {
-   pmd = one_md_table_init(pgd);
+   if (!(pgd_val(*pgd) & _PAGE_PRESENT))
+   pmd = one_md_table_init(pgd);
+   else
+   pmd = pmd_offset(pud_offset(pgd, PAGE_OFFSET), 
PAGE_OFFSET);
+
if (pfn >= max_low_pfn)
continue;
for (pmd_idx = 0; pmd_idx < PTRS_PER_PMD && pfn < max_low_pfn; 
pmd++, pmd_idx++) {
@@ -167,20 +171,26 @@ static void __init kernel_physical_mappi
/* Map with big pages if possible, otherwise create 
normal page tables. */
if (cpu_has_pse) {
unsigned int address2 = (pfn + PTRS_PER_PTE - 
1) * PAGE_SIZE + PAGE_OFFSET + PAGE_SIZE-1;
-
-   if (is_kernel_text(address) |

[patch 20/20] Add apply_to_page_range() which applies a function to a pte range.

2007-04-04 Thread Jeremy Fitzhardinge
Add a new mm function apply_to_page_range() which applies a given
function to every pte in a given virtual address range in a given mm
structure. This is a generic alternative to cut-and-pasting the Linux
idiomatic pagetable walking code in every place that a sequence of
PTEs must be accessed.

Although this interface is intended to be useful in a wide range of
situations, it is currently used specifically by several Xen
subsystems, for example: to ensure that pagetables have been allocated
for a virtual address range, and to construct batched special
pagetable update requests to map I/O memory (in ioremap()).

Signed-off-by: Ian Pratt <[EMAIL PROTECTED]>
Signed-off-by: Christian Limpach <[EMAIL PROTECTED]>
Signed-off-by: Chris Wright <[EMAIL PROTECTED]>
Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Christoph Lameter <[EMAIL PROTECTED]>
Cc: Matt Mackall <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]> 

---
 include/linux/mm.h |5 ++
 mm/memory.c|   94 
 2 files changed, 99 insertions(+)

===
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1135,6 +1135,11 @@ struct page *follow_page(struct vm_area_
 #define FOLL_GET   0x04/* do get_page on page */
 #define FOLL_ANON  0x08/* give ZERO_PAGE if no pgtable */
 
+typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long addr,
+   void *data);
+extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
+  unsigned long size, pte_fn_t fn, void *data);
+
 #ifdef CONFIG_PROC_FS
 void vm_stat_account(struct mm_struct *, unsigned long, struct file *, long);
 #else
===
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1448,6 +1448,100 @@ int remap_pfn_range(struct vm_area_struc
 }
 EXPORT_SYMBOL(remap_pfn_range);
 
+static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pte_t *pte;
+   int err;
+   struct page *pmd_page;
+   spinlock_t *ptl;
+
+   pte = (mm == &init_mm) ?
+   pte_alloc_kernel(pmd, addr) :
+   pte_alloc_map_lock(mm, pmd, addr, &ptl);
+   if (!pte)
+   return -ENOMEM;
+
+   BUG_ON(pmd_huge(*pmd));
+
+   pmd_page = pmd_page(*pmd);
+
+   do {
+   err = fn(pte, pmd_page, addr, data);
+   if (err)
+   break;
+   } while (pte++, addr += PAGE_SIZE, addr != end);
+
+   if (mm != &init_mm)
+   pte_unmap_unlock(pte-1, ptl);
+   return err;
+}
+
+static int apply_to_pmd_range(struct mm_struct *mm, pud_t *pud,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pmd_t *pmd;
+   unsigned long next;
+   int err;
+
+   pmd = pmd_alloc(mm, pud, addr);
+   if (!pmd)
+   return -ENOMEM;
+   do {
+   next = pmd_addr_end(addr, end);
+   err = apply_to_pte_range(mm, pmd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pmd++, addr = next, addr != end);
+   return err;
+}
+
+static int apply_to_pud_range(struct mm_struct *mm, pgd_t *pgd,
+unsigned long addr, unsigned long end,
+pte_fn_t fn, void *data)
+{
+   pud_t *pud;
+   unsigned long next;
+   int err;
+
+   pud = pud_alloc(mm, pgd, addr);
+   if (!pud)
+   return -ENOMEM;
+   do {
+   next = pud_addr_end(addr, end);
+   err = apply_to_pmd_range(mm, pud, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pud++, addr = next, addr != end);
+   return err;
+}
+
+/*
+ * Scan a region of virtual memory, filling in page tables as necessary
+ * and calling a provided function on each leaf page table.
+ */
+int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
+   unsigned long size, pte_fn_t fn, void *data)
+{
+   pgd_t *pgd;
+   unsigned long next;
+   unsigned long end = addr + size;
+   int err;
+
+   BUG_ON(addr >= end);
+   pgd = pgd_offset(mm, addr);
+   do {
+   next = pgd_addr_end(addr, end);
+   err = apply_to_pud_range(mm, pgd, addr, next, fn, data);
+   if (err)
+   break;
+   } while (pgd++, addr = next, addr != end);
+   return err;
+}
+EXPORT_SYMBOL_GPL(apply_to_page_range);
+
 /*
  * handle_pte_fault chooses page fault handler according to an entry
  * which was read non-atomically.  Before making any commitment, on

-- 

[patch 10/20] Use patch site IDs computed from offset in paravirt_ops structure

2007-04-04 Thread Jeremy Fitzhardinge
Use patch type identifiers derived from the offset of the operation in
the paravirt_ops structure.  This avoids having to maintain a separate
enum for patch site types.

Also, since the identifier is derived from the offset into
paravirt_ops, the offset can be derived from the identifier.  This is
used to remove replicated information in the various callsite macros,
which has been a source of bugs in the past.

This patch also drops the fused save_fl+cli operation, which doesn't
really add much and makes things more complex - specifically because
it breaks the 1:1 relationship between identifiers and offsets.  If
this operation turns out to be particularly beneficial, then the right
answer is to define a new entrypoint for it.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Rusty Russell <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c |   14 +--
 arch/i386/kernel/vmi.c  |   39 +
 include/asm-i386/paravirt.h |  179 ++-
 3 files changed, 105 insertions(+), 127 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -58,7 +58,6 @@ DEF_NATIVE(sti, "sti");
 DEF_NATIVE(sti, "sti");
 DEF_NATIVE(popf, "push %eax; popf");
 DEF_NATIVE(pushf, "pushf; pop %eax");
-DEF_NATIVE(pushf_cli, "pushf; pop %eax; cli");
 DEF_NATIVE(iret, "iret");
 DEF_NATIVE(sti_sysexit, "sti; sysexit");
 
@@ -66,13 +65,12 @@ static const struct native_insns
 {
const char *start, *end;
 } native_insns[] = {
-   [PARAVIRT_IRQ_DISABLE] = { start_cli, end_cli },
-   [PARAVIRT_IRQ_ENABLE] = { start_sti, end_sti },
-   [PARAVIRT_RESTORE_FLAGS] = { start_popf, end_popf },
-   [PARAVIRT_SAVE_FLAGS] = { start_pushf, end_pushf },
-   [PARAVIRT_SAVE_FLAGS_IRQ_DISABLE] = { start_pushf_cli, end_pushf_cli },
-   [PARAVIRT_INTERRUPT_RETURN] = { start_iret, end_iret },
-   [PARAVIRT_STI_SYSEXIT] = { start_sti_sysexit, end_sti_sysexit },
+   [PARAVIRT_PATCH(irq_disable)] = { start_cli, end_cli },
+   [PARAVIRT_PATCH(irq_enable)] = { start_sti, end_sti },
+   [PARAVIRT_PATCH(restore_fl)] = { start_popf, end_popf },
+   [PARAVIRT_PATCH(save_fl)] = { start_pushf, end_pushf },
+   [PARAVIRT_PATCH(iret)] = { start_iret, end_iret },
+   [PARAVIRT_PATCH(irq_enable_sysexit)] = { start_sti_sysexit, 
end_sti_sysexit },
 };
 
 static unsigned native_patch(u8 type, u16 clobbers, void *insns, unsigned len)
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -78,11 +78,6 @@ static struct {
 #define MNEM_JMP  0xe9
 #define MNEM_RET  0xc3
 
-static char irq_save_disable_callout[] = {
-   MNEM_CALL, 0, 0, 0, 0,
-   MNEM_CALL, 0, 0, 0, 0,
-   MNEM_RET
-};
 #define IRQ_PATCH_INT_MASK 0
 #define IRQ_PATCH_DISABLE  5
 
@@ -130,33 +125,17 @@ static unsigned vmi_patch(u8 type, u16 c
 static unsigned vmi_patch(u8 type, u16 clobbers, void *insns, unsigned len)
 {
switch (type) {
-   case PARAVIRT_IRQ_DISABLE:
+   case PARAVIRT_PATCH(irq_disable):
return patch_internal(VMI_CALL_DisableInterrupts, len, 
insns);
-   case PARAVIRT_IRQ_ENABLE:
+   case PARAVIRT_PATCH(irq_enable):
return patch_internal(VMI_CALL_EnableInterrupts, len, 
insns);
-   case PARAVIRT_RESTORE_FLAGS:
+   case PARAVIRT_PATCH(restore_fl):
return patch_internal(VMI_CALL_SetInterruptMask, len, 
insns);
-   case PARAVIRT_SAVE_FLAGS:
+   case PARAVIRT_PATCH(save_fl):
return patch_internal(VMI_CALL_GetInterruptMask, len, 
insns);
-   case PARAVIRT_SAVE_FLAGS_IRQ_DISABLE:
-   if (len >= 10) {
-   patch_internal(VMI_CALL_GetInterruptMask, len, 
insns);
-   patch_internal(VMI_CALL_DisableInterrupts, 
len-5, insns+5);
-   return 10;
-   } else {
-   /*
-* You bastards didn't leave enough room to
-* patch save_flags_irq_disable inline.  Patch
-* to a helper
-*/
-   BUG_ON(len < 5);
-   *(char *)insns = MNEM_CALL;
-   patch_offset(insns, irq_save_disable_callout);
-   return 5;
-   }
-   case PARAVIRT_INTERRUPT_RETURN:
+   case PARAVIRT_PATCH(iret):
return patch_internal(VMI_CALL_IRET, len, insns);
-   case PARAVIRT_STI_SYSEXIT:
+   case PARAVIRT_PATCH(irq_enable_sysexit):
r

[patch 04/20] Add pagetable accessors to pack and unpack pagetable entries

2007-04-04 Thread Jeremy Fitzhardinge
Add a set of accessors to pack, unpack and modify page table entries
(at all levels).  This allows a paravirt implementation to control the
contents of pgd/pmd/pte entries.  For example, Xen uses this to
convert the (pseudo-)physical address into a machine address when
populating a pagetable entry, and converting back to pphys address
when an entry is read.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Acked-by: Ingo Molnar <[EMAIL PROTECTED]>

---
 arch/i386/kernel/paravirt.c   |   84 +
 arch/i386/kernel/vmi.c|6 +-
 include/asm-i386/page.h   |   79 +-
 include/asm-i386/paravirt.h   |   52 +-
 include/asm-i386/pgtable-2level.h |   28 +---
 include/asm-i386/pgtable-3level.h |   65 +---
 include/asm-i386/pgtable.h|2 
 7 files changed, 186 insertions(+), 130 deletions(-)

===
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -399,78 +399,6 @@ static void native_flush_tlb_single(u32 
 {
__native_flush_tlb_single(addr);
 }
-
-#ifndef CONFIG_X86_PAE
-static void native_set_pte(pte_t *ptep, pte_t pteval)
-{
-   *ptep = pteval;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, 
pte_t pteval)
-{
-   *ptep = pteval;
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-   *pmdp = pmdval;
-}
-
-#else /* CONFIG_X86_PAE */
-
-static void native_set_pte(pte_t *ptep, pte_t pte)
-{
-   ptep->pte_high = pte.pte_high;
-   smp_wmb();
-   ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, 
pte_t pte)
-{
-   ptep->pte_high = pte.pte_high;
-   smp_wmb();
-   ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_present(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep, pte_t pte)
-{
-   ptep->pte_low = 0;
-   smp_wmb();
-   ptep->pte_high = pte.pte_high;
-   smp_wmb();
-   ptep->pte_low = pte.pte_low;
-}
-
-static void native_set_pte_atomic(pte_t *ptep, pte_t pteval)
-{
-   set_64bit((unsigned long long *)ptep,pte_val(pteval));
-}
-
-static void native_set_pmd(pmd_t *pmdp, pmd_t pmdval)
-{
-   set_64bit((unsigned long long *)pmdp,pmd_val(pmdval));
-}
-
-static void native_set_pud(pud_t *pudp, pud_t pudval)
-{
-   *pudp = pudval;
-}
-
-static void native_pte_clear(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep)
-{
-   ptep->pte_low = 0;
-   smp_wmb();
-   ptep->pte_high = 0;
-}
-
-static void native_pmd_clear(pmd_t *pmd)
-{
-   u32 *tmp = (u32 *)pmd;
-   *tmp = 0;
-   smp_wmb();
-   *(tmp + 1) = 0;
-}
-#endif /* CONFIG_X86_PAE */
 
 /* These are in entry.S */
 extern void native_iret(void);
@@ -565,13 +493,25 @@ struct paravirt_ops paravirt_ops = {
.set_pmd = native_set_pmd,
.pte_update = paravirt_nop,
.pte_update_defer = paravirt_nop,
+
+   .ptep_get_and_clear = native_ptep_get_and_clear,
+
 #ifdef CONFIG_X86_PAE
.set_pte_atomic = native_set_pte_atomic,
.set_pte_present = native_set_pte_present,
.set_pud = native_set_pud,
.pte_clear = native_pte_clear,
.pmd_clear = native_pmd_clear,
+
+   .pmd_val = native_pmd_val,
+   .make_pmd = native_make_pmd,
 #endif
+
+   .pte_val = native_pte_val,
+   .pgd_val = native_pgd_val,
+
+   .make_pte = native_make_pte,
+   .make_pgd = native_make_pgd,
 
.irq_enable_sysexit = native_irq_enable_sysexit,
.iret = native_iret,
===
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -444,13 +444,13 @@ static void vmi_release_pd(u32 pfn)
 ((level) | (is_current_as(mm, user) ?   \
 (VMI_PAGE_DEFER | VMI_PAGE_CURRENT_AS | ((addr) & 
VMI_PAGE_VA_MASK)) : 0))
 
-static void vmi_update_pte(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep)
 {
vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
vmi_ops.update_pte(ptep, vmi_flags_addr(mm, addr, VMI_PAGE_PT, 0));
 }
 
-static void vmi_update_pte_defer(struct mm_struct *mm, u32 addr, pte_t *ptep)
+static void vmi_update_pte_defer(struct mm_struct *mm, unsigned long addr, 
pte_t *ptep)
 {
vmi_check_page_type(__pa(ptep) >> PAGE_SHIFT, VMI_PAGE_PTE);
vmi_ops.update_pte(ptep, vmi_flags_addr_defer(mm, addr, VMI_PAGE_PT, 
0));
@@ -463,7 +463,7 @@ static void vmi_set_pte(pte_t *ptep, pte
vmi_ops.set_pte(pte, ptep, VMI_PAGE_PT);
 }
 
-static void vmi_set_pte_at(struct mm_struct *mm, u32 addr, pte_t *ptep, pte_t 
pte)
+static void vmi_set_pte_at(struct mm_struct *mm, unsigned long addr, pte_t 
*ptep, pte_t pte)
 {
vmi_check_page_type(__p

Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Christoph Lameter
Acked-by: Christoph Lameter <[EMAIL PROTECTED]>

for all thats worth since I am not a i386 specialist.

How much of the issues with page struct sharing between slab and arch code 
does this address?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Jeremy Fitzhardinge
Christoph Lameter wrote:
> Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
>
> for all thats worth since I am not a i386 specialist.
>
> How much of the issues with page struct sharing between slab and arch code 
> does this address?
>   

I haven't been following that thread as closely as I should be, so I
don't have an answer.  I guess the interesting thing in this patch is
that it only uses the pmd cache for usermode pmds (which are
pre-zeroed), and normal page allocations for kernel pmds.  Also, if the
kernel pmds are unshared, the pgds are page-sized, so its not really
making good use of the pgd cache.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


New CPUID/MSR driver; virtualization hooks

2007-04-04 Thread H. Peter Anvin
I have finally gotten off the pot and finished writing up my new 
CPUID/MSR driver, which contains support for registers that need 
arbitrary GPRs touched.  For i386 vs x86-64 compatibility, both use an 
x86-64 register image (16 64-bit register fields); this allows 32-bit 
userspace to access the full 64-bit image if the kernel is 64 bits.

Anyway, this presumably requires new paravirtualization hooks.  The 
patch is at:

http://www.kernel.org/pub/linux/kernel/people/hpa/new-cpuid-msr.patch

... and a git tree is at ...

http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-cpuidmsr.git;a=summary

I'm posting this here to give the paravirt maintainers an opportunity to 
comment.  Presumably the functions that need to be paravirtualized are 
the ones represented by the functions do_cpuid(), do_rdmsr() and 
do_wrmsr(): they take a cpu number, an input register image, and an 
output register image, and return either 0 or -EIO (in case of a trap.)

-hpa


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: New CPUID/MSR driver; virtualization hooks

2007-04-04 Thread Chris Wright
* H. Peter Anvin ([EMAIL PROTECTED]) wrote:
> I have finally gotten off the pot and finished writing up my new 
> CPUID/MSR driver, which contains support for registers that need 
> arbitrary GPRs touched.  For i386 vs x86-64 compatibility, both use an 
> x86-64 register image (16 64-bit register fields); this allows 32-bit 
> userspace to access the full 64-bit image if the kernel is 64 bits.
> 
> Anyway, this presumably requires new paravirtualization hooks.  The 
> patch is at:
> 
> http://www.kernel.org/pub/linux/kernel/people/hpa/new-cpuid-msr.patch

Not mirrored out yet
> 
> ... and a git tree is at ...
> 
> http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-cpuidmsr.git;a=summary

Bleah, and gitweb is unhappy ATM too.

> I'm posting this here to give the paravirt maintainers an opportunity to 
> comment.  Presumably the functions that need to be paravirtualized are 
> the ones represented by the functions do_cpuid(), do_rdmsr() and 
> do_wrmsr(): they take a cpu number, an input register image, and an 
> output register image, and return either 0 or -EIO (in case of a trap.)

Yes, so currently cpuid, for example, is like this:

do_cpuid
  cpuid
   __cpuid

Where __cpuid is
native_cpuid() on !CONFIG_PARAVIRT (include/asm-i386/processor.h)
(and this is real asm("cpuid"))
and
paravirt_ops.cpuid() on CONFIG_PARAVIRT (

Without having seen the patch yet, you'll need to make sure
that the final point which is issuing asm("cpuid") is wrapped
and split to CONFIG_PARAVIRT and non CONFIG_PARAVIRT modes.

Similar for rdmsr:

do_rdmsr
  rdmsr_eio
rdmsr_safe

Where rdmsr is paravirtualized
rdmsr is asm("rdmsr") on !CONFIG_PARAVIRT (include/asm-i386/msr.h)
and
paravirt_ops.read_msr() on CONFIG_PARAVIRT (include/asm-i386/paravirt.h)

Similar for do_wrmsr.

Does that answer your question?

thanks,
-chris
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: New CPUID/MSR driver; virtualization hooks

2007-04-04 Thread H. Peter Anvin
Chris Wright wrote:
>>
>> http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-cpuidmsr.git;a=summary
> 
> Bleah, and gitweb is unhappy ATM too.
> 

??? Works for me?

> Without having seen the patch yet, you'll need to make sure
> that the final point which is issuing asm("cpuid") is wrapped
> and split to CONFIG_PARAVIRT and non CONFIG_PARAVIRT modes.

It's not *quite* that easy.  The assembly code around this is pretty 
extensive, because it has to stand on its head in order to present the 
proper register image.

Pretty much as far as I can see it, there are two possible points where 
one can break out CONFIG_PARAVIRT:

a) int do_foo(int cpu, const u64 ireg[16], u64 oreg[16]);

b) int foo_everything(const u64 ireg[16], u64 oreg[16]);

The difference, of course, is that the former is invoked on the 
originating CPU and the latter on the target CPU at interrupt level. 
Those are pretty much the choices.

-hpa
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 07/20] Allow paravirt backend to choose kernel PMD sharing

2007-04-04 Thread Chris Wright
* Christoph Lameter ([EMAIL PROTECTED]) wrote:
> Acked-by: Christoph Lameter <[EMAIL PROTECTED]>
> 
> for all thats worth since I am not a i386 specialist.
> 
> How much of the issues with page struct sharing between slab and arch code 
> does this address?

I think the answer is 'none yet.'  It uses page sized slab and still
needs pgd_list, for example.  But the mm_list chaining should work too,
so it shouldn't make things any worse.

thanks,
-chris
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: New CPUID/MSR driver; virtualization hooks

2007-04-04 Thread Tony Breeds
On Wed, Apr 04, 2007 at 05:50:58PM -0700, H. Peter Anvin wrote:
> I have finally gotten off the pot and finished writing up my new 
> CPUID/MSR driver, which contains support for registers that need 
> arbitrary GPRs touched.  For i386 vs x86-64 compatibility, both use an 
> x86-64 register image (16 64-bit register fields); this allows 32-bit 
> userspace to access the full 64-bit image if the kernel is 64 bits.
> 
> Anyway, this presumably requires new paravirtualization hooks.  The 
> patch is at:
> 
> http://www.kernel.org/pub/linux/kernel/people/hpa/new-cpuid-msr.patch

I think you mean?

http://www.kernel.org/pub/linux/kernel/people/hpa/new-msr-cpuid.patch

Yours Tony

  linux.conf.auhttp://linux.conf.au/ || http://lca2008.linux.org.au/
  Jan 28 - Feb 02 2008 The Australian Linux Technical Conference!

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Unified lguest launcher

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 13:03 -0300, Glauber de Oliveira Costa wrote:
> This is a new version of the unified lguest launcher that applies to
> the current tree. According to rusty's suggestion, I'm bothering less
> to be able to load 32 bit kernels on 64-bit machines: changing the
> launcher for such case would be the easy part! In the absence of
> further objections, I'll commit it.
> 
> Signed-off-by: Glauber de Oliveira Costa <[EMAIL PROTECTED]>

Hi Glauber!

The patch looks more than reasonable, but I think we can go further with
the abstraction.  If you could spin it again, I'll apply it.  There may
be more cleanups after that, but I don't want to hold up your progress!

> --- /dev/null
> +++ linux-2.6.20/Documentation/lguest/i386/defines
> @@ -0,0 +1,4 @@
> +# We rely on CONFIG_PAGE_OFFSET to know where to put lguest binary.
> +# Some shells (dash - ubunu) can't handle numbers that big so we cheat.
> +include ../../.config
> +LGUEST_GUEST_TOP := ($(CONFIG_PAGE_OFFSET) - 0x0800)

The include needs another ../ and seems redundant (the .config is
included from the Makefile anyway).

The shells comment is obsolete and should be deleted too, my bad.

> +++ linux-2.6.20/Documentation/lguest/i386/lguest_defs.h
> @@ -0,0 +1,9 @@
> +#ifndef _LGUEST_DEFS_H_
> +#define _LGUEST_DEFS_H_
> +
> +/* LGUEST_TOP_ADDRESS comes from the Makefile */
> +#define RESERVE_TOP_ADDRESS LGUEST_GUEST_TOP - 1024*1024

Why -1M?  And RESERVE_TOP_ADDRESS isn't used in this patch?

> +static unsigned long map_elf(int elf_fd, const void *hdr, 
>  unsigned long *page_offset)
>  {
> -   void *addr;
> +#ifndef __x86_64__
> +   const Elf32_Ehdr *ehdr = hdr;
> Elf32_Phdr phdr[ehdr->e_phnum];
> +#else
> +   const Elf64_Ehdr *ehdr = hdr;
> +   Elf64_Phdr phdr[ehdr->e_phnum];
> +#endif

The way we did this in the module code was to define Elf_Ehdr etc in the
arch-specific headers to avoid ifdefs.  I think it would help this code,
too. 

> +   || ((ehdr->e_machine != EM_386) &&
> +   (ehdr->e_machine != EM_X86_64))

Similarly define ELF_MACHINE?

>else if (*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
> +   else if ((*page_offset != phdr[i].p_vaddr - phdr[i].p_paddr)
> +#ifdef __x86_64__
> +&& (phdr[i].p_vaddr != VSYSCALL_START)
> +#endif
> +   )

Hmm, static inline bool is_vsyscall_segment(const Elf_Phdr *) maybe?

> +/* LGUEST_TOP_ADDRESS comes from the Makefile */
> +typedef uint64_t u64;
> +#include "../../../include/asm/lguest_user.h"
> +
> +#define RESERVE_TOP_ADDRESS LGUEST_GUEST_TOP
> +
> +
> +#define BOOT_PGTABLE "boot_level4_pgt"

The comment should refer to LGUEST_GUEST_TOP?

I think the typedef should be in the main code with the others: it
doesn't hurt i386 and it's neater.

I'm not sure the BOOT_PGTABLE define helps us here, either; it might be
clearer just to put it directly into the code.

Cheers!
Rusty.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 14:23 -0400, Steven Rostedt wrote:
> This is taken from the work I did on lguest64.
> 
> When killing a guest, we read the guest stack to do a nice back trace of
> the guest and send it via printk to the host.
> 
> So instead of just getting an error message from the lguest launcher of:
> 
> lguest: bad read address 537012178 len 1
> 
> I also get in my dmesg:
> 
> called from  [] show_trace_log_lvl+0x1a/0x2f

Hi Steven,

This is a cool idea, but there are two issues with this patch.  The
first is that it's 500 lines of code: that's around +10% on lguest's
total code size!  The second is that it conflicts with the medium-term
plan to allow any user to run up lguests: this is why lg.ko never
printk()s about problems with the guest.

While it is useful for cases where a guest dies mysteriously before it
brings up the console, three alternatives come to mind:

1) Modify early_printk so Guests can use it.
2) Have a separate tool(-set?) for this kind of post-mortem.  Then you
just have to implement guest suspend! 8)
3) Put this in a CONFIG_LGUEST_DEBUG.

Note that options 1 or 2 make you do more work, but are probably better
in the long term.  I'm happy for #3 to sit as a patch in the tree for
the duration, tho!

Cheers,
Rusty.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 15:07 -0400, Steven Rostedt wrote:
> [Bug that was found by my previous patch]
> 
> This patch allows things like modules, which don't have a direct
> __pa(EIP) mapping to do emulated instructions.
> 
> Sure, the emulated instruction probably should be a paravirt_op, but
> this patch lets you at least boot a kernel that has modules needing
> emulated instructions.

Yeah, I haven't tried loading random modules but I can imagine this does
happen (what module was it, BTW?)

I used to have a function just like this, but managed to get rid of
it.  

Hmm, perhaps we should have an "int lgread_virt_byte(u8 *)" which does
the pgtable walk and read all in one?  It won't be efficient, but it'll
be more correct and maybe even fewer lines 8)

Thanks for the patch!
Rusty.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 15:14 -0400, Steven Rostedt wrote:
> Currently the lguest32 error messages from bad reads and writes prints a
> decimal integer for addresses. This is pretty annoying. So this patch
> changes those to be hex outputs.

(Erk, I wonder what I was thinking when I wrote that?)

Can I ask for %#x (or 0x%x)?  I'm easily confused.

Thanks!
Rusty.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Kyle Moffett
On Apr 04, 2007, at 23:01:30, Rusty Russell wrote:
> On Wed, 2007-04-04 at 15:14 -0400, Steven Rostedt wrote:
>> Currently the lguest32 error messages from bad reads and writes  
>> prints a decimal integer for addresses. This is pretty annoying.  
>> So this patch changes those to be hex outputs.
>
> (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> for %#x (or 0x%x)?  I'm easily confused.

How about "%p" for pointers?

Cheers,
Kyle Moffett


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Steven Rostedt
On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:

> > (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> > for %#x (or 0x%x)?  I'm easily confused.
> 
> How about "%p" for pointers?

But that would require casting the numbers to pointers.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] lguest32 kallsyms backtrace of guest.

2007-04-04 Thread Steven Rostedt
On Thu, 2007-04-05 at 12:54 +1000, Rusty Russell wrote:

> 
>   This is a cool idea, but there are two issues with this patch.  The
> first is that it's 500 lines of code: that's around +10% on lguest's
> total code size!  The second is that it conflicts with the medium-term
> plan to allow any user to run up lguests: this is why lg.ko never
> printk()s about problems with the guest.

Not much I can do about the size, but it's in the debug section so
hopefully it's not considered too bad :)

> 
> While it is useful for cases where a guest dies mysteriously before it
> brings up the console, three alternatives come to mind:
> 
> 1) Modify early_printk so Guests can use it.
> 2) Have a separate tool(-set?) for this kind of post-mortem.  Then you
> just have to implement guest suspend! 8)
> 3) Put this in a CONFIG_LGUEST_DEBUG.
> 
> Note that options 1 or 2 make you do more work, but are probably better
> in the long term.  I'm happy for #3 to sit as a patch in the tree for
> the duration, tho!

OK, I'll make a #3 patch to send, but the #1 looks best. Not to mention
that I still need to make it so that the console can read it.

-- Steve


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32, use guest page tables to find paddr for emulated instructions

2007-04-04 Thread Steven Rostedt
On Thu, 2007-04-05 at 12:59 +1000, Rusty Russell wrote:
> On Wed, 2007-04-04 at 15:07 -0400, Steven Rostedt wrote:

> Yeah, I haven't tried loading random modules but I can imagine this does
> happen (what module was it, BTW?)

I have no idea of which module it crashed on. I didn't investigate that
too much.  I could simply send a trap to guest when 
__pa(addr) != lguest_find_guest_paddr(addr) and see which module it
crashed on.

My block device I used was basically a copy of a RHEL5 system. I only
modified the inittab and fstab to get it working.  So on startup and
doing the udev init was when it crashed.

> 
> I used to have a function just like this, but managed to get rid of
> it.  
> 
> Hmm, perhaps we should have an "int lgread_virt_byte(u8 *)" which does
> the pgtable walk and read all in one?  It won't be efficient, but it'll
> be more correct and maybe even fewer lines 8)

I forgot that you have a goal to keep lguest small :)

Perhaps we can fork, and have lguest and lguest-lite.

-- Steve

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread Rusty Russell
On Wed, 2007-04-04 at 23:14 -0400, Steven Rostedt wrote:
> On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:
> 
> > > (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> > > for %#x (or 0x%x)?  I'm easily confused.
> > 
> > How about "%p" for pointers?
> 
> But that would require casting the numbers to pointers.

And the kernel's printk doesn't put 0x on pointers anyway, last I
checked 8(

Rusty.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: New CPUID/MSR driver; virtualization hooks

2007-04-04 Thread Zachary Amsden
H. Peter Anvin wrote:
> I have finally gotten off the pot and finished writing up my new 
> CPUID/MSR driver, which contains support for registers that need 
> arbitrary GPRs touched.  For i386 vs x86-64 compatibility, both use an 
> x86-64 register image (16 64-bit register fields); this allows 32-bit 
> userspace to access the full 64-bit image if the kernel is 64 bits.
>
> Anyway, this presumably requires new paravirtualization hooks.  The 
> patch is at:
>
> http://www.kernel.org/pub/linux/kernel/people/hpa/new-cpuid-msr.patch
>   
The requested URL /pub/linux/kernel/people/hpa/new-cpuid-msr.patch was 
not found on this server.

> ... and a git tree is at ...
>
> http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-cpuidmsr.git;a=summary
>
> I'm posting this here to give the paravirt maintainers an opportunity to 
> comment.  Presumably the functions that need to be paravirtualized are 
> the ones represented by the functions do_cpuid(), do_rdmsr() and
>   

rdmsr / wrmsr can be dropped from paravirt-ops; at least for us (they 
will trap and emulate just fine, and this driver is not performance 
critical), and I think for the others as well.  CPUID, however, does 
require a hook.

Zach
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: New CPUID/MSR driver; virtualization hooks

2007-04-04 Thread Zachary Amsden
H. Peter Anvin wrote:
> It's not *quite* that easy.  The assembly code around this is pretty 
> extensive, because it has to stand on its head in order to present the 
> proper register image.
>   

Having just stood on my head for 55 breaths, might I suggest we 
implement a binary equivalent CPUID paravirt-ops wrapper; then the 
assembly code can just call CPUID and we can redefine it to call to a 
stub, which makes the pv-ops CPUID call, the puts the outputs back in 
the proper registers.

The VMI ROM CPUID is binary identical in register format with the native 
instruction, so it need not do a headstand.  For the others,

ENTRY(paravirt_raw_cpuid)
pushl %edx
pushl %ecx
pushl %ebx
pushl %eax
mov %esp, %eax/* pointer for eax in/out */
leal 4(%esp), %edx /* pointer for ebx in/out */
leal 8(%esp), %ecx /* pointer for ecx in/out */
leal 12(%esp), %ebx /* pointer for edx in/out */
pushl %ebx  /* arg 4 passed on stack */
call *(paravirt_ops+pv_offset_CPUID)
addl $4, %esp
popl %eax
popl %ebx
popl %ecx
popl %edx
ret

Should do the right thing.  In any case, thanks for the heads up.

Zach
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.

2007-04-04 Thread Matt Mackall
On Wed, Apr 04, 2007 at 12:12:11PM -0700, Jeremy Fitzhardinge wrote:
> Add a new mm function apply_to_page_range() which applies a given
> function to every pte in a given virtual address range in a given mm
> structure. This is a generic alternative to cut-and-pasting the Linux
> idiomatic pagetable walking code in every place that a sequence of
> PTEs must be accessed.

As we discussed before, this obviously has a lot in common with my
walk_page_range code.

The major difference and one your above description seems to be
missing the important detail of why it's doing this:

> + pte_alloc_kernel(pmd, addr) :
> + pmd = pmd_alloc(mm, pud, addr);
> + pud = pud_alloc(mm, pgd, addr);

..which is mentioned here:

> +/*
> + * Scan a region of virtual memory, filling in page tables as necessary
> + * and calling a provided function on each leaf page table.
> + */

But I'm not sure what the use case is that wants filling in the page
table..? If both modes really make sense, perhaps a flag could unify
these differences.

> +typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long 
> addr,
> + void *data);

I'd gotten the impression that these sorts of typedefs were out of
fashion.

> +static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
> +  unsigned long addr, unsigned long end,
> +  pte_fn_t fn, void *data)
> +{
> + pte_t *pte;
> + int err;
> + struct page *pmd_page;
> + spinlock_t *ptl;
> +
> + pte = (mm == &init_mm) ?
> + pte_alloc_kernel(pmd, addr) :
> + pte_alloc_map_lock(mm, pmd, addr, &ptl);
> + if (!pte)
> + return -ENOMEM;

Seems a bit awkward to pass mm all the way down the tree just for this
quirk. Which is a bit awkward as it means that whether or not a lock
is held in the callback is context dependent.

smaps, clear_ref, and my pagemap code all use the callback at the
pmd_range level, which a) localizes the pte-level locking concerns
with the user b) amortizes the indirection overhead and c)
(unfortunately) makes the user a bit more complex.

We should try to measure whether (b) actually makes a difference.

> + do {
> + err = fn(pte, pmd_page, addr, data);
> + if (err)
> + break;
> + } while (pte++, addr += PAGE_SIZE, addr != end);

I was about to say this do/while format seems a bit non-idiomatic for
page table walkers, but then I looked at the code in mm/memory.c and
realized the stuff I've been hacking on is the odd one out.

-- 
Mathematics is the supreme nostalgia of our time.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 0/2] Updates to compat VDSOs

2007-04-04 Thread Jeremy Fitzhardinge
Hi Andi,

Here's a couple of patches to fix up COMPAT_VDSO:

The first is a straightforward implementation of Jan's original idea
of relocating the VDSO to match its mapped location.  Unlike Jan and
Zach's version, I changed it to relocate based on the phdrs rather than
the sections; the result is pleasantly compact.

The second patch takes advantage of the fact that all the COMPAT_VDSO work
happens at runtime now, and allows compat mode to be enabled dynamically.
If you specify vdso=2 on the kernel command line, it comes up in compat
mode; vdso=1 is normal vdso mode, and vdso=0 disables vdso altogether.
You can also switch modes with sysctl.

Thanks,
J

-- 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


[patch 2/2] Make COMPAT_VDSO runtime selectable.

2007-04-04 Thread Jeremy Fitzhardinge
Now that relocation of the VDSO for COMPAT_VDSO users is done at
runtime rather than compile time, it is possible to enable/disable
compat mode at runtime.

This patch allows you to enable COMPAT_VDSO mode with "vdso=2" on the
kernel command line, or via sysctl.

The COMPAT_VDSO config option still exists, but if enabled it just
makes vdso_enabled default to VDSO_COMPAT.

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: "Jan Beulich" <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Roland McGrath <[EMAIL PROTECTED]>

---
 Documentation/kernel-parameters.txt |1 
 arch/i386/kernel/sysenter.c |  131 ++-
 include/asm-i386/page.h |2 
 3 files changed, 84 insertions(+), 50 deletions(-)

===
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1807,6 +1807,7 @@ and is between 256 and 4096 characters. 
[USBHID] The interval which mice are to be polled at.
 
vdso=   [IA-32,SH]
+   vdso=2: enable compat VDSO (default with COMPAT_VDSO)
vdso=1: enable VDSO (default)
vdso=0: disable VDSO mapping
 
===
--- a/arch/i386/kernel/sysenter.c
+++ b/arch/i386/kernel/sysenter.c
@@ -24,11 +24,23 @@
 #include 
 #include 
 
+enum {
+   VDSO_DISABLED = 0,
+   VDSO_ENABLED = 1,
+   VDSO_COMPAT = 2,
+};
+
+#ifdef CONFIG_COMPAT_VDSO
+#define VDSO_DEFAULT   VDSO_COMPAT
+#else
+#define VDSO_DEFAULT   VDSO_ENABLED
+#endif
+
 /*
  * Should the kernel map a VDSO page into processes and pass its
  * address down to glibc upon exec()?
  */
-unsigned int __read_mostly vdso_enabled = 1;
+unsigned int __read_mostly vdso_enabled = VDSO_DEFAULT;
 
 EXPORT_SYMBOL_GPL(vdso_enabled);
 
@@ -43,7 +55,6 @@ __setup("vdso=", vdso_setup);
 
 extern asmlinkage void sysenter_entry(void);
 
-#ifdef CONFIG_COMPAT_VDSO
 static __cpuinit void reloc_dyn(Elf32_Ehdr *ehdr, unsigned offset)
 {
Elf32_Dyn *dyn = (void *)ehdr + offset;
@@ -85,11 +96,6 @@ static __cpuinit void relocate_vdso(Elf3
reloc_dyn(ehdr, phdr[i].p_offset);
}
 }
-#else
-static inline void relocate_vdso(Elf32_Ehdr *ehdr)
-{
-}
-#endif /* COMPAT_VDSO */
 
 void enable_sep_cpu(void)
 {
@@ -109,6 +115,25 @@ void enable_sep_cpu(void)
put_cpu();  
 }
 
+static struct vm_area_struct gate_vma;
+
+static int __cpuinit gate_vma_init(void)
+{
+   gate_vma.vm_mm = NULL;
+   gate_vma.vm_start = FIXADDR_USER_START;
+   gate_vma.vm_end = FIXADDR_USER_END;
+   gate_vma.vm_flags = VM_READ | VM_MAYREAD | VM_EXEC | VM_MAYEXEC;
+   gate_vma.vm_page_prot = __P101;
+   /*
+* Make sure the vDSO gets into every core dump.
+* Dumping its contents makes post-mortem fully interpretable later
+* without matching up the same kernel and hardware config to see
+* what PC values meant.
+*/
+   gate_vma.vm_flags |= VM_ALWAYSDUMP;
+   return 0;
+}
+
 /*
  * These symbols are defined by vsyscall.o to mark the bounds
  * of the ELF DSO images included therein.
@@ -117,6 +142,19 @@ extern const char vsyscall_sysenter_star
 extern const char vsyscall_sysenter_start, vsyscall_sysenter_end;
 static struct page *syscall_pages[1];
 
+static void map_compat_vdso(int map)
+{
+   static int vdso_mapped;
+
+   if (map == vdso_mapped)
+   return;
+
+   vdso_mapped = map;
+
+   __set_fixmap(FIX_VDSO, page_to_pfn(syscall_pages[0]) << PAGE_SHIFT,
+map ? PAGE_READONLY_EXEC : PAGE_NONE);
+}
+
 int __cpuinit sysenter_setup(void)
 {
void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
@@ -125,10 +163,9 @@ int __cpuinit sysenter_setup(void)
 
syscall_pages[0] = virt_to_page(syscall_page);
 
-#ifdef CONFIG_COMPAT_VDSO
-   __set_fixmap(FIX_VDSO, __pa(syscall_page), PAGE_READONLY_EXEC);
+   gate_vma_init();
+
printk("Compat vDSO mapped to %08lx.\n", __fix_to_virt(FIX_VDSO));
-#endif
 
if (!boot_cpu_has(X86_FEATURE_SEP)) {
vsyscall = &vsyscall_int80_start;
@@ -147,7 +184,6 @@ int __cpuinit sysenter_setup(void)
 /* Defined in vsyscall-sysenter.S */
 extern void SYSENTER_RETURN;
 
-#ifdef __HAVE_ARCH_GATE_AREA
 /* Setup a VMA at program startup for the vsyscall page */
 int arch_setup_additional_pages(struct linux_binprm *bprm, int exstack)
 {
@@ -156,33 +192,44 @@ int arch_setup_additional_pages(struct l
int ret;
 
down_write(&mm->mmap_sem);
-   addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
-   if (IS_ERR_VALUE(addr)) {
-   ret = addr;
-   goto up_fail;
-   }
-
-   /*
-* MAYWRITE to allow

[patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO

2007-04-04 Thread Jeremy Fitzhardinge
Some versions of libc can't deal with a VDSO which doesn't have its
ELF headers matching its mapped address.  COMPAT_VDSO maps the VDSO at
a specific system-wide fixed address.  Previously this was all done at
build time, on the grounds that the fixed VDSO address is always at
the top of the address space.  However, a hypervisor may reserve some
of that address space, pushing the fixmap address down.

This patch does the adjustment dynamically at runtime, depending on
the runtime location of the VDSO fixmap.

[ Patch has been through several hands: Jan Beulich wrote the orignal
  version; Zach reworked it, and Jeremy converted it to relocate phdrs
  rather than sections. ]

Signed-off-by: Jeremy Fitzhardinge <[EMAIL PROTECTED]>
Cc: Zachary Amsden <[EMAIL PROTECTED]>
Cc: "Jan Beulich" <[EMAIL PROTECTED]>
Cc: Eric W. Biederman <[EMAIL PROTECTED]>
Cc: Andi Kleen <[EMAIL PROTECTED]>
Cc: Ingo Molnar <[EMAIL PROTECTED]>
Cc: Roland McGrath <[EMAIL PROTECTED]>

---
 arch/i386/kernel/entry.S|4 -
 arch/i386/kernel/sysenter.c |   95 ---
 arch/i386/mm/pgtable.c  |6 --
 include/asm-i386/elf.h  |   28 
 include/asm-i386/fixmap.h   |8 ---
 include/linux/elf.h |3 +
 6 files changed, 95 insertions(+), 49 deletions(-)

===
--- a/arch/i386/kernel/entry.S
+++ b/arch/i386/kernel/entry.S
@@ -305,16 +305,12 @@ sysenter_past_esp:
pushl $(__USER_CS)
CFI_ADJUST_CFA_OFFSET 4
/*CFI_REL_OFFSET cs, 0*/
-#ifndef CONFIG_COMPAT_VDSO
/*
 * Push current_thread_info()->sysenter_return to the stack.
 * A tiny bit of offset fixup is necessary - 4*4 means the 4 words
 * pushed above; +8 corresponds to copy_thread's esp0 setting.
 */
pushl (TI_sysenter_return-THREAD_SIZE+8+4*4)(%esp)
-#else
-   pushl $SYSENTER_RETURN
-#endif
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET eip, 0
 
===
--- a/arch/i386/kernel/sysenter.c
+++ b/arch/i386/kernel/sysenter.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * Should the kernel map a VDSO page into processes and pass its
@@ -41,6 +42,54 @@ __setup("vdso=", vdso_setup);
 __setup("vdso=", vdso_setup);
 
 extern asmlinkage void sysenter_entry(void);
+
+#ifdef CONFIG_COMPAT_VDSO
+static __cpuinit void reloc_dyn(Elf32_Ehdr *ehdr, unsigned offset)
+{
+   Elf32_Dyn *dyn = (void *)ehdr + offset;
+
+   for(; dyn->d_tag != DT_NULL; dyn++)
+   switch(dyn->d_tag) {
+   case DT_PLTGOT:
+   case DT_HASH:
+   case DT_STRTAB:
+   case DT_SYMTAB:
+   case DT_RELA:
+   case DT_INIT:
+   case DT_FINI:
+   case DT_REL:
+   case DT_JMPREL:
+   case DT_VERSYM:
+   case DT_VERDEF:
+   case DT_VERNEED:
+   dyn->d_un.d_val += VDSO_HIGH_BASE;
+   }
+}
+
+static __cpuinit void relocate_vdso(Elf32_Ehdr *ehdr)
+{
+   Elf32_Phdr *phdr;
+   int i;
+
+   BUG_ON(memcmp(ehdr->e_ident, ELFMAG, 4) != 0 ||
+  !elf_check_arch(ehdr) ||
+  ehdr->e_type != ET_DYN);
+
+   ehdr->e_entry += VDSO_HIGH_BASE;
+
+   phdr = (void *)ehdr + ehdr->e_phoff;
+   for (i = 0; i < ehdr->e_phnum; i++) {
+   phdr[i].p_vaddr += VDSO_HIGH_BASE;
+
+   if (phdr[i].p_type == PT_DYNAMIC)
+   reloc_dyn(ehdr, phdr[i].p_offset);
+   }
+}
+#else
+static inline void relocate_vdso(Elf32_Ehdr *ehdr)
+{
+}
+#endif /* COMPAT_VDSO */
 
 void enable_sep_cpu(void)
 {
@@ -71,6 +120,9 @@ int __cpuinit sysenter_setup(void)
 int __cpuinit sysenter_setup(void)
 {
void *syscall_page = (void *)get_zeroed_page(GFP_ATOMIC);
+   const void *vsyscall;
+   size_t vsyscall_len;
+
syscall_pages[0] = virt_to_page(syscall_page);
 
 #ifdef CONFIG_COMPAT_VDSO
@@ -79,23 +131,23 @@ int __cpuinit sysenter_setup(void)
 #endif
 
if (!boot_cpu_has(X86_FEATURE_SEP)) {
-   memcpy(syscall_page,
-  &vsyscall_int80_start,
-  &vsyscall_int80_end - &vsyscall_int80_start);
-   return 0;
-   }
-
-   memcpy(syscall_page,
-  &vsyscall_sysenter_start,
-  &vsyscall_sysenter_end - &vsyscall_sysenter_start);
-
-   return 0;
-}
-
-#ifndef CONFIG_COMPAT_VDSO
+   vsyscall = &vsyscall_int80_start;
+   vsyscall_len = &vsyscall_int80_end - &vsyscall_int80_start;
+   } else {
+   vsyscall = &vsyscall_sysenter_start;
+   vsyscall_len = &vsyscall_sysenter_end - 
&vsyscall_sysenter_start;
+   }
+
+   memcpy(syscall_page, vsyscall, vsyscall_len);
+   relocate_vdso(syscall_page);
+
+   return 0;
+}
+
 /* Defined in vsyscall-sysenter.

Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread H. Peter Anvin
Rusty Russell wrote:
> On Wed, 2007-04-04 at 23:14 -0400, Steven Rostedt wrote:
>> On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:
>>
 (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
 for %#x (or 0x%x)?  I'm easily confused.
>>> How about "%p" for pointers?
>> But that would require casting the numbers to pointers.
> 
> And the kernel's printk doesn't put 0x on pointers anyway, last I
> checked 8(
> 

That's really the bug.  Let's fix it.

-hpa
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [PATCH] Lguest32 print hex on bad reads and writes

2007-04-04 Thread H. Peter Anvin
H. Peter Anvin wrote:
> Rusty Russell wrote:
>> On Wed, 2007-04-04 at 23:14 -0400, Steven Rostedt wrote:
>>> On Wed, 2007-04-04 at 23:06 -0400, Kyle Moffett wrote:
>>>
> (Erk, I wonder what I was thinking when I wrote that?) Can I ask  
> for %#x (or 0x%x)?  I'm easily confused.
 How about "%p" for pointers?
>>> But that would require casting the numbers to pointers.
>>
>> And the kernel's printk doesn't put 0x on pointers anyway, last I
>> checked 8(
>>
> 
> That's really the bug.  Let's fix it.
> 
> -hpa

Okay, git tree at:

http://git.kernel.org/?p=linux/kernel/git/hpa/linux-2.6-printk.git;a=summary
git://git.kernel.org/pub/scm/linux/kernel/git/hpa/linux-2.6-printk.git

... will do actual testing and post it to LKML tomorrow.

-hpa
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO

2007-04-04 Thread Roland McGrath
The patch looks nice and clean.  However, it does not relocate the symbol
table(s) values.  I thought that was done in an earlier version of this I
saw, but I might be misremembering.  Though not fatal, this is a regression
from the previous CONFIG_COMPAT_VDSO behavior.  It will show up in things
like __kernel_* name display in backtraces.  If with your other patch
CONFIG_COMPAT_VDSO will become other than a rarely-used compatibility
option, then this should be fixed.  Note that with your second patch this
will also break the symbol values in the randomly-located vma vdso;
non-ancient glibc doesn't care if the vdso isn't mapped where its phdrs
say, but everything does still care that the symbol tables in an ELF file
use addresses matching the phdrs in the same file.


Thanks,
Roland
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 1/2] Relocate VDSO ELF headers to match mapped location with COMPAT_VDSO

2007-04-04 Thread Jeremy Fitzhardinge
Roland McGrath wrote:
> The patch looks nice and clean.  However, it does not relocate the symbol
> table(s) values.  I thought that was done in an earlier version of this I
> saw, but I might be misremembering.  Though not fatal, this is a regression
> from the previous CONFIG_COMPAT_VDSO behavior.  It will show up in things
> like __kernel_* name display in backtraces.

Hm, OK.  It does, but I wasn't sure if it would matter.  It should be
fairly simple to fix up.

>   If with your other patch
> CONFIG_COMPAT_VDSO will become other than a rarely-used compatibility
> option, then this should be fixed.  Note that with your second patch this
> will also break the symbol values in the randomly-located vma vdso;
> non-ancient glibc doesn't care if the vdso isn't mapped where its phdrs
> say, but everything does still care that the symbol tables in an ELF file
> use addresses matching the phdrs in the same file.
>   

I did the second patch because I could, and to see if it would provoke
some comment.  But effectively removing a kernel config option seems
like a good idea to me.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization


Re: [patch 20/20] Add apply_to_page_range() which applies a function to a pte range.

2007-04-04 Thread Jeremy Fitzhardinge
Matt Mackall wrote:
>> +/*
>> + * Scan a region of virtual memory, filling in page tables as necessary
>> + * and calling a provided function on each leaf page table.
>> + */
>> 
>
> But I'm not sure what the use case is that wants filling in the page
> table..? If both modes really make sense, perhaps a flag could unify
> these differences.
>   

Well, two reasons:

One is the general one that if you're traversing ptes then they need to
exist to traverse them (for example, if you're creating new mappings). 
Obviously if you want to just visit existing mappings, then
instantiating new pagetable is not the right thing to do (and I could
make use of this too).

The other is that there are various places in the Xen hypervisor API
where you pass in a reference to pte entry for the hypervisor to put
mappings into, and the rest of the pagetable needs to exist.  The Xen
code uses the side-effect of apply_to_page_range() to create pagetable
for these calls.

>> +typedef int (*pte_fn_t)(pte_t *pte, struct page *pmd_page, unsigned long 
>> addr,
>> +void *data);
>> 
>
> I'd gotten the impression that these sorts of typedefs were out of
> fashion.
>   

In general yes, but for function pointers the syntax is so clumsy that I
think typedefs are OK.

>> +static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
>> + unsigned long addr, unsigned long end,
>> + pte_fn_t fn, void *data)
>> +{
>> +pte_t *pte;
>> +int err;
>> +struct page *pmd_page;
>> +spinlock_t *ptl;
>> +
>> +pte = (mm == &init_mm) ?
>> +pte_alloc_kernel(pmd, addr) :
>> +pte_alloc_map_lock(mm, pmd, addr, &ptl);
>> +if (!pte)
>> +return -ENOMEM;
>> 
>
> Seems a bit awkward to pass mm all the way down the tree just for this
> quirk. Which is a bit awkward as it means that whether or not a lock
> is held in the callback is context dependent.
>   

Well, it would need mm for just pte_alloc_map_lock() anyway.

> smaps, clear_ref, and my pagemap code all use the callback at the
> pmd_range level, which a) localizes the pte-level locking concerns
> with the user b) amortizes the indirection overhead and c)
> (unfortunately) makes the user a bit more complex.
>
> We should try to measure whether (b) actually makes a difference.
>   

I'll need to look closely at your code again.  It would be nice to have
one pagewalker.

J
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/virtualization