[kvm-devel] [PATCH/RFC 1/9] s390 host memory management changes.
From: Heiko Carstens [EMAIL PROTECTED]

Add changes to s390 memory management which are necessary to use the s390
hardware assisted virtualization facility. For this the upper half of each
page table needs to be reserved so the hardware can save extended page
status bits for the guest and the host. The easy solution is to just change
PTRS_PER_PTE and PTRS_PER_PMD accordingly, so the upper halves of the pages
that contain page tables are unused and can be used by the hardware.

Unfortunately with these #ifdef changes we need twice as much memory for the
page tables of every process, even for those which don't need to save
extended status bits. Maybe a better solution would be to make PTRS_PER_PTE
and PTRS_PER_PMD a per-process value and only double the size of the page
tables if the process wants to make use of the virtualization instruction.

Signed-off-by: Heiko Carstens [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]
---
 include/asm-s390/page.h    |    8 +
 include/asm-s390/pgalloc.h |    5 +
 include/asm-s390/pgtable.h |  197 -
 3 files changed, 209 insertions(+), 1 deletion(-)

Index: linux-2.6.21/include/asm-s390/pgtable.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/pgtable.h
+++ linux-2.6.21/include/asm-s390/pgtable.h
@@ -65,7 +65,11 @@ extern char empty_zero_page[PAGE_SIZE];
 # define PMD_SHIFT	22
 # define PGDIR_SHIFT	22
 #else /* __s390x__ */
+#ifdef CONFIG_S390_HOST
+# define PMD_SHIFT	20
+#else
 # define PMD_SHIFT	21
+#endif
 # define PGDIR_SHIFT	31
 #endif /* __s390x__ */

@@ -85,8 +89,13 @@ extern char empty_zero_page[PAGE_SIZE];
 # define PTRS_PER_PMD	1
 # define PTRS_PER_PGD	512
 #else /* __s390x__ */
+#ifdef CONFIG_S390_HOST
+# define PTRS_PER_PTE	256
+# define PTRS_PER_PMD	2048
+#else
 # define PTRS_PER_PTE	512
 # define PTRS_PER_PMD	1024
+#endif
 # define PTRS_PER_PGD	2048
 #endif /* __s390x__ */

@@ -217,6 +226,18 @@ extern unsigned long vmalloc_end;
 #define _PAGE_SWT	0x001		/* SW pte type bit t */
 #define _PAGE_SWX	0x002		/* SW pte type bit x */

+#ifdef CONFIG_S390_HOST
+#define _PAGE_SOFT_REFERENCED	0x4
+#define _PAGE_SOFT_CHANGED	0x8
+
+/* Page status extended */
+#define _PAGE_RCP_PCL	0x0080UL
+#define _PAGE_RCP_HR	0x0040UL
+#define _PAGE_RCP_HC	0x0020UL
+#define _PAGE_RCP_GR	0x0004UL
+#define _PAGE_RCP_GC	0x0002UL
+#endif
+
 /* Six different types of pages. */
 #define _PAGE_TYPE_EMPTY	0x400
 #define _PAGE_TYPE_NONE		0x401
@@ -514,6 +535,9 @@ static inline int pte_write(pte_t pte)

 static inline int pte_dirty(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+	return (pte_val(pte) & _PAGE_SOFT_CHANGED) != 0;
+#endif
 	/* A pte is neither clean nor dirty on s/390. The dirty bit
 	 * is in the storage key. See page_test_and_clear_dirty for
 	 * details.
@@ -523,6 +547,9 @@ static inline int pte_dirty(pte_t pte)

 static inline int pte_young(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+	return (pte_val(pte) & _PAGE_SOFT_REFERENCED) != 0;
+#endif
 	/* A pte is neither young nor old on s/390. The young bit
 	 * is in the storage key. See page_test_and_clear_young for
 	 * details.
@@ -582,7 +609,9 @@ static inline void pgd_clear(pgd_t * pgd
 static inline void pmd_clear_kernel(pmd_t * pmdp)
 {
 	pmd_val(*pmdp) = _PMD_ENTRY_INV | _PMD_ENTRY;
+#ifndef CONFIG_S390_HOST
 	pmd_val1(*pmdp) = _PMD_ENTRY_INV | _PMD_ENTRY;
+#endif
 }

 static inline void pmd_clear(pmd_t * pmdp)
@@ -632,6 +661,9 @@ static inline pte_t pte_mkwrite(pte_t pt

 static inline pte_t pte_mkclean(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+	pte_val(pte) &= ~_PAGE_SOFT_CHANGED;
+#endif
 	/* The only user of pte_mkclean is the fork() code.
 	   We must *not* clear the *physical* page dirty bit
 	   just because fork() wants to clear the dirty bit in
@@ -641,6 +673,9 @@ static inline pte_t pte_mkclean(pte_t pt

 static inline pte_t pte_mkdirty(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+	pte_val(pte) |= _PAGE_SOFT_CHANGED;
+#endif
 	/* We do not explicitly set the dirty bit because the
 	 * sske instruction is slow. It is faster to let the
 	 * next instruction set the dirty bit.
@@ -650,6 +685,9 @@ static inline pte_t pte_mkdirty(pte_t pt

 static inline pte_t pte_mkold(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+	pte_val(pte) &= ~_PAGE_SOFT_REFERENCED;
+#endif
 	/* S/390 doesn't keep its dirty/referenced bit in the pte.
 	 * There is no point in clearing the real referenced bit.
 	 */
@@ -658,14 +696,111 @@ static inline pte_t pte_mkold(pte_t pte)

 static inline pte_t pte_mkyoung(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+	pte_val(pte) |= _PAGE_SOFT_REFERENCED;
+#endif
 	/* S/390 doesn't keep its dirty/referenced bit in the pte.
 	 *
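
The PMD_SHIFT and PTRS_PER_* values in the patch above hang together arithmetically, and a small check makes the "upper half is free" claim concrete. The following is only a sketch: it assumes 4 KiB pages and 8-byte page table entries on 64-bit s390, and the struct and helper names are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12                  /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PTE_SIZE   8UL                 /* assumption: 8-byte entries on 64 bit */

/* Hypothetical container for the two values the patch changes. */
struct pt_geometry {
    unsigned long ptrs_per_pte;
    unsigned long ptrs_per_pmd;
};

/* Numbers copied from the pgtable.h hunk (64-bit case). */
static const struct pt_geometry geometry_default = { 512, 1024 };
static const struct pt_geometry geometry_host    = { 256, 2048 };

/* Base-2 logarithm of a power of two. */
static unsigned int ilog2_ul(unsigned long v)
{
    unsigned int bits = 0;

    assert(v != 0 && (v & (v - 1)) == 0);
    while (v > 1) {
        v >>= 1;
        bits++;
    }
    return bits;
}

/* One pmd entry covers ptrs_per_pte pages. */
static unsigned int pmd_shift(const struct pt_geometry *g)
{
    return PAGE_SHIFT + ilog2_ul(g->ptrs_per_pte);
}

/* One pgd entry covers ptrs_per_pmd pmd entries. */
static unsigned int pgdir_shift(const struct pt_geometry *g)
{
    return pmd_shift(g) + ilog2_ul(g->ptrs_per_pmd);
}

/* Bytes of a page actually occupied by pte entries. */
static unsigned long pte_table_bytes(const struct pt_geometry *g)
{
    return g->ptrs_per_pte * PTE_SIZE;
}
```

Under these assumptions halving PTRS_PER_PTE moves PMD_SHIFT from 21 to 20, doubling PTRS_PER_PMD keeps PGDIR_SHIFT at 31, and the pte entries fill only 2048 of the 4096 bytes in a page table page.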
[kvm-devel] [PATCH/RFC 2/9] s390 virtualization interface
From: Heiko Carstens [EMAIL PROTECTED]

Add interface which allows a process to start a virtual machine. To keep
things easy each thread group is allowed to have only one virtual machine
and each thread of the thread group can only control one virtual cpu of the
virtual machine. All the information about the virtual machines/cpus can be
found via the thread_info structures of the participating threads.

This patch adds three new s390 specific system calls:

long sys_s390host_add_cpu(unsigned long addr, unsigned long flags,
			  struct sie_block __user *sie_template)

Adds a new cpu to the virtual machine that belongs to the current thread
group. If no virtual machine exists it will be created. In addition two
pages will be allocated and mapped at addr into the address space of the
process. These two pages are used so user space and kernel space can easily
exchange/modify the state of the corresponding virtual cpu without a ton of
copy_from/to_user calls. The sie_template is a pointer to a data structure
that contains initial information on how the virtual cpu should be set up.
The resulting block will be used as a parameter to issue the sie (start
interpretive execution) instruction which starts a virtual cpu.

int sys_s390host_remove_cpu(void)

Removes a virtual cpu from a virtual machine.

int sys_s390host_sie(unsigned long action)

Starts / re-enters the virtual cpu of the virtual machine that the thread
belongs to, if any.

Please note that this patch is nothing more than a proof-of-concept and may
contain quite a few bugs. Since we want to convert to use kvm instead, most
of this will be dropped anyway. But maybe this is of interest for others as
well.

Signed-off-by: Heiko Carstens [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]
---
 arch/s390/Kconfig               |    7
 arch/s390/Makefile              |    2
 arch/s390/host/Makefile         |    5
 arch/s390/host/s390_intercept.c |   42
 arch/s390/host/s390host.c       |  418
 arch/s390/host/s390host.h       |   16 +
 arch/s390/host/sie64a.S         |   38 +++
 arch/s390/kernel/asm-offsets.c  |    2
 arch/s390/kernel/process.c      |   15 +
 arch/s390/kernel/setup.c        |    4
 arch/s390/kernel/syscalls.S     |    3
 include/asm-s390/sie64.h        |  279 ++
 include/asm-s390/thread_info.h  |    8
 include/asm-s390/unistd.h       |    5
 kernel/sys_ni.c                 |    3
 15 files changed, 842 insertions(+), 5 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/asm-offsets.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/asm-offsets.c
+++ linux-2.6.21/arch/s390/kernel/asm-offsets.c
@@ -44,5 +44,7 @@ int main(void)
 	DEFINE(__SF_BACKCHAIN, offsetof(struct stack_frame, back_chain),);
 	DEFINE(__SF_GPRS, offsetof(struct stack_frame, gprs),);
 	DEFINE(__SF_EMPTY, offsetof(struct stack_frame, empty1),);
+	BLANK();
+	DEFINE(__SIE_USER_gprs, offsetof(struct sie_user, gprs),);
 	return 0;
 }
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,6 @@ NI_SYSCALL /* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(sys_ni_syscall,sys_s390host_add_cpu,sys_ni_syscall)
+SYSCALL(sys_ni_syscall,sys_s390host_remove_cpu,sys_ni_syscall)
+SYSCALL(sys_ni_syscall,sys_s390host_sie,sys_ni_syscall)
Index: linux-2.6.21/arch/s390/host/Makefile
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the s390host components.
+#
+
+obj-$(CONFIG_S390_HOST) += s390host.o sie64a.o s390_intercept.o
Index: linux-2.6.21/arch/s390/host/sie64a.S
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/sie64a.S
@@ -0,0 +1,38 @@
+/*
+ * arch/s390/host/sie64a.S
+ *    low level sie call
+ *
+ *    Copyright IBM Corp. 2007
+ *    Author(s): Heiko Carstens [EMAIL PROTECTED]
+ *    License  : GPL
+ */
+
+#include <linux/errno.h>
+#include <asm/asm-offsets.h>
+
+SP_R6	=	6 * 8	# offset into stackframe
+
+	.globl	sie64a
+sie64a:
+	stmg	%r6,%r15,SP_R6(%r15)	# save register on entry
+	lgr	%r14,%r2		# pointer to program parms
+	aghi	%r2,4096
+	lmg	%r0,%r13,__SIE_USER_gprs(%r2)	# load guest gprs 0-13
+sie_inst:
+	sie	0(%r14)
+	aghi	%r14,4096
+	stmg	%r0,%r13,__SIE_USER_gprs(%r14)	# save guest gprs 0-13
+	lghi	%r2,0
+
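
The sie64a stub above implies a simple two-page layout: the sie instruction operates on the first page, and the guest registers are found 4096 bytes further on, at the offset exported as __SIE_USER_gprs. The sketch below mirrors only that address arithmetic. The struct definitions are hypothetical stand-ins (the real struct sie_block and struct sie_user live in include/asm-s390/sie64.h, which is not shown in this mail), and placing gprs at the start of the second page is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIE_PAGE_SIZE 4096

/* Hypothetical stand-in for the page holding the sie control block. */
struct sie_block_page {
    uint8_t control[SIE_PAGE_SIZE];
};

/* Hypothetical stand-in for the page shared with user space. */
struct sie_user_page {
    uint64_t gprs[16];  /* assumed position; sie64a moves gprs 0-13 */
    uint8_t  pad[SIE_PAGE_SIZE - 16 * sizeof(uint64_t)];
};

/* The two pages back to back, as the stub's "aghi ...,4096" suggests. */
struct sie_area {
    struct sie_block_page block;  /* operand of "sie 0(%r14)" */
    struct sie_user_page  user;   /* base + 4096 */
};

/* Same computation as the stub: base + 4096 + offset of gprs. */
static uint64_t *guest_gprs(struct sie_area *area)
{
    uint8_t *base = (uint8_t *)area;

    assert(area != NULL);
    return (uint64_t *)(base + SIE_PAGE_SIZE +
                        offsetof(struct sie_user_page, gprs));
}
```

With this layout a single pointer is enough for the assembly stub to reach both the control block and the register save area, which matches the stub taking only one argument in %r2.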
[kvm-devel] [PATCH/RFC 3/9] s390 guest detection
From: Christian Borntraeger [EMAIL PROTECTED]

This patch adds functionality to detect if the kernel runs under an s390host
hypervisor. A macro MACHINE_IS_GUEST is exported for device drivers. This
allows drivers to skip device detection if the system runs non-virtualized.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]
---
 arch/s390/kernel/early.c |    4
 arch/s390/kernel/setup.c |    9 ++---
 include/asm-s390/setup.h |    1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/setup.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/setup.c
+++ linux-2.6.21/arch/s390/kernel/setup.c
@@ -744,9 +744,12 @@ setup_arch(char **cmdline_p)
 	       "This machine has an IEEE fpu\n" :
 	       "This machine has no IEEE fpu\n");
 #else /* CONFIG_64BIT */
-	printk((MACHINE_IS_VM) ?
-	       "We are running under VM (64 bit mode)\n" :
-	       "We are running native (64 bit mode)\n");
+	if (MACHINE_IS_VM)
+		printk("We are running under VM (64 bit mode)\n");
+	else if (MACHINE_IS_GUEST)
+		printk("We are running on a non z/VM host\n");
+	else
+		printk("We are running native (64 bit mode)\n");
 #endif /* CONFIG_64BIT */

 	/* Save unparsed command line copy for /proc/cmdline */
Index: linux-2.6.21/include/asm-s390/setup.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/setup.h
+++ linux-2.6.21/include/asm-s390/setup.h
@@ -61,6 +61,7 @@ extern unsigned long machine_flags;

 #define MACHINE_IS_VM		(machine_flags & 1)
 #define MACHINE_IS_P390		(machine_flags & 4)
 #define MACHINE_HAS_MVPG	(machine_flags & 16)
+#define MACHINE_IS_GUEST	(machine_flags & 64)
 #define MACHINE_HAS_IDTE	(machine_flags & 128)
 #define MACHINE_HAS_DIAG9C	(machine_flags & 256)
Index: linux-2.6.21/arch/s390/kernel/early.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/early.c
+++ linux-2.6.21/arch/s390/kernel/early.c
@@ -139,6 +139,10 @@ static noinline __init void detect_machi
 	/* Running on a P/390 ? */
 	if (cpuinfo->cpu_id.machine == 0x7490)
 		machine_flags |= 4;
+
+	/* Running under a host ? */
+	if (cpuinfo->cpu_id.version == 0xfe)
+		machine_flags |= 64;
 }

 #ifdef CONFIG_64BIT

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
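
The detection logic in this patch boils down to one comparison and one flag bit. The following user-space sketch reproduces it so the behaviour can be checked in isolation. The flag values and the 0xfe / 0x7490 constants come from the hunks above; the struct and function names are hypothetical, and the trimmed-down cpu id structure is an assumption rather than the kernel's real layout.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as used in the setup.h and early.c hunks. */
#define MACHINE_FLAG_VM    1UL
#define MACHINE_FLAG_P390  4UL
#define MACHINE_FLAG_GUEST 64UL

/* Hypothetical, reduced stand-in for the kernel's cpu id fields. */
struct cpu_id_sketch {
    unsigned int version;
    unsigned int machine;
};

/* Mirrors the two checks visible in detect_machine_type(). */
static unsigned long detect_machine_flags(const struct cpu_id_sketch *id)
{
    unsigned long flags = 0;

    assert(id != NULL);
    if (id->machine == 0x7490)   /* running on a P/390 */
        flags |= MACHINE_FLAG_P390;
    if (id->version == 0xfe)     /* running under an s390host hypervisor */
        flags |= MACHINE_FLAG_GUEST;
    return flags;
}

/* Equivalent of the MACHINE_IS_GUEST macro, on an explicit flags word. */
static int machine_is_guest(unsigned long flags)
{
    return (flags & MACHINE_FLAG_GUEST) != 0;
}
```

A driver init routine would then bail out early when machine_is_guest() is false, which is how the console driver in patch 5/9 uses MACHINE_IS_GUEST.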
[kvm-devel] [PATCH/RFC 5/9] s390 virtual console for guests
From: Carsten Otte [EMAIL PROTECTED]

This driver provides a simple virtualized console. Userspace can use
read/write to its console to pass the data to the host.

Signed-off-by: Carsten Otte [EMAIL PROTECTED]
---
 drivers/s390/Kconfig               |    5 +
 drivers/s390/guest/Makefile        |    1
 drivers/s390/guest/guest_console.c |   72 +
 drivers/s390/guest/guest_console.h |   47 +++
 drivers/s390/guest/guest_tty.c     |  153 +
 5 files changed, 278 insertions(+)

Index: linux-2.6.21/drivers/s390/guest/guest_console.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_console.c
@@ -0,0 +1,72 @@
+/*
+ * guest console device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte [EMAIL PROTECTED]
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/console.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include "guest_console.h"
+
+#define guest_console_major 4 /* TTYAUX_MAJOR */
+#define guest_console_minor 65
+#define guest_console_name "ttyS"
+
+static void guest_console_write(struct console *console, const char *string,
+				unsigned len)
+{
+	int ret;
+	size_t pos;
+
+	for(pos=0; pos < strlen(string); pos += ret) {
+		ret = diag_write(1, string + pos, len - pos);
+		if (ret <= 0)
+			break;
+	}
+}
+
+static struct tty_driver *
+guest_console_device(struct console *c, int *index)
+{
+	*index = c->index;
+	return guest_tty_driver;
+}
+
+static void
+guest_console_unblank(void)
+{
+	return;
+}
+
+static struct console guest_console =
+{
+	.name = guest_console_name,
+	.write = guest_console_write,
+	.device = guest_console_device,
+	.unblank = guest_console_unblank,
+	.flags = CON_PRINTBUFFER,
+	.index = 0 /* ttyS0 */
+};
+
+/*
+ * called by console_init() in drivers/char/tty_io.c at boot-time.
+ */
+static int __init
+guest_console_init(void)
+{
+	if (!MACHINE_IS_GUEST)
+		return 0;
+
+	printk (KERN_INFO "z/Live console initialized\n");
+
+	/* enable printk-access to this driver */
+	register_console(&guest_console);
+	return 0;
+}
+
+console_initcall(guest_console_init);
+
Index: linux-2.6.21/drivers/s390/guest/guest_console.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_console.h
@@ -0,0 +1,47 @@
+/*
+ * guest console device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte [EMAIL PROTECTED]
+ */
+
+
+#ifndef __GCONSOLE_H
+#define __GCONSOLE_H
+extern struct tty_driver *guest_tty_driver;
+static inline int diag_write(int fd, const void *buffer, size_t count)
+{
+	register long __arg1 asm("2") = fd;
+	register const void * __arg2 asm("3") = buffer;
+	register size_t __arg3 asm("4") = count;
+	register long __svcres asm("2");
+	long __res;
+	asm volatile (
+		"diag 0,0,2"
+		: "=d" (__svcres)
+		: "0" (__arg1),
+		  "d" (__arg2),
+		  "d" (__arg3)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+
+static inline int diag_read(int fd, const void *buffer, size_t count)
+{
+	register long __arg1 asm("2") = fd;
+	register const void * __arg2 asm("3") = buffer;
+	register size_t __arg3 asm("4") = count;
+	register long __svcres asm("2");
+	long __res;
+	asm volatile (
+		"diag 0,0,1"
+		: "=d" (__svcres)
+		: "0" (__arg1),
+		  "d" (__arg2),
+		  "d" (__arg3)
+		: "cc", "memory");
+	__res = __svcres;
+	return __res;
+}
+#endif
+
Index: linux-2.6.21/drivers/s390/guest/guest_tty.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_tty.c
@@ -0,0 +1,153 @@
+/*
+ * guest console tty device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte [EMAIL PROTECTED]
+ */
+
+#include <linux/fs.h>
+#include <linux/tty.h>
+#include <linux/tty_flip.h>
+#include <linux/module.h>
+#include <asm/s390_ext.h>
+#include "guest_console.h"
+
+struct tty_driver *guest_tty_driver;
+static struct tty_struct *guest_tty;
+
+MODULE_DESCRIPTION("Guest console for linux guests");
+MODULE_AUTHOR("Carsten Otte [EMAIL PROTECTED]");
+MODULE_LICENSE("GPL");
+
+static int
+guest_tty_open(struct tty_struct *tty, struct file *filp)
+{
+	guest_tty = tty;
+	tty->driver_data = NULL;
+	return 0;
+}
+
+static void
+guest_tty_close(struct tty_struct *tty, struct file *filp)
+{
+	if (tty->count > 1)
+		return;
+	guest_tty = NULL;
+}
+
+static int
+guest_tty_ioctl(struct tty_struct *tty, struct file * file,
+		unsigned int cmd, unsigned long arg)
+{
+
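
The write loop in guest_console_write() is a classic "retry until everything is written" loop around a call that may accept only part of the buffer. The sketch below shows the same pattern in user space. It is not the driver code: mock_diag_write() is a hypothetical stand-in for the diag-based host call, and the loop is bounded by the explicit length argument instead of strlen(), on the assumption that the caller's length is authoritative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the host call: accepts at most CHUNK bytes
 * per call and appends them to a capture buffer. */
#define CHUNK 4
static char   host_buf[64];
static size_t host_len;

static int mock_diag_write(const char *buf, size_t count)
{
    size_t n = count < CHUNK ? count : CHUNK;

    if (host_len + n > sizeof(host_buf))
        return -1;               /* host side refuses further data */
    memcpy(host_buf + host_len, buf, n);
    host_len += n;
    return (int)n;
}

/* Write len bytes, tolerating short writes; returns bytes written. */
static size_t console_write_all(const char *string, size_t len)
{
    size_t pos = 0;

    assert(string != NULL);
    while (pos < len) {
        int ret = mock_diag_write(string + pos, len - pos);

        if (ret <= 0)            /* error or no progress: give up */
            break;
        pos += (size_t)ret;
    }
    return pos;
}
```

Bounding the loop by len rather than strlen(string) means the helper also works for buffers that are not NUL-terminated, and the two limits can never disagree.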
[kvm-devel] [PATCH/RFC 9/9] Fix system-user misaccount of interpreted execution
From: Christian Borntraeger [EMAIL PROTECTED]

This patch fixes the accounting of guest cpu time. As sie is executed via a
system call, all guest operations were accounted as system time. To fix this
we define a per thread sie context. Before issuing the sie instruction we
enter this context and leave the context afterwards. enter_sie and exit_sie
call account_system_vtime, which now checks for being in sie context. We
define the sie context to be accounted as user time.

Possible future enhancement: We could add an additional field,
interpretation time, to cpu stat and process time. Thus we could
differentiate between user time in the host and host user time spent for
guests. The main challenge is the necessary user space change. Therefore, we
could export the interpretation time with a new interface. To be defined.

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]
---
 arch/s390/Kconfig              |    1 +
 arch/s390/host/s390host.c      |   15 +++
 arch/s390/kernel/process.c     |    1 +
 arch/s390/kernel/vtime.c       |   11 ++-
 include/asm-s390/thread_info.h |    2 ++
 5 files changed, 29 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/vtime.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/vtime.c
+++ linux-2.6.21/arch/s390/kernel/vtime.c
@@ -97,6 +97,11 @@ void account_vtime(struct task_struct *t
 	account_system_time(tsk, 0, cputime);
 }

+static inline int task_is_in_sie(struct thread_info *thread)
+{
+	return thread->in_sie;
+}
+
 /*
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer & steal_clock.
@@ -114,7 +119,11 @@ void account_system_vtime(struct task_st
 	cputime =  S390_lowcore.system_timer >> 12;
 	S390_lowcore.system_timer -= cputime << 12;
 	S390_lowcore.steal_clock -= cputime << 12;
-	account_system_time(tsk, 0, cputime);
+
+	if (task_is_in_sie(tsk->thread_info) && !hardirq_count() && !softirq_count())
+		account_user_time(tsk, cputime);
+	else
+		account_system_time(tsk, 0, cputime);
 }

 static inline void set_vtimer(__u64 expires)
Index: linux-2.6.21/arch/s390/host/s390host.c
===================================================================
--- linux-2.6.21.orig/arch/s390/host/s390host.c
+++ linux-2.6.21/arch/s390/host/s390host.c
@@ -27,6 +27,19 @@ static int s390host_do_action(unsigned l

 static DEFINE_MUTEX(s390host_init_mutex);

+static void enter_sie(void)
+{
+	account_system_vtime(current);
+	current_thread_info()->in_sie = 1;
+}
+
+static void exit_sie(void)
+{
+	account_system_vtime(current);
+	current_thread_info()->in_sie = 0;
+}
+
+
 static void s390host_get_data(struct s390host_data *data)
 {
 	atomic_inc(&data->count);
@@ -297,7 +310,9 @@ again:
 		schedule();

 	sie_kernel->sie_block.icptcode = 0;
+	enter_sie();
 	ret = sie64a(sie_kernel);
+	exit_sie();
 	if (ret)
 		goto out;

Index: linux-2.6.21/include/asm-s390/thread_info.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/thread_info.h
+++ linux-2.6.21/include/asm-s390/thread_info.h
@@ -55,6 +55,7 @@ struct thread_info {
 	struct restart_block	restart_block;
 	struct s390host_data	*s390host_data;	/* s390host data */
 	int			sie_cpu;	/* sie cpu number */
+	int			in_sie;		/* 1 = cpu is in sie */
 };

 /*
@@ -72,6 +73,7 @@ struct thread_info {
 	},					\
 	.s390host_data	= NULL,			\
 	.sie_cpu	= 0,			\
+	.in_sie		= 0,			\
 }

 #define init_thread_info	(init_thread_union.thread_info)
Index: linux-2.6.21/arch/s390/kernel/process.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/process.c
+++ linux-2.6.21/arch/s390/kernel/process.c
@@ -278,6 +278,7 @@ int copy_thread(int nr, unsigned long cl
 	memset(&p->thread.per_info,0,sizeof(p->thread.per_info));
 	p->thread_info->s390host_data = NULL;
 	p->thread_info->sie_cpu = -1;
+	p->thread_info->in_sie = 0;
 	return 0;
 }

Index: linux-2.6.21/arch/s390/Kconfig
===================================================================
--- linux-2.6.21.orig/arch/s390/Kconfig
+++ linux-2.6.21/arch/s390/Kconfig
@@ -519,6 +519,7 @@ config S390_HOST
 	bool "s390 host support (EXPERIMENTAL)"
 	depends on 64BIT && EXPERIMENTAL
 	select S390_SWITCH_AMODE
+	select VIRT_CPU_ACCOUNTING
 	help
 	  Select this option if you want to host guest Linux images

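
The ordering inside enter_sie() and exit_sie() matters: each first flushes the time accumulated so far and only then flips the flag, so host time before the sie call lands in system time and the time spent in the guest lands in user time. The model below replays that ordering in plain C. It is a simplified sketch, not kernel code: struct task_times and account_elapsed() are hypothetical, elapsed time is passed in explicitly instead of being read from the lowcore timers, and a single boolean stands in for the hardirq/softirq checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task counters plus the in_sie flag from the patch. */
struct task_times {
    unsigned long long utime;  /* accounted as user time */
    unsigned long long stime;  /* accounted as system time */
    bool in_sie;
};

/* Same decision as the patched account_system_vtime(). */
static void account_elapsed(struct task_times *t,
                            unsigned long long cputime, bool in_interrupt)
{
    assert(t != NULL);
    if (t->in_sie && !in_interrupt)
        t->utime += cputime;
    else
        t->stime += cputime;
}

/* Flush host time first, then mark the task as running the guest. */
static void enter_sie(struct task_times *t, unsigned long long elapsed)
{
    account_elapsed(t, elapsed, false);
    t->in_sie = true;
}

/* Flush guest time while the flag is still set, then clear it. */
static void exit_sie(struct task_times *t, unsigned long long elapsed)
{
    account_elapsed(t, elapsed, false);
    t->in_sie = false;
}
```

In this model, interrupt time that arrives while the flag is set is still charged as system time, which corresponds to the hardirq_count() and softirq_count() checks in the patch.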
Re: [kvm-devel] [PATCH/RFC 5/9] s390 virtual console for guests
I think it would be better to use hvc_console as Xen now uses it too.

Carsten Otte wrote:
> +	if (!MACHINE_IS_GUEST)
> +		return 0;
> +	register_external_interrupt(0x1234, guest_tty_ext_handler);

This is an interesting way to get input data from the console :-)

How many interrupts does s390 support (the x86 only supports 256)? Can you
afford to burn interrupts like this? Is there not a better way to assign
interrupts such that conflict isn't an issue?

Regards,

Anthony Liguori
Re: [kvm-devel] [PATCH/RFC 5/9] s390 virtual console for guests
On Friday 11 May 2007 21:00, Anthony Liguori wrote:
> I think it would be better to use hvc_console as Xen now uses it too.

I don't know hvc_console, but I will have a look at it.

> Carsten Otte wrote:
> > +	if (!MACHINE_IS_GUEST)
> > +		return 0;
> > +	register_external_interrupt(0x1234, guest_tty_ext_handler);
>
> This is an interesting way to get input data from the console :-)
>
> How many interrupts does s390 support (the x86 only supports 256)? Can
> you afford to burn interrupts like this? Is there not a better way to
> assign interrupts such that conflict isn't an issue?

On s390 we have a 16 bit interrupt code, so we actually have plenty of
numbers... But yes, it's a very good point: burning interrupts won't work
cross-platform. Our patches are prototypes and need rework anyway. Take
these patches as a discussion contribution in the spirit of "release
early". :-)

cheers
Christian
Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device driver
Let me ask what may seem to be a naive question to the linux world. I see
you are doing a lot of solid work on adding block and network devices. The
code for block and network devices is implemented in different ways. I've
also seen this difference of interface/implementation on Xen.

Hence my question: Why are the INTERFACES to the block and network devices
different? I can understand that the implementation -- what goes on inside
the box -- would be different. But, again, why is the interface to the
resource different in each case? Will every distinct type of I/O device end
up with a different interface?

These questions doubtless seem naive, I suppose, except I use a system
(Plan 9) in which a common interface is in fact used for the different
resources. I have been hoping that we could bring this model -- same
interface, different resource -- to the inter-vm communications. I would
like to at least raise the idea that it could be used on KVM.

Avoiding too much detail, in the plan 9 world, read and write of data to a
disk is via file read and write system calls. Same for a network. Same for
the mouse, the window system, the serial port, the console, USB, and so on.
Please see this note from IBM on what is possible:
http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
or
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

Different resources, same interface.

In the hypervisor world, you build one shared memory queue as a basic
abstraction. On top of that queue, you run 9P. The provider (network, block
device, etc.) provides certain resources to you, the guest domain. The
resources have names.

A network can look like this, to a kvm guest (this command from a Plan 9
system):

cpu% ls /net/ether0
/net/ether0/0
/net/ether0/1
/net/ether0/2
/net/ether0/addr
/net/ether0/clone
/net/ether0/ifstats
/net/ether0/stats

To get network stats, or do I/O, one simply gains access to the appropriate
ring buffer, by finding the name, and does the ring buffer sends and
receives via shared memory queues. The I/O operations can be very efficient.

Disk looks like this:

cpu% ls -l /dev/sdC0
--rw-r- S 0 bootes bootes 104857600 Jan 22 15:49 /dev/sdC0/9fat
--rw-r- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
--rw-r- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/ctl
--rw-r- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
--rw-r- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
--rw-r- S 0 bootes bootes 3268060672 Jan 22 15:49 /dev/sdC0/isect
--rw-r- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
--rw-r- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
-lrw--- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/raw
--rw-r- S 0 bootes bootes 536870912 Jan 22 15:49 /dev/sdC0/swap
cpu%

So the disk partitions are files, with the data file being the whole disk.
Again, on a hypervisor system, to do I/O, software could create a connection
to the file and establish the in-memory ring buffer for that partition.
This I/O can be very efficient; IBM research is working on zero-copy
mechanisms for moving data between domains.

The result is a single, consistent mechanism for accessing all resources
from a guest domain. The resources have names, and it is easy to examine the
status -- binary interfaces can be minimized. The resources can be provided
by in-kernel servers -- Linux drivers -- or out-of-kernel servers --
processes. Same interface, and yet the implementation of the provider of the
resource can be utterly different.

We had hoped to get something like this into Xen. On Xen, for example, the
block device and ethernet device interfaces are as different as one could
imagine. Disk I/O does not steal pages from the guest. The network does.
Disk I/O is in 4k chunks, period, with a bitmap describing which of the 8
512-byte subunits are being sent. The enet device, on read, returns a page
with your packet, but also potentially containing bits of other domains'
packets too. The interfaces are as dissimilar as they can be, and I see no
reason for such a huge variance between what are basically read/write
devices.

Another issue is that kvm, in its current form (-24), is beautifully simple.
These additions seem to detract from the beauty a bit. Might it be worth
taking a little time to consider these ideas in order to preserve the basic
elegance of KVM?

So, before we go too far down the Xen-like paravirtualized device route,
can we discuss the way this ought to look a bit?

thanks

ron
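
The "same interface, different resource" argument can be made concrete with two helpers that know nothing about what sits behind a name. The sketch below uses plain POSIX file calls, so it illustrates only the principle; it is not 9P and not a proposed KVM transport. The helper names are made up, and whether a path such as /net/ether0/stats exists depends entirely on the system providing it.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to cap-1 bytes from any named resource and NUL-terminate.
 * Returns the number of bytes read, or -1 on error. */
static ssize_t read_resource(const char *path, char *buf, size_t cap)
{
    size_t total = 0;
    int fd;

    assert(path != NULL && buf != NULL && cap > 0);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while (total < cap - 1) {
        ssize_t n = read(fd, buf + total, cap - 1 - total);

        if (n < 0) {
            close(fd);
            return -1;
        }
        if (n == 0)              /* end of data */
            break;
        total += (size_t)n;
    }
    close(fd);
    buf[total] = '\0';
    return (ssize_t)total;
}

/* Write a NUL-terminated string to any named resource.
 * Returns the number of bytes written, or -1 on error. */
static ssize_t write_resource(const char *path, const char *text)
{
    size_t len, done = 0;
    int fd;

    assert(path != NULL && text != NULL);
    len = strlen(text);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    while (done < len) {
        ssize_t n = write(fd, text + done, len - done);

        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }
    close(fd);
    return (ssize_t)done;
}
```

The same two calls would serve a disk partition, a network statistics file, or a console, as long as the provider exposes each one under a name; only the path changes.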
Re: [kvm-devel] [PATCH/RFC 4/9] Basic guest virtual devices infrastructure
On Friday 11 May 2007, Carsten Otte wrote:

> This patch adds support for a new bus type that manages paravirtualized
> devices. The bus uses the s390 diagnose instruction to query devices, and
> match them with the corresponding drivers.

It seems that the diagnose instruction is really the only s390 specific
thing in here, right? I guess this part of your series is the first one
that we should have in an architecture independent way.

There may also be the chance of merging this with existing virtual buses
like the one for the ps3, which also just exists using hypercalls.

> +int vdev_match(struct device * dev, struct device_driver *drv)
> +{
> +	struct vdev *vdev = to_vdev(dev);
> +	struct vdev_driver *vdrv = to_vdrv(drv);
> +
> +	if (vdev->vdev_type == vdrv->vdev_type)
> +		return 1;
> +
> +	return 0;
> +}

Why invent device type numbers? On open firmware, we just do a string
compare, which is more intuitive, and means you don't need any further

> +int vdev_probe(struct device * dev)
> +{
> +	struct vdev *vdev = to_vdev(dev);
> +	struct vdev_driver *vdrv = to_vdrv(dev->driver);
> +
> +	return vdrv->probe(vdev);
> +}

This abstraction is unnecessary, just do the to_vdev() conversion inside of
the individual drivers.

> +
> +struct device vdev_bus = {
> +	.bus_id = "vdev0",
> +	.release = vdev_bus_release
> +};

> +static void vdev_bus_release (struct device *device)
> +{
> +	/* noop, static bus object */
> +}

Just make the root of your devices a platform_device, then you don't need
to do dirty tricks like this.

> +static int vdev_scan_coldplug(void)
> +{
> +	int rc;
> +	struct vdev *device;
> +
> +	do {
> +		device = kzalloc(sizeof(struct vdev), GFP_ATOMIC);
> +		if (!device) {
> +			rc = -ENOMEM;
> +			goto out;
> +		}
> +		rc = vdev_diag_hotplug(device->symname, device->hostid);
> +		if (rc == -ENODEV)
> +			break;
> +		if (rc < 0) {
> +			printk (KERN_WARNING "vdev: error %d detecting \
> +				initial devices\n", rc);
> +			break;
> +		}
> +		device->vdev_type = rc;
> +
> +		//sanity: are strings terminated?
> +		if ((strnlen(device->symname, 128) == 128) ||
> +		    (strnlen(device->hostid, 128) == 128)) {
> +			// warn and discard device
> +			printk ("vdev: illegal device entry received\n");
> +			break;
> +		}
> +
> +		rc = vdevice_register(device);
> +		if (rc) {
> +			kfree(device);
> +		} else
> +			switch (device->vdev_type) {
> +			case VDEV_TYPE_DISK:
> +				printk (KERN_INFO "vdev: storage device \
> +					detected: %s\n", device->symname);
> +				break;
> +			case VDEV_TYPE_NET:
> +				printk (KERN_INFO "vdev: network device \
> +					detected: %s\n", device->symname);
> +				break;
> +			default:
> +				printk (KERN_INFO "vdev: unknown device \
> +					detected: %s\n", device->symname);
> +			}
> +	} while(1);
> +	kfree (device);
> + out:
> +	return 0;
> +}

Interesting concept of probing the bus -- so you just ask if there are any
new devices, right?

> +#define VDEV_TYPE_DISK 0
> +#define VDEV_TYPE_NET  1
> +
> +struct vdev {
> +	unsigned int		vdev_type;
> +	char			symname[128];
> +	char			hostid[128];
> +	struct vdev_driver	*driver;
> +	struct device		dev;
> +	void			*drv_private;
> +};

You shouldn't need the driver and drv_private fields -- they are already
present in struct device.

	Arnd
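
The suggestion to match by string rather than by type number can be sketched in a few lines. This is an illustration only: the struct names, the "compatible" field and its length are hypothetical and do not come from the posted patch, and a real bus match callback would take struct device and struct device_driver pointers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define VDEV_NAME_LEN 32

/* Hypothetical, reduced device record: the host fills in a type string. */
struct vdev_sketch {
    char compatible[VDEV_NAME_LEN];
};

/* Hypothetical, reduced driver record: the driver names what it handles. */
struct vdrv_sketch {
    const char *compatible;
};

/* Returns 1 if the driver handles the device, 0 otherwise. */
static int vdev_match_by_name(const struct vdev_sketch *dev,
                              const struct vdrv_sketch *drv)
{
    assert(dev != NULL && drv != NULL && drv->compatible != NULL);

    /* Reject device strings that are not terminated inside the field,
     * since they come from outside the guest. */
    if (memchr(dev->compatible, '\0', VDEV_NAME_LEN) == NULL)
        return 0;
    return strcmp(dev->compatible, drv->compatible) == 0;
}
```

With string matching, adding a new device class needs no shared header of numeric constants; host and guest only have to agree on a name.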
Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device driver
ron minnich wrote:
> Avoiding too much detail, in the plan 9 world, read and write of data to a disk is via file read and write system calls.

For low speed devices, I think paravirtualization doesn't make a lot of sense unless it's absolutely required. I don't know enough about s390 to know if it supports things like uarts but if so, then emulating a uart would in my mind make a lot more sense than a PV console device.

> Same for a network. Same for the mouse, the window system, the serial port, the console, USB, and so on.
>
> Please see this note from IBM on what is possible:
> http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
> or
> http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
>
> Different resources, same interface.
>
> In the hypervisor world, you build one shared memory queue as a basic abstraction. On top of that queue, you run 9P. The provider (network, block device, etc.) provides certain resources to you, the guest domain. The resources have names. A network can look like this, to a kvm guest (this command from a Plan 9 system):
>
> cpu% ls /net/ether0
> /net/ether0/0
> /net/ether0/1
> /net/ether0/2
> /net/ether0/addr
> /net/ether0/clone
> /net/ether0/ifstats
> /net/ether0/stats

This smells a bit like XenStore which I think most will agree was an unmitigated disaster. This sort of thing gets terribly complicated to deal with in the corner cases. Atomic operation of multiple read/write operations is difficult to express. Moreover, quite a lot of things are naturally expressed as a state machine, which is not straightforward to do in this sort of model. This may have been all figured out in 9P but it's certainly not a simple thing to get right.

I think a general rule of thumb for a virtualized environment is that the closer you stick to the way hardware tends to do things, the less likely you are to screw yourself up and the easier it will be for other platforms to support your devices.

Implementing a full 9P client just to get console access in something like mini-os would be unfortunate. At least the posted s390 console driver behaves roughly like a uart so it's pretty obvious that it will be easy to implement in any OS that supports uarts already.

Regards,

Anthony Liguori

> To get network stats, or do I/O, one simply gains access to the appropriate ring buffer, by finding the name, and does the ring buffer sends and receives via shared memory queues. The I/O operations can be very efficient.
>
> Disk looks like this:
>
> cpu% ls -l /dev/sdC0
> --rw-r- S 0 bootes bootes 104857600 Jan 22 15:49 /dev/sdC0/9fat
> --rw-r- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
> --rw-r- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/ctl
> --rw-r- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
> --rw-r- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
> --rw-r- S 0 bootes bootes 3268060672 Jan 22 15:49 /dev/sdC0/isect
> --rw-r- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
> --rw-r- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
> -lrw--- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/raw
> --rw-r- S 0 bootes bootes 536870912 Jan 22 15:49 /dev/sdC0/swap
> cpu%
>
> So the disk partitions are files, with the data file being the whole disk. Again, on a hypervisor system, to do I/O, software could create a connection to the file and establish the in-memory ring buffer, for that partition. This I/O can be very efficient; IBM research is working on zero-copy mechanisms for moving data between domains.
>
> The result is a single, consistent mechanism for accessing all resources from a guest domain. The resources have names, and it is easy to examine the status -- binary interfaces can be minimized. The resources can be provided by in-kernel servers -- Linux drivers -- or out-of-kernel servers -- processes. Same interface, and yet the implementation of the provider of the resource can be utterly different.
>
> We had hoped to get something like this into Xen. On Xen, for example, the block device and ethernet device interfaces are as different as one could imagine. Disk I/O does not steal pages from the guest. The network does. Disk I/O is in 4k chunks, period, with a bitmap describing which of the 8 512-byte subunits are being sent. The enet device, on read, returns a page with your packet, but also potentially containing bits of other domains' packets too. The interfaces are as dissimilar as they can be, and I see no reason for such a huge variance between what are basically read/write devices.
>
> Another issue is that kvm, in its current form (-24) is beautifully simple. These additions seem to detract from the beauty a bit. Might it be worth taking a little time to consider these ideas in order to preserve the basic elegance of KVM?
>
> So, before we go too far down the Xen-like paravirtualized device route, can we discuss
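
The "one shared memory queue as a basic abstraction" idea discussed above can be sketched as a fixed-slot ring that a protocol such as 9P would then run over. This is only an illustration under stated assumptions: the names (struct ring, ring_put, ring_get), the slot count and the slot size are hypothetical and come from neither KVM, Xen, nor Plan 9, and the sketch is single-threaded, so it leaves out the memory barriers and notifications that a real guest/host ring would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 8u    /* hypothetical; must be a power of two */
#define SLOT_BYTES 64u   /* hypothetical fixed slot size */

/* One direction of a queue; in a real system this would live in a
 * page shared between the guest and the resource provider. */
struct ring {
    uint32_t head;                        /* next slot the producer fills */
    uint32_t tail;                        /* next slot the consumer drains */
    uint16_t len[RING_SLOTS];             /* payload length per slot */
    uint8_t  slot[RING_SLOTS][SLOT_BYTES];
};

/* Producer side: copy one message in; false if it does not fit or
 * the ring is full. */
static bool ring_put(struct ring *r, const void *msg, size_t n)
{
    if (n > SLOT_BYTES || r->head - r->tail == RING_SLOTS)
        return false;
    uint32_t i = r->head & (RING_SLOTS - 1);
    memcpy(r->slot[i], msg, n);
    r->len[i] = (uint16_t)n;
    r->head++;
    return true;
}

/* Consumer side: copy one message out; returns its length, or 0 if
 * the ring is empty or the caller's buffer is too small. */
static size_t ring_get(struct ring *r, void *out, size_t cap)
{
    if (r->head == r->tail)
        return 0;
    uint32_t i = r->tail & (RING_SLOTS - 1);
    size_t n = r->len[i];
    if (n > cap)
        return 0;
    memcpy(out, r->slot[i], n);
    r->tail++;
    return n;
}
```

The same two calls would carry console bytes, block requests or network frames; what differs per resource is only the message format layered on top, which is the point both sides of this thread are arguing about.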
Re: [kvm-devel] [PATCH/RFC 8/9] Virtual network host switch support
Carsten Otte wrote:
> From: Christian Borntraeger [EMAIL PROTECTED]
>
> This is the host counterpart for the virtual network device driver. This driver has a char device node where the hypervisor can attach. It also has a kind of dumb switch that passes packets between guests. Last but not least it contains a host network interface. Patches for attaching other host network devices to the switch via raw sockets, extensions to qeth or netfilter are

Any feel for the performance relative to the bridging code? The bridging code is a pretty big bottleneck in guest<=>guest communications in Xen at least.

> currently tested but not ready yet. We did not use the linux bridging code to allow non-root users to create virtual networks between guests.

Is that the primary reason? If so, that seems like a rather large hammer for something that a userspace suid wrapper could have addressed...

Regards,

Anthony Liguori

> Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
> Signed-off-by: Carsten Otte [EMAIL PROTECTED]
> ---
>  drivers/s390/guest/Makefile          |    3
>  drivers/s390/guest/vnet_port_guest.c |  302
>  drivers/s390/guest/vnet_port_guest.h |   21
>  drivers/s390/guest/vnet_port_host.c  |  418
>  drivers/s390/guest/vnet_port_host.h  |   18
>  drivers/s390/guest/vnet_switch.c     |  828
>  drivers/s390/guest/vnet_switch.h     |  119
>  drivers/s390/net/Kconfig             |   12
>  8 files changed, 1721 insertions(+)
>
> Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
> ===
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
> @@ -0,0 +1,302 @@
> +/*
> + *  Copyright (C) 2005 IBM Corporation
> + *  Authors: Carsten Otte [EMAIL PROTECTED]
> + *           Christian Borntraeger [EMAIL PROTECTED]
> + *
> + */
> +#include <linux/etherdevice.h>
> +#include <linux/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/pagemap.h>
> +#include <linux/poll.h>
> +#include <linux/spinlock.h>
> +
> +#include "vnet.h"
> +#include "vnet_port_guest.h"
> +#include "vnet_switch.h"
> +
> +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data)
> +{
> +	int oldval, newval;
> +
> +	do {
> +		oldval = atomic_read(&zgp->pending_irqs);
> +		newval = oldval | data;
> +	} while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval);
> +}
> +
> +static int COFIXME_get_irq(struct vnet_guest_port *zgp)
> +{
> +	int oldval;
> +
> +	do {
> +		oldval = atomic_read(&zgp->pending_irqs);
> +	} while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval);
> +
> +	return oldval;
> +}
> +
> +static void
> +vnet_guest_interrupt(struct vnet_port *port, int type)
> +{
> +	struct vnet_guest_port *priv;
> +
> +	priv = port->priv;
> +
> +	if (!priv->fasync) {
> +		printk (KERN_WARNING "vnet: cannot send interrupt,"
> +			" fd not async\n");
> +		return;
> +	}
> +	switch (type) {
> +	case VNET_IRQ_START_RX:
> +		COFIXME_add_irq(priv, POLLIN);
> +		kill_fasync(&priv->fasync, SIGIO, POLL_IN);
> +		break;
> +	case VNET_IRQ_START_TX:
> +		COFIXME_add_irq(priv, POLLOUT);
> +		kill_fasync(&priv->fasync, SIGIO, POLL_OUT);
> +		break;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +/* release all pinned user pages*/
> +static void
> +vnet_guest_release_pages(struct vnet_port *port)
> +{
> +	int i,j;
> +
> +	for (i=0; i<VNET_QUEUE_LEN; i++)
> +		for (j=0; j<VNET_BUFFER_PAGES; j++) {
> +			if (port->s2p_data[i][j]) {
> +				page_cache_release(virt_to_page(port->s2p_data[i][j]));
> +				port->s2p_data[i][j] = NULL;
> +			}
> +			if (port->p2s_data[i][j]) {
> +				page_cache_release(virt_to_page(port->p2s_data[i][j]));
> +				port->p2s_data[i][j] = NULL;
> +			}
> +		}
> +	if (port->control) {
> +		page_cache_release(virt_to_page(port->control));
> +		port->control = NULL;
> +	}
> +}
> +
> +static int
> +vnet_chr_open(struct inode *ino, struct file *filp)
> +{
> +	int minor;
> +	struct vnet_port *port;
> +	char name[BUS_ID_SIZE];
> +
> +	minor = iminor(filp->f_dentry->d_inode);
> +	snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid);
> +	port = vnet_port_get(minor, name);
> +	if (!port)
> +		return -ENODEV;
> +	port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL);
> +	if (!port->priv) {
> +		vnet_port_put(port);
> +		return -ENOMEM;
> +	}
> +	port->interrupt = vnet_guest_interrupt;
> +	filp->private_data = port;
> +	return nonseekable_open(ino, filp);
> +}
> +
> +static int
> +vnet_chr_release (struct inode *ino, struct file *filp)
> +{
> +	struct
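
The two COFIXME helpers in the quoted patch are compare-and-swap loops that implement "atomically OR in some bits" and "atomically fetch and clear". As a userspace illustration only, the following sketch expresses the same pattern with C11 <stdatomic.h> rather than the kernel's atomic_t API; struct pending and the function names are hypothetical. It shows the loop form next to the single-call equivalents (atomic_fetch_or and atomic_exchange).

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the pending_irqs field of the patch. */
struct pending {
    atomic_int bits;
};

/* Loop form, mirroring the patch: retry until no other thread
 * changed the value between the read and the swap. */
static void add_irq_cas(struct pending *p, int data)
{
    int oldval = atomic_load(&p->bits);

    /* On failure, oldval is reloaded with the current value. */
    while (!atomic_compare_exchange_weak(&p->bits, &oldval, oldval | data))
        ;
}

/* Single-call form with the same effect. */
static void add_irq(struct pending *p, int data)
{
    atomic_fetch_or(&p->bits, data);
}

/* Fetch the pending bits and clear them in one step. */
static int get_irq(struct pending *p)
{
    return atomic_exchange(&p->bits, 0);
}
```

In kernel code the fetch-and-clear half would presumably map onto atomic_xchg(); whether a suitable atomic OR primitive was available to an s390 driver at the time is not something this thread states, which may be why the helpers carry a FIXME-style prefix.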
[kvm-devel] [PATCH 0/4] in-kernel APIC v3a (usermode side)
I re-worked the QEMU patches based on feedback from Dor and Anthony. Here is the changelog:

1) Got rid of the extern kvm_context from qemu/pc/apic.c. This function is now wrapped by qemu-kvm which assigns proper kvm_context on behalf of the caller.
2) Added support for a command line option: --kvm_apic [0 | 1 | 2]. The system defaults to level-1 mode (KVM based LAPIC). Level-0 (QEMU based LAPIC) is also supported. Level-2 is not supported yet, TBD.
3) Added the idea that Anthony proposed to have kvm_allowed=0 defined, even if USE_KVM is not.
4) Cleaned up indentation
5) Cleaned up support for level-0 mode.

I have tested this code (in conjunction with the v3 kernel-patch) against

A) 32 bit XP w/ACPI
B) 64 bit SLED-10 (2.6.16 based)

And everything seems to be working great. Note that the current git-HEAD of the userspace code seems to break pretty badly for linux right now. I am testing exclusively on Intel chips (5130 Woodcrest and T7600 Merom), so YMMV. As such, these patches apply to git 7b9ee2382b07e955cc62a564406e3d9c4a08de6c.

Any feedback at all would be appreciated (particularly news of successful testing :). Thanks!

Regards,
-Greg

___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
[kvm-devel] [PATCH 1/4] KVM: Updates for compiling in-kernel APIC support with external-modules
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kbuild |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/Kbuild b/kernel/Kbuild
index e9bcda7..103a179 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,5 +1,5 @@
 EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o kernint.o lapic.o
 kvm-intel-objs := vmx.o vmx-debug.o
 kvm-amd-objs := svm.o
[kvm-devel] [PATCH 2/4] KVM-USER: Make the kvm_allowed flag always defined so we dont need #ifdefs
Non-performance critical code is made more awkward by having to always use both #ifdef KVM and if (kvm_allowed). Define kvm_allowed unconditionally, defaulting to 0 when USE_KVM is not set. Anthony Liguori is credited with the idea.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/qemu-kvm.c |    9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 212570a..d4419a3 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -3,6 +3,14 @@
 
 #include "config-host.h"
 #ifdef USE_KVM
+ #define KVM_ALLOWED_DEFAULT 1
+#else
+ #define KVM_ALLOWED_DEFAULT 0
+#endif
+
+int kvm_allowed = KVM_ALLOWED_DEFAULT;
+
+#ifdef USE_KVM
 
 #include "exec.h"
 
@@ -14,7 +22,6 @@
 extern void perror(const char *s);
 
-int kvm_allowed = 1;
 kvm_context_t kvm_context;
 
 static struct kvm_msr_list *kvm_msr_list;
 static int kvm_has_msr_star;
[kvm-devel] [PATCH 3/4] KVM-USER: Add ability to specify APIC emulation type from the command-line
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/qemu-kvm.c |    1 +
 qemu/vl.c       |    5 +
 qemu/vl.h       |    1 +
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index d4419a3..faa4684 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -9,6 +9,7 @@
 #endif
 
 int kvm_allowed = KVM_ALLOWED_DEFAULT;
+int kvm_apic_level = 1;
 
 #ifdef USE_KVM
 
diff --git a/qemu/vl.c b/qemu/vl.c
index 7df1c80..88e650e 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -6531,6 +6531,7 @@ enum {
     QEMU_OPTION_vnc,
     QEMU_OPTION_no_acpi,
     QEMU_OPTION_no_kvm,
+    QEMU_OPTION_kvm_apic,
     QEMU_OPTION_no_reboot,
     QEMU_OPTION_daemonize,
     QEMU_OPTION_option_rom,
@@ -6600,6 +6601,7 @@ const QEMUOption qemu_options[] = {
 #endif
 #ifdef USE_KVM
     { "no-kvm", 0, QEMU_OPTION_no_kvm },
+    { "kvm_apic", HAS_ARG, QEMU_OPTION_kvm_apic },
 #endif
 #if defined(TARGET_PPC) || defined(TARGET_SPARC)
     { "g", 1, QEMU_OPTION_g },
@@ -7309,6 +7311,9 @@ int main(int argc, char **argv)
             case QEMU_OPTION_no_kvm:
                 kvm_allowed = 0;
                 break;
+            case QEMU_OPTION_kvm_apic:
+                kvm_apic_level = optarg;
+                break;
 #endif
             case QEMU_OPTION_usb:
                 usb_enabled = 1;
diff --git a/qemu/vl.h b/qemu/vl.h
index debd17c..dec410e 100644
--- a/qemu/vl.h
+++ b/qemu/vl.h
@@ -158,6 +158,7 @@ extern int graphic_depth;
 extern const char *keyboard_layout;
 extern int kqemu_allowed;
 extern int kvm_allowed;
+extern int kvm_apic_level;
 extern int win2k_install_hack;
 extern int usb_enabled;
 extern int smp_cpus;
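
As the patch appears in this archive, the QEMU_OPTION_kvm_apic case stores optarg, a string, into the int kvm_apic_level. A minimal sketch of converting and range-checking the argument first is shown below. It is not part of the posted series: parse_apic_level is a hypothetical helper name, and the accepted range 0..2 is taken from the "--kvm_apic [0 | 1 | 2]" description in the cover letter.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse an APIC level argument.
 * Returns 0 and stores the level on success; returns -1 and leaves
 * *level untouched if the text is empty, not a number, or not 0..2. */
static int parse_apic_level(const char *arg, int *level)
{
    char *end;
    long v;

    if (arg == NULL || *arg == '\0')
        return -1;

    errno = 0;
    v = strtol(arg, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > 2)
        return -1;

    *level = (int)v;
    return 0;
}
```

In the option switch this would be used along the lines of calling parse_apic_level(optarg, &kvm_apic_level) and reporting an error on a non-zero return; how QEMU of that era preferred to report bad option values is left open here.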
[kvm-devel] [PATCH 4/4] KVM: in-kernel-apic modification to QEMU
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/hw/apic.c  |   20 +++-
 qemu/hw/pc.c    |   30 +-
 qemu/qemu-kvm.c |   49 +++--
 qemu/qemu-kvm.h |    2 ++
 qemu/vl.c       |    2 +-
 qemu/vl.h       |    2 +-
 user/kvmctl.c   |   33 -
 user/kvmctl.h   |   31 ++-
 user/main.c     |    2 +-
 9 files changed, 138 insertions(+), 33 deletions(-)

diff --git a/qemu/hw/apic.c b/qemu/hw/apic.c
index 0b73233..9ac9ae4 100644
--- a/qemu/hw/apic.c
+++ b/qemu/hw/apic.c
@@ -18,6 +18,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  */
 #include "vl.h"
+#include "qemu-kvm.h"
 
 //#define DEBUG_APIC
 //#define DEBUG_IOAPIC
@@ -87,6 +88,7 @@ typedef struct APICState {
 } APICState;
 
 struct IOAPICState {
+    CPUState *cpu_env;
     uint8_t id;
     uint8_t ioregsel;
 
@@ -888,10 +890,17 @@ static void ioapic_service(IOAPICState *s)
                 vector = pic_read_irq(isa_pic);
             else
                 vector = entry & 0xff;
-
-            apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
-            apic_bus_deliver(deliver_bitmask, delivery_mode,
-                             vector, polarity, trig_mode);
+
+            if (kvm_allowed && kvm_apic_level) {
+                ext_apic_bus_deliver(dest, trig_mode, dest_mode,
+                                     delivery_mode, vector);
+                cpu_interrupt(s->cpu_env, CPU_INTERRUPT_HARD);
+            } else {
+                apic_get_delivery_bitmask(deliver_bitmask, dest,
+                                          dest_mode);
+                apic_bus_deliver(deliver_bitmask, delivery_mode,
+                                 vector, polarity, trig_mode);
+            }
         }
     }
 }
@@ -1045,7 +1054,7 @@ static CPUWriteMemoryFunc *ioapic_mem_write[3] = {
     ioapic_mem_writel,
 };
 
-IOAPICState *ioapic_init(void)
+IOAPICState *ioapic_init(CPUState *env)
 {
     IOAPICState *s;
     int io_memory;
@@ -1054,6 +1063,7 @@ IOAPICState *ioapic_init(void)
     if (!s)
         return NULL;
     ioapic_reset(s);
+    s->cpu_env = env;
     s->id = last_apic_id++;
 
     io_memory = cpu_register_io_memory(0, ioapic_mem_read,
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index eda49cf..b033637 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -91,16 +91,19 @@ int cpu_get_pic_interrupt(CPUState *env)
 {
     int intno;
 
-    intno = apic_get_interrupt(env);
-    if (intno >= 0) {
-        /* set irq request if a PIC irq is still pending */
-        /* XXX: improve that */
-        pic_update_irq(isa_pic);
-        return intno;
+    if (!kvm_allowed || !kvm_apic_level) {
+        intno = apic_get_interrupt(env);
+        if (intno >= 0) {
+            /* set irq request if a PIC irq is still pending */
+            /* XXX: improve that */
+            pic_update_irq(isa_pic);
+            return intno;
+        }
+
+        /* read the irq from the PIC */
+        if (!apic_accept_pic_intr(env))
+            return -1;
     }
-    /* read the irq from the PIC */
-    if (!apic_accept_pic_intr(env))
-        return -1;
 
     intno = pic_read_irq(isa_pic);
     return intno;
@@ -483,9 +486,10 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
         }
         register_savevm("cpu", i, 4, cpu_save, cpu_load, env);
         qemu_register_reset(main_cpu_reset, env);
-        if (pci_enabled) {
-            apic_init(env);
-        }
+        if (!kvm_allowed || !kvm_apic_level)
+            if (pci_enabled) {
+                apic_init(env);
+            }
     }
 
     /* allocate RAM */
@@ -671,7 +675,7 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
     register_ioport_write(0x92, 1, 1, ioport92_write, NULL);
 
     if (pci_enabled) {
-        ioapic = ioapic_init();
+        ioapic = ioapic_init(env);
     }
     isa_pic = pic_init(pic_irq_request, first_cpu);
     pit = pit_init(0x40, 0);
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index faa4684..03152e1 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -235,9 +235,16 @@ static void load_regs(CPUState *env)
     sregs.cr3 = env->cr[3];
     sregs.cr4 = env->cr[4];
 
-    sregs.apic_base = cpu_get_apic_base(env);
+    if (!kvm_apic_level) {
+        /* These two are no longer used once the in-kernel APIC is enabled */
+        sregs.apic_base = 0;
+        sregs.cr8 = 0;
+    } else {
+        sregs.apic_base = cpu_get_apic_base(env);
+        sregs.cr8 = cpu_get_apic_tpr(env);
+    }
+
     sregs.efer = env->efer;
-    sregs.cr8 = cpu_get_apic_tpr(env);
 
     kvm_set_sregs(kvm_context, 0, &sregs);
 
@@ -329,10 +336,12 @@ static void save_regs(CPUState *env)
     env->cr[3] = sregs.cr3;
     env->cr[4] = sregs.cr4;
 
-    cpu_set_apic_base(env,
Re: [kvm-devel] [PATCH/RFC 8/9] Virtual network host switch support
On Friday 11 May 2007 22:21, Anthony Liguori wrote:
> Any feel for the performance relative to the bridging code? The bridging code is a pretty big bottleneck in guest<=>guest communications in Xen at least.

Last time I checked it we had a quite decent guest to guest performance in the gigabits/sec. On the downside the switch is quite aggressive with dropping packets, as the inbound buffer of the virtual network adapters has space for 80 packets. (that can be changed)

> > currently tested but not ready yet. We did not use the linux bridging code to allow non-root users to create virtual networks between guests.
>
> Is that the primary reason? If so, that seems like a rather large hammer for something that a userspace suid wrapper could have addressed...

Actually there are some reasons why we did not use the bridging code:
- One thing is that a lot of OSA network cards do not support promiscuous mode. There is also the issue that a lot of OSA cards are in layer 3 mode (we get IP packets and no ethernet frames), so bridging won't work to the host interface.
- non-root switches
- the performance of bridging (we copy directly from one guest buffer to another without allocating an skb on the host)
- we considered hooking into the qeth driver (for OSA cards) to deal with layer 3 mode.

The first shot was actually a point-to-point driver (guest netif -- host netif). We added the switch at a later time.

Hmm, if we can make bridging work (with a decent performance) on s390 that would reduce the maintenance work for us, as this network switch is far from being complete.

cheers

Christian
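
The behaviour described above (frames copied straight into the receiving guest's inbound buffer, and dropped once its 80 slots are full) can be sketched as follows. This is an illustration only and not the vnet_switch.c code, which is not shown in this thread: struct inbound, switch_deliver and MAX_FRAME are hypothetical names and values, and only the slot count of 80 comes from the message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define INBOUND_SLOTS 80    /* from the message: space for 80 packets */
#define MAX_FRAME     1514  /* assumption: Ethernet frame without FCS */

/* Per-port inbound buffer of the receiving guest. */
struct inbound {
    unsigned head;                 /* index of the oldest queued frame */
    unsigned count;                /* number of queued frames */
    unsigned long dropped;         /* frames discarded because full */
    size_t len[INBOUND_SLOTS];
    unsigned char data[INBOUND_SLOTS][MAX_FRAME];
};

/* Copy one frame directly into the destination port's buffer.
 * When no slot is free the frame is dropped rather than queued
 * elsewhere, which is what makes the switch "aggressive". */
static bool switch_deliver(struct inbound *dst, const void *frame, size_t n)
{
    if (n > MAX_FRAME || dst->count == INBOUND_SLOTS) {
        dst->dropped++;
        return false;
    }
    unsigned slot = (dst->head + dst->count) % INBOUND_SLOTS;
    memcpy(dst->data[slot], frame, n);
    dst->len[slot] = n;
    dst->count++;
    return true;
}
```

The single memcpy per frame stands in for the "copy directly from one guest buffer to another without allocating an skb" point; making INBOUND_SLOTS configurable would correspond to the remark that the limit can be changed.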
[kvm-devel] [PATCH 4/4] KVM: in-kernel-apic modification to QEMU
This has the latest feedback from Anthony incorporated

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/hw/apic.c  |   20 +++-
 qemu/hw/pc.c    |   29 -
 qemu/qemu-kvm.c |   49 +++--
 qemu/qemu-kvm.h |    2 ++
 qemu/vl.c       |    2 +-
 qemu/vl.h       |    7 ++-
 user/kvmctl.c   |   33 -
 user/kvmctl.h   |   31 ++-
 user/main.c     |    2 +-
 9 files changed, 142 insertions(+), 33 deletions(-)

diff --git a/qemu/hw/apic.c b/qemu/hw/apic.c
index 0b73233..5665057 100644
--- a/qemu/hw/apic.c
+++ b/qemu/hw/apic.c
@@ -18,6 +18,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
  */
 #include "vl.h"
+#include "qemu-kvm.h"
 
 //#define DEBUG_APIC
 //#define DEBUG_IOAPIC
@@ -87,6 +88,7 @@ typedef struct APICState {
 } APICState;
 
 struct IOAPICState {
+    CPUState *cpu_env;
     uint8_t id;
     uint8_t ioregsel;
 
@@ -888,10 +890,17 @@ static void ioapic_service(IOAPICState *s)
                 vector = pic_read_irq(isa_pic);
             else
                 vector = entry & 0xff;
-
-            apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
-            apic_bus_deliver(deliver_bitmask, delivery_mode,
-                             vector, polarity, trig_mode);
+
+            if (kvm_allowed && kvm_apic_level) {
+                ext_apic_bus_deliver(dest, trig_mode, dest_mode,
+                                     delivery_mode, vector);
+                cpu_interrupt(s->cpu_env, CPU_INTERRUPT_HARD);
+            } else {
+                apic_get_delivery_bitmask(deliver_bitmask, dest,
+                                          dest_mode);
+                apic_bus_deliver(deliver_bitmask, delivery_mode,
+                                 vector, polarity, trig_mode);
+            }
         }
     }
 }
@@ -1045,7 +1054,7 @@ static CPUWriteMemoryFunc *ioapic_mem_write[3] = {
     ioapic_mem_writel,
 };
 
-IOAPICState *ioapic_init(void)
+IOAPICState *ioapic_init(CPUState *env)
 {
     IOAPICState *s;
     int io_memory;
@@ -1054,6 +1063,7 @@ IOAPICState *ioapic_init(void)
     if (!s)
         return NULL;
     ioapic_reset(s);
+    s->cpu_env = env;
     s->id = last_apic_id++;
 
     io_memory = cpu_register_io_memory(0, ioapic_mem_read,
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index eda49cf..618cc32 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -91,16 +91,19 @@ int cpu_get_pic_interrupt(CPUState *env)
 {
     int intno;
 
-    intno = apic_get_interrupt(env);
-    if (intno >= 0) {
-        /* set irq request if a PIC irq is still pending */
-        /* XXX: improve that */
-        pic_update_irq(isa_pic);
-        return intno;
+    if (!use_kernel_apic()) {
+        intno = apic_get_interrupt(env);
+        if (intno >= 0) {
+            /* set irq request if a PIC irq is still pending */
+            /* XXX: improve that */
+            pic_update_irq(isa_pic);
+            return intno;
+        }
+
+        /* read the irq from the PIC */
+        if (!apic_accept_pic_intr(env))
+            return -1;
     }
-    /* read the irq from the PIC */
-    if (!apic_accept_pic_intr(env))
-        return -1;
 
     intno = pic_read_irq(isa_pic);
     return intno;
@@ -483,9 +486,9 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
         }
         register_savevm("cpu", i, 4, cpu_save, cpu_load, env);
         qemu_register_reset(main_cpu_reset, env);
-        if (pci_enabled) {
-            apic_init(env);
-        }
+        if (!use_kernel_apic() && pci_enabled) {
+            apic_init(env);
+        }
     }
 
     /* allocate RAM */
@@ -671,7 +674,7 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
     register_ioport_write(0x92, 1, 1, ioport92_write, NULL);
 
     if (pci_enabled) {
-        ioapic = ioapic_init();
+        ioapic = ioapic_init(env);
     }
     isa_pic = pic_init(pic_irq_request, first_cpu);
     pit = pit_init(0x40, 0);
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index faa4684..59e79bf 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -235,9 +235,16 @@ static void load_regs(CPUState *env)
     sregs.cr3 = env->cr[3];
     sregs.cr4 = env->cr[4];
 
-    sregs.apic_base = cpu_get_apic_base(env);
+    if (!kvm_apic_level) {
+        /* These two are no longer used once the in-kernel APIC is enabled */
+        sregs.apic_base = 0;
+        sregs.cr8 = 0;
+    } else {
+        sregs.apic_base = cpu_get_apic_base(env);
+        sregs.cr8 = cpu_get_apic_tpr(env);
+    }
+
     sregs.efer = env->efer;
-    sregs.cr8 = cpu_get_apic_tpr(env);
 
     kvm_set_sregs(kvm_context, 0, &sregs);
 
@@ -329,10 +336,12 @@ static void save_regs(CPUState *env)
     env->cr[3] = sregs.cr3;
     env->cr[4] = sregs.cr4;
 
-    cpu_set_apic_base(env,
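
This revision calls use_kernel_apic() in pc.c, but the helper's definition falls outside the part of the patch preserved here (the diffstat suggests it may have been added to qemu/vl.h). A plausible shape, inferred only from the condition it replaces in the earlier revision (!kvm_allowed || !kvm_apic_level), is sketched below. The body is an assumption, and the two globals are defined locally just to keep the sketch self-contained.

```c
#include <stdbool.h>

/* Stand-ins for the globals defined in qemu-kvm.c by patches 2/4
 * and 3/4; the values are the defaults given in the cover letter. */
int kvm_allowed = 1;
int kvm_apic_level = 1;

/* Assumed definition: the in-kernel APIC is used only when KVM is
 * allowed and a non-zero APIC level was selected. */
static bool use_kernel_apic(void)
{
    return kvm_allowed && kvm_apic_level != 0;
}
```

Folding the two flags into one predicate keeps call sites such as cpu_get_pic_interrupt() and pc_init1() from repeating, and possibly mistyping, the compound condition.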
Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device driver
Eric Van Hensbergen wrote:
> On 5/11/07, Anthony Liguori [EMAIL PROTECTED] wrote:
> > > cpu% ls /net/ether0
> > > /net/ether0/0
> > > /net/ether0/1
> > > /net/ether0/2
> > > /net/ether0/addr
> > > /net/ether0/clone
> > > /net/ether0/ifstats
> > > /net/ether0/stats
> >
> > This smells a bit like XenStore which I think most will agree was an unmitigated disaster.
>
> I'd have to disagree with you Anthony. The Plan 9 interfaces are simple and built into the kernel - they don't have the multi-layered-stack-python-xmlrpc garbage that made up the Xen interfaces.

My point isn't that 9p is just like XenStore but rather that turning this idea into something that is useful and elegant is non-trivial.

> If it were just console access, I would agree with you, but it's really about implementing a single solution for all drivers you are accessing across the interface. A single client versus dozens of different driver variants.

There's definitely a conversation to have here. There are going to be a lot of small devices that would benefit from a common transport mechanism. Someone mentioned a PV entropy device on LKML. A host<=>guest filesystem is another consumer of such an interface.

I'm inclined to think though that the abstraction point should be the transport and not the actual protocol. My concern with standardizing on a protocol like 9p would be that one would lose some potential optimizations (like passing PFNs directly between guest and host).

> Our existing 9p client for mini-os is ~3000 LOC and it is a pretty naive port from the p9p code base so it could probably be reduced even further. It is a very small percentage of our existing mini-os kernels and gives us console, disk, network, IP stack, file system, and control interfaces. Of course Linux clients could just use v9fs with a hypervisor-shared-memory transport which I haven't merged yet. We'll also be using the same set of interfaces for the simulator shortly.

So is there any reason to even tie 9p to KVM? Why not just have a common PV transport that 9p can use? For certain things, it may make sense (like v9fs).

Regards,

Anthony Liguori

> Oh yeah, and don't forget the fact that resource access can bridge seamlessly over any network and the protocol has provisions to be secured with authentication/encryption/digesting if desired.
>
> Los Alamos will be presenting 9p based control interfaces for KVM at OLS.
>
> -eric