[kvm-devel] [PATCH/RFC 1/9] s390 host memory management changes.

2007-05-11 Thread Carsten Otte
From: Heiko Carstens [EMAIL PROTECTED]

Add the changes to s390 memory management that are necessary to use the s390
hardware-assisted virtualization facility. For this, the upper half of each
page table needs to be reserved so the hardware can save extended page status
bits for the guest and the host.
An easy solution is to change PTRS_PER_PTE and PTRS_PER_PMD accordingly, so
that the upper halves of the pages that contain page tables are unused by
Linux and can be used by the hardware.
Unfortunately, with these #ifdef changes we need twice as much memory for
the page tables of processes, even for those which don't need to save
extended status bits.

Maybe a better solution would be to make PTRS_PER_PTE and PTRS_PER_PMD
a per-process value and only double the size of the page tables if the
process wants to make use of the virtualization instruction.
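
The effect of the changed constants can be checked with a little arithmetic.
The following is a minimal sketch, not part of the patch, assuming 4 KB pages
and 8-byte pte entries. The STD_/HOST_ prefixes, PG_SHIFT, PG_SIZE, ENTRY_SIZE
and pte_tables_needed() are names invented here for illustration; only the
numeric values come from the diff below.

```c
#include <assert.h>   /* static_assert (C11) */

#define PG_SHIFT          12                 /* assumed: 4 KB pages */
#define PG_SIZE           (1UL << PG_SHIFT)
#define ENTRY_SIZE        8UL                /* assumed: 8-byte pte entries */

/* Geometry without CONFIG_S390_HOST (values from the diff context). */
#define STD_PTRS_PER_PTE  512UL
#define STD_PMD_SHIFT     21
#define STD_PTRS_PER_PMD  1024UL

/* Geometry with CONFIG_S390_HOST (values added by the patch). */
#define HOST_PTRS_PER_PTE 256UL
#define HOST_PMD_SHIFT    20
#define HOST_PTRS_PER_PMD 2048UL

#define PGDIR_SHIFT       31                 /* unchanged by the patch */

/* A standard pte table fills its page; a host pte table fills half of it. */
static_assert(STD_PTRS_PER_PTE * ENTRY_SIZE == PG_SIZE, "full page");
static_assert(HOST_PTRS_PER_PTE * ENTRY_SIZE == PG_SIZE / 2, "half page");

/* One pmd entry still spans exactly the range mapped by one pte table. */
static_assert((1UL << STD_PMD_SHIFT) == STD_PTRS_PER_PTE * PG_SIZE, "std pmd span");
static_assert((1UL << HOST_PMD_SHIFT) == HOST_PTRS_PER_PTE * PG_SIZE, "host pmd span");

/* The pmd level covers 2 GB in both cases, so PGDIR_SHIFT can stay at 31. */
static_assert((STD_PTRS_PER_PMD << STD_PMD_SHIFT) == (1UL << PGDIR_SHIFT), "std pgd span");
static_assert((HOST_PTRS_PER_PMD << HOST_PMD_SHIFT) == (1UL << PGDIR_SHIFT), "host pgd span");

/* Number of pte tables needed to map `bytes` of address space. */
static unsigned long pte_tables_needed(unsigned long bytes,
                                       unsigned long ptrs_per_pte)
{
    unsigned long span = ptrs_per_pte * PG_SIZE;

    return (bytes + span - 1) / span;
}
```

Under these assumptions, mapping 2 MB takes one pte table in the standard
layout and two in the host layout, each still occupying a full page, which is
where the doubled page table memory comes from.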

Signed-off-by: Heiko Carstens [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]

---
 include/asm-s390/page.h|8 +
 include/asm-s390/pgalloc.h |5 +
 include/asm-s390/pgtable.h |  197 -
 3 files changed, 209 insertions(+), 1 deletion(-)

Index: linux-2.6.21/include/asm-s390/pgtable.h
===
--- linux-2.6.21.orig/include/asm-s390/pgtable.h
+++ linux-2.6.21/include/asm-s390/pgtable.h
@@ -65,7 +65,11 @@ extern char empty_zero_page[PAGE_SIZE];
 # define PMD_SHIFT 22
 # define PGDIR_SHIFT   22
 #else /* __s390x__ */
+#ifdef CONFIG_S390_HOST
+# define PMD_SHIFT 20
+#else
 # define PMD_SHIFT 21
+#endif
 # define PGDIR_SHIFT   31
 #endif /* __s390x__ */
 
@@ -85,8 +89,13 @@ extern char empty_zero_page[PAGE_SIZE];
 # define PTRS_PER_PMD    1
 # define PTRS_PER_PGD    512
 #else /* __s390x__ */
+#ifdef CONFIG_S390_HOST
+# define PTRS_PER_PTE    256
+# define PTRS_PER_PMD    2048
+#else
 # define PTRS_PER_PTE    512
 # define PTRS_PER_PMD    1024
+#endif
 # define PTRS_PER_PGD    2048
 #endif /* __s390x__ */
 
@@ -217,6 +226,18 @@ extern unsigned long vmalloc_end;
 #define _PAGE_SWT  0x001   /* SW pte type bit t */
 #define _PAGE_SWX  0x002   /* SW pte type bit x */
 
+#ifdef CONFIG_S390_HOST
+#define _PAGE_SOFT_REFERENCED  0x4
+#define _PAGE_SOFT_CHANGED 0x8
+
+/* Page status extended */
+#define _PAGE_RCP_PCL  0x0080UL
+#define _PAGE_RCP_HR   0x0040UL
+#define _PAGE_RCP_HC   0x0020UL
+#define _PAGE_RCP_GR   0x0004UL
+#define _PAGE_RCP_GC   0x0002UL
+#endif
+
 /* Six different types of pages. */
 #define _PAGE_TYPE_EMPTY   0x400
 #define _PAGE_TYPE_NONE    0x401
@@ -514,6 +535,9 @@ static inline int pte_write(pte_t pte)
 
 static inline int pte_dirty(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+   return (pte_val(pte) & _PAGE_SOFT_CHANGED) != 0;
+#endif
/* A pte is neither clean nor dirty on s/390. The dirty bit
 * is in the storage key. See page_test_and_clear_dirty for
 * details.
@@ -523,6 +547,9 @@ static inline int pte_dirty(pte_t pte)
 
 static inline int pte_young(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+   return (pte_val(pte) & _PAGE_SOFT_REFERENCED) != 0;
+#endif
/* A pte is neither young nor old on s/390. The young bit
 * is in the storage key. See page_test_and_clear_young for
 * details.
@@ -582,7 +609,9 @@ static inline void pgd_clear(pgd_t * pgd
 static inline void pmd_clear_kernel(pmd_t * pmdp)
 {
pmd_val(*pmdp) = _PMD_ENTRY_INV | _PMD_ENTRY;
+#ifndef CONFIG_S390_HOST
pmd_val1(*pmdp) = _PMD_ENTRY_INV | _PMD_ENTRY;
+#endif
 }
 
 static inline void pmd_clear(pmd_t * pmdp)
@@ -632,6 +661,9 @@ static inline pte_t pte_mkwrite(pte_t pt
 
 static inline pte_t pte_mkclean(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+   pte_val(pte) &= ~_PAGE_SOFT_CHANGED;
+#endif
/* The only user of pte_mkclean is the fork() code.
   We must *not* clear the *physical* page dirty bit
   just because fork() wants to clear the dirty bit in
@@ -641,6 +673,9 @@ static inline pte_t pte_mkclean(pte_t pt
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+   pte_val(pte) |= _PAGE_SOFT_CHANGED;
+#endif
/* We do not explicitly set the dirty bit because the
 * sske instruction is slow. It is faster to let the
 * next instruction set the dirty bit.
@@ -650,6 +685,9 @@ static inline pte_t pte_mkdirty(pte_t pt
 
 static inline pte_t pte_mkold(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+   pte_val(pte) &= ~_PAGE_SOFT_REFERENCED;
+#endif
/* S/390 doesn't keep its dirty/referenced bit in the pte.
 * There is no point in clearing the real referenced bit.
 */
@@ -658,14 +696,111 @@ static inline pte_t pte_mkold(pte_t pte)
 
 static inline pte_t pte_mkyoung(pte_t pte)
 {
+#ifdef CONFIG_S390_HOST
+   pte_val(pte) |= _PAGE_SOFT_REFERENCED;
+#endif
/* S/390 doesn't keep its dirty/referenced bit in the pte.
 * 

[kvm-devel] [PATCH/RFC 2/9] s390 virtualization interface

2007-05-11 Thread Carsten Otte
From: Heiko Carstens [EMAIL PROTECTED]

Add interface which allows a process to start a virtual machine.

To keep things easy each thread group is allowed to have only one
virtual machine and each thread of the thread group can only control
one virtual cpu of the virtual machine. All the information about
the virtual machines/cpus can be found via the thread_info structures
of the participating threads.

This patch adds three new s390 specific system calls:

long sys_s390host_add_cpu(unsigned long addr, unsigned long flags,
  struct sie_block __user *sie_template)

Adds a new cpu to the virtual machine that belongs to the current
thread group. If no virtual machine exists it will be created. In
addition two pages will be allocated and mapped at addr into the
address space of the process. These two pages are used so user space
and kernel space can easily exchange/modify the state of the
corresponding virtual cpu without a ton of copy_from/to_user calls.
The sie_template is a pointer to a data structure that contains
initial information on how the virtual cpu should be set up. The
resulting block will be used as a parameter to issue the sie (start
interpretive execution) instruction which starts a virtual cpu.

int sys_s390host_remove_cpu(void)

Removes a virtual cpu from a virtual machine.

int sys_s390host_sie(unsigned long action)

Starts / re-enters the virtual cpu of the virtual machine that the
thread belongs to, if any.

Please note that this patch is nothing more than a proof-of-concept
and may contain quite a few bugs.
Since we want to convert to use kvm instead, most of this will be
dropped anyway. But maybe this is of interest for others as well.

Signed-off-by: Heiko Carstens [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]

---
 arch/s390/Kconfig   |7 
 arch/s390/Makefile  |2 
 arch/s390/host/Makefile |5 
 arch/s390/host/s390_intercept.c |   42 
 arch/s390/host/s390host.c   |  418 
 arch/s390/host/s390host.h   |   16 +
 arch/s390/host/sie64a.S |   38 +++
 arch/s390/kernel/asm-offsets.c  |2 
 arch/s390/kernel/process.c  |   15 +
 arch/s390/kernel/setup.c|4 
 arch/s390/kernel/syscalls.S |3 
 include/asm-s390/sie64.h|  279 ++
 include/asm-s390/thread_info.h  |8 
 include/asm-s390/unistd.h   |5 
 kernel/sys_ni.c |3 
 15 files changed, 842 insertions(+), 5 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/asm-offsets.c
===
--- linux-2.6.21.orig/arch/s390/kernel/asm-offsets.c
+++ linux-2.6.21/arch/s390/kernel/asm-offsets.c
@@ -44,5 +44,7 @@ int main(void)
DEFINE(__SF_BACKCHAIN, offsetof(struct stack_frame, back_chain),);
DEFINE(__SF_GPRS, offsetof(struct stack_frame, gprs),);
DEFINE(__SF_EMPTY, offsetof(struct stack_frame, empty1),);
+   BLANK();
+   DEFINE(__SIE_USER_gprs, offsetof(struct sie_user, gprs),);
return 0;
 }
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,6 @@ NI_SYSCALL /* 310 sys_move_pages *
 SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
 SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
 SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(sys_ni_syscall,sys_s390host_add_cpu,sys_ni_syscall)
+SYSCALL(sys_ni_syscall,sys_s390host_remove_cpu,sys_ni_syscall)
+SYSCALL(sys_ni_syscall,sys_s390host_sie,sys_ni_syscall)
Index: linux-2.6.21/arch/s390/host/Makefile
===
--- /dev/null
+++ linux-2.6.21/arch/s390/host/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the s390host components.
+#
+
+obj-$(CONFIG_S390_HOST) += s390host.o sie64a.o s390_intercept.o
Index: linux-2.6.21/arch/s390/host/sie64a.S
===
--- /dev/null
+++ linux-2.6.21/arch/s390/host/sie64a.S
@@ -0,0 +1,38 @@
+/*
+ *  arch/s390/host/sie64a.S
+ *    low level sie call
+ *
+ *    Copyright IBM Corp. 2007
+ *    Author(s): Heiko Carstens [EMAIL PROTECTED]
+ *    License  : GPL
+ */
+
+#include <linux/errno.h>
+#include <asm/asm-offsets.h>
+
+SP_R6 = 6 * 8   # offset into stackframe
+
+   .globl  sie64a
+sie64a:
+   stmg    %r6,%r15,SP_R6(%r15)            # save register on entry
+   lgr     %r14,%r2                        # pointer to program parms
+   aghi    %r2,4096
+   lmg     %r0,%r13,__SIE_USER_gprs(%r2)   # load guest gprs 0-13
+sie_inst:
+   sie     0(%r14)
+   aghi    %r14,4096
+   stmg    %r0,%r13,__SIE_USER_gprs(%r14)  # save guest gprs 0-13
+   lghi    %r2,0
+  

[kvm-devel] [PATCH/RFC 3/9] s390 guest detection

2007-05-11 Thread Carsten Otte
From: Christian Borntraeger [EMAIL PROTECTED]

This patch adds functionality to detect whether the kernel runs under an
s390host hypervisor. A macro MACHINE_IS_GUEST is exported for device drivers.
This allows drivers to skip device detection if the system runs non-virtualized.
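
How a driver might use the new flag can be sketched as follows. This is a
stand-alone illustration, not kernel code: struct cpu_id_sketch,
detect_flags(), machine_is_guest() and guest_driver_init() are hypothetical
names, and only the flag values and the two comparisons are taken from the
hunks below. Other detection steps, such as the z/VM check, are not visible
in the patch and are left out.

```c
#include <assert.h>

/* Flag values as they appear in include/asm-s390/setup.h in this patch. */
#define FLAG_IS_VM     1UL
#define FLAG_IS_P390   4UL
#define FLAG_IS_GUEST  64UL

/* Hypothetical stand-in for the relevant fields of the kernel's cpu id. */
struct cpu_id_sketch {
    unsigned int version;
    unsigned int machine;
};

/* Mirrors the two checks visible in the early.c hunk. */
static unsigned long detect_flags(const struct cpu_id_sketch *id)
{
    unsigned long flags = 0;

    if (id->machine == 0x7490)      /* running on a P/390 */
        flags |= FLAG_IS_P390;
    if (id->version == 0xfe)        /* running under an s390host hypervisor */
        flags |= FLAG_IS_GUEST;
    return flags;
}

static int machine_is_guest(unsigned long flags)
{
    return (flags & FLAG_IS_GUEST) != 0;
}

/* Hypothetical driver init: skip device detection unless virtualized. */
static int guest_driver_init(unsigned long flags, int *probed)
{
    *probed = 0;
    if (!machine_is_guest(flags))
        return 0;       /* not an error, there is simply nothing to do */
    *probed = 1;        /* a real driver would register its devices here */
    return 0;
}

static void example_detect(void)
{
    /* The machine type is an arbitrary value; only version matters here. */
    struct cpu_id_sketch guest = { 0xfe, 0x2094 };
    int probed;

    guest_driver_init(detect_flags(&guest), &probed);
    assert(probed == 1);
}
```

The console driver in patch 5/9 follows this pattern: its init function
returns 0 early when MACHINE_IS_GUEST is not set.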

Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]

---
 arch/s390/kernel/early.c |4 
 arch/s390/kernel/setup.c |9 ++---
 include/asm-s390/setup.h |1 +
 3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/setup.c
===
--- linux-2.6.21.orig/arch/s390/kernel/setup.c
+++ linux-2.6.21/arch/s390/kernel/setup.c
@@ -744,9 +744,12 @@ setup_arch(char **cmdline_p)
            "This machine has an IEEE fpu\n" :
            "This machine has no IEEE fpu\n");
 #else /* CONFIG_64BIT */
-   printk((MACHINE_IS_VM) ?
-          "We are running under VM (64 bit mode)\n" :
-          "We are running native (64 bit mode)\n");
+   if (MACHINE_IS_VM)
+       printk("We are running under VM (64 bit mode)\n");
+   else if (MACHINE_IS_GUEST)
+       printk("We are running on a non z/VM host\n");
+   else
+       printk("We are running native (64 bit mode)\n");
 #endif /* CONFIG_64BIT */
 
/* Save unparsed command line copy for /proc/cmdline */
Index: linux-2.6.21/include/asm-s390/setup.h
===
--- linux-2.6.21.orig/include/asm-s390/setup.h
+++ linux-2.6.21/include/asm-s390/setup.h
@@ -61,6 +61,7 @@ extern unsigned long machine_flags;
 #define MACHINE_IS_VM      (machine_flags & 1)
 #define MACHINE_IS_P390    (machine_flags & 4)
 #define MACHINE_HAS_MVPG   (machine_flags & 16)
+#define MACHINE_IS_GUEST   (machine_flags & 64)
 #define MACHINE_HAS_IDTE   (machine_flags & 128)
 #define MACHINE_HAS_DIAG9C (machine_flags & 256)
 
Index: linux-2.6.21/arch/s390/kernel/early.c
===
--- linux-2.6.21.orig/arch/s390/kernel/early.c
+++ linux-2.6.21/arch/s390/kernel/early.c
@@ -139,6 +139,10 @@ static noinline __init void detect_machi
    /* Running on a P/390 ? */
    if (cpuinfo->cpu_id.machine == 0x7490)
        machine_flags |= 4;
+
+   /* Running under a host ? */
+   if (cpuinfo->cpu_id.version == 0xfe)
+       machine_flags |= 64;
 }
 
 #ifdef CONFIG_64BIT



-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


[kvm-devel] [PATCH/RFC 5/9] s390 virtual console for guests

2007-05-11 Thread Carsten Otte
From: Carsten Otte [EMAIL PROTECTED]

This driver provides a simple virtualized console. Userspace can
use read/write to its console to pass the data to the host.

Signed-off-by: Carsten Otte [EMAIL PROTECTED]

---
 drivers/s390/Kconfig   |5 +
 drivers/s390/guest/Makefile|1 
 drivers/s390/guest/guest_console.c |   72 +
 drivers/s390/guest/guest_console.h |   47 +++
 drivers/s390/guest/guest_tty.c |  153 +
 5 files changed, 278 insertions(+)

Index: linux-2.6.21/drivers/s390/guest/guest_console.c
===
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_console.c
@@ -0,0 +1,72 @@
+/*
+ * guest console device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte [EMAIL PROTECTED]
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/console.h>
+#include <linux/string.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include "guest_console.h"
+
+#define guest_console_major 4  /* TTYAUX_MAJOR */
+#define guest_console_minor 65
+#define guest_console_name  "ttyS"
+
+static void guest_console_write(struct console *console, const char *string,
+unsigned len)
+{
+   int ret;
+   size_t pos;
+
+   for(pos=0; pos < strlen(string); pos += ret) {
+       ret = diag_write(1, string + pos, len - pos);
+       if (ret <= 0)
+   break;
+   }
+}
+
+static struct tty_driver *
+guest_console_device(struct console *c, int *index)
+{
+   *index = c->index;
+   return guest_tty_driver;
+}
+
+static void
+guest_console_unblank(void)
+{
+   return;
+}
+
+static struct console guest_console =
+{
+   .name = guest_console_name,
+   .write = guest_console_write,
+   .device = guest_console_device,
+   .unblank = guest_console_unblank,
+   .flags = CON_PRINTBUFFER,
+   .index = 0 /* ttyS0 */
+};
+
+/*
+ * called by console_init() in drivers/char/tty_io.c at boot-time.
+ */
+static int __init
+guest_console_init(void)
+{
+   if (!MACHINE_IS_GUEST)
+   return 0;
+
+   printk (KERN_INFO "z/Live console initialized\n");
+
+   /* enable printk-access to this driver */
+   register_console(&guest_console);
+   return 0;
+}
+
+console_initcall(guest_console_init);
+
Index: linux-2.6.21/drivers/s390/guest/guest_console.h
===
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_console.h
@@ -0,0 +1,47 @@
+/*
+ * guest console device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte [EMAIL PROTECTED]
+ */
+
+
+#ifndef __GCONSOLE_H
+#define __GCONSOLE_H
+extern struct tty_driver *guest_tty_driver;
+static inline int diag_write(int fd, const void *buffer, size_t count)
+{
+   register long __arg1 asm("2") = fd;
+   register const void * __arg2 asm("3") = buffer;
+   register size_t __arg3 asm("4") = count;
+   register long __svcres asm("2");
+   long __res;
+   asm volatile (
+       "diag 0,0,2"
+       : "=d" (__svcres)
+       : "0" (__arg1),
+         "d" (__arg2),
+         "d" (__arg3)
+       : "cc", "memory");
+   __res = __svcres;
+   return __res;
+}
+
+static inline int diag_read(int fd, const void *buffer, size_t count)
+{
+   register long __arg1 asm("2") = fd;
+   register const void * __arg2 asm("3") = buffer;
+   register size_t __arg3 asm("4") = count;
+   register long __svcres asm("2");
+   long __res;
+   asm volatile (
+       "diag 0,0,1"
+       : "=d" (__svcres)
+       : "0" (__arg1),
+         "d" (__arg2),
+         "d" (__arg3)
+       : "cc", "memory");
+   __res = __svcres;
+   return __res;
+}
+#endif
+
Index: linux-2.6.21/drivers/s390/guest/guest_tty.c
===
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_tty.c
@@ -0,0 +1,153 @@
+/*
+ * guest console tty device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte [EMAIL PROTECTED]
+ */
+
+#include <linux/fs.h>
+#include <linux/tty.h>
+#include <linux/tty_flip.h>
+#include <linux/module.h>
+#include <asm/s390_ext.h>
+#include "guest_console.h"
+
+struct tty_driver *guest_tty_driver;
+static struct tty_struct *guest_tty;
+
+MODULE_DESCRIPTION("Guest console for linux guests");
+MODULE_AUTHOR("Carsten Otte [EMAIL PROTECTED]");
+MODULE_LICENSE("GPL");
+
+static int
+guest_tty_open(struct tty_struct *tty, struct file *filp)
+{
+   guest_tty = tty;
+   tty->driver_data = NULL;
+   return 0;
+}
+
+static void
+guest_tty_close(struct tty_struct *tty, struct file *filp)
+{
+   if (tty->count > 1)
+   return;
+   guest_tty = NULL;
+}
+
+static int
+guest_tty_ioctl(struct tty_struct *tty, struct file * file,
+  unsigned int cmd, unsigned long arg)
+{
+   

[kvm-devel] [PATCH/RFC 9/9] Fix system-user misaccount of interpreted execution

2007-05-11 Thread Carsten Otte
From: Christian Borntraeger [EMAIL PROTECTED]

This patch fixes the accounting of guest cpu time. As sie is executed via a
system call, all guest operations were accounted as system time. To fix this
we define a per-thread sie context. Before issuing the sie instruction we
enter this context, and we leave it afterwards. enter_sie and exit_sie
call account_system_vtime, which now checks for being in sie_context. We
define the sie_context to be accounted as user time.
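
The accounting rule can be modelled outside the kernel as follows. This is a
simplified sketch under stated assumptions, not the kernel code: struct
task_sketch and the *_sketch functions are hypothetical stand-ins, and the
time that the kernel reads from the lowcore timers is passed in as a plain
argument.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the per-thread state. */
struct task_sketch {
    bool in_sie;                /* set on entering, cleared on leaving sie */
    unsigned long user_time;
    unsigned long system_time;
};

/* Charge the time accumulated since the last call. It counts as user time
 * only if the thread is in the sie context and the cpu is not handling a
 * hard or soft interrupt. */
static void account_vtime_sketch(struct task_sketch *t, unsigned long cputime,
                                 bool in_hardirq, bool in_softirq)
{
    if (t->in_sie && !in_hardirq && !in_softirq)
        t->user_time += cputime;
    else
        t->system_time += cputime;
}

/* Entering first flushes the pending time, which is still system time. */
static void enter_sie_sketch(struct task_sketch *t, unsigned long pending)
{
    account_vtime_sketch(t, pending, false, false);
    t->in_sie = true;
}

/* Leaving first flushes the time spent in the guest, which is user time. */
static void exit_sie_sketch(struct task_sketch *t, unsigned long pending)
{
    account_vtime_sketch(t, pending, false, false);
    t->in_sie = false;
}

static void example_accounting(void)
{
    struct task_sketch t = { false, 0, 0 };

    enter_sie_sketch(&t, 5);    /* 5 units of system call entry work */
    exit_sie_sketch(&t, 100);   /* 100 units spent running the guest */
    assert(t.system_time == 5);
    assert(t.user_time == 100);
}
```

The order matters: the flag is changed only after the pending time has been
charged, so each interval is charged under the state that was in effect while
it elapsed.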

Possible future enhancement: We could add an additional field, interpretation
time, to cpu stat and process time. Thus we could differentiate between user
time in the host and host user time spent for guests. The main challenge is
the necessary user space change. Therefore, we could export the interpretation
time with a new interface. To be defined.

Signed-off-By: Christian Borntraeger [EMAIL PROTECTED]
Signed-off-By: Carsten Otte [EMAIL PROTECTED]

---
 arch/s390/Kconfig  |1 +
 arch/s390/host/s390host.c  |   15 +++
 arch/s390/kernel/process.c |1 +
 arch/s390/kernel/vtime.c   |   11 ++-
 include/asm-s390/thread_info.h |2 ++
 5 files changed, 29 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/vtime.c
===
--- linux-2.6.21.orig/arch/s390/kernel/vtime.c
+++ linux-2.6.21/arch/s390/kernel/vtime.c
@@ -97,6 +97,11 @@ void account_vtime(struct task_struct *t
account_system_time(tsk, 0, cputime);
 }
 
+static inline int task_is_in_sie(struct thread_info *thread)
+{
+   return thread->in_sie;
+}
+
 /*
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer & steal_clock.
@@ -114,7 +119,11 @@ void account_system_vtime(struct task_st
    cputime =  S390_lowcore.system_timer >> 12;
    S390_lowcore.system_timer -= cputime << 12;
    S390_lowcore.steal_clock -= cputime << 12;
-   account_system_time(tsk, 0, cputime);
+
+   if (task_is_in_sie(tsk->thread_info) && !hardirq_count() && !softirq_count())
+       account_user_time(tsk, cputime);
+   else
+       account_system_time(tsk, 0, cputime);
 }
 
 static inline void set_vtimer(__u64 expires)
Index: linux-2.6.21/arch/s390/host/s390host.c
===
--- linux-2.6.21.orig/arch/s390/host/s390host.c
+++ linux-2.6.21/arch/s390/host/s390host.c
@@ -27,6 +27,19 @@ static int s390host_do_action(unsigned l
 
 static DEFINE_MUTEX(s390host_init_mutex);
 
+static void enter_sie(void)
+{
+   account_system_vtime(current);
+   current_thread_info()->in_sie = 1;
+}
+
+static void exit_sie(void)
+{
+   account_system_vtime(current);
+   current_thread_info()->in_sie = 0;
+}
+
+
 static void s390host_get_data(struct s390host_data *data)
 {
    atomic_inc(&data->count);
@@ -297,7 +310,9 @@ again:
schedule();
 
    sie_kernel->sie_block.icptcode = 0;
+   enter_sie();
ret = sie64a(sie_kernel);
+   exit_sie();
if (ret)
goto out;
 
Index: linux-2.6.21/include/asm-s390/thread_info.h
===
--- linux-2.6.21.orig/include/asm-s390/thread_info.h
+++ linux-2.6.21/include/asm-s390/thread_info.h
@@ -55,6 +55,7 @@ struct thread_info {
    struct restart_block    restart_block;
    struct s390host_data    *s390host_data; /* s390host data */
    int                     sie_cpu;        /* sie cpu number */
+   int                     in_sie;         /* 1 = cpu is in sie*/
 };
 
 /*
@@ -72,6 +73,7 @@ struct thread_info {
},  \
.s390host_data  = NULL, \
.sie_cpu= 0,\
+   .in_sie = 0,\
 }
 
 #define init_thread_info   (init_thread_union.thread_info)
Index: linux-2.6.21/arch/s390/kernel/process.c
===
--- linux-2.6.21.orig/arch/s390/kernel/process.c
+++ linux-2.6.21/arch/s390/kernel/process.c
@@ -278,6 +278,7 @@ int copy_thread(int nr, unsigned long cl
    memset(&p->thread.per_info,0,sizeof(p->thread.per_info));
    p->thread_info->s390host_data = NULL;
    p->thread_info->sie_cpu = -1;
+   p->thread_info->in_sie = 0;
 
 return 0;
 }
Index: linux-2.6.21/arch/s390/Kconfig
===
--- linux-2.6.21.orig/arch/s390/Kconfig
+++ linux-2.6.21/arch/s390/Kconfig
@@ -519,6 +519,7 @@ config S390_HOST
    bool "s390 host support (EXPERIMENTAL)"
    depends on 64BIT && EXPERIMENTAL
select S390_SWITCH_AMODE
+   select VIRT_CPU_ACCOUNTING
help
  Select this option if you want to host guest Linux images
 




Re: [kvm-devel] [PATCH/RFC 5/9] s390 virtual console for guests

2007-05-11 Thread Anthony Liguori
I think it would be better to use hvc_console as Xen now uses it too.

Carsten Otte wrote:
> +   if (!MACHINE_IS_GUEST)
> +       return 0;
> +   register_external_interrupt(0x1234, guest_tty_ext_handler);

This is an interesting way to get input data from the console :-)  How 
many interrupts does s390 support (the x86 only supports 256)?  Can you 
afford to burn interrupts like this?  Is there not a better way to 
assign interrupts such that conflict isn't an issue?

Regards,

Anthony Liguori



Re: [kvm-devel] [PATCH/RFC 5/9] s390 virtual console for guests

2007-05-11 Thread Christian Bornträger
On Friday 11 May 2007 21:00, Anthony Liguori wrote:

> I think it would be better to use hvc_console as Xen now uses it too.

I don't know hvc_console, but I will have a look at it.

> Carsten Otte wrote:
> > +   if (!MACHINE_IS_GUEST)
> > +       return 0;
> > +   register_external_interrupt(0x1234, guest_tty_ext_handler);
>
> This is an interesting way to get input data from the console :-)  How
> many interrupts does s390 support (the x86 only supports 256)?  Can you
> afford to burn interrupts like this?  Is there not a better way to
> assign interrupts such that conflict isn't an issue?

On s390 we have a 16-bit interrupt code, so we actually have plenty of
numbers... But yes, it's a very good point: burning interrupts won't work
cross-platform.

Our patches are prototypes and need rework anyway. Take these patches as a
contribution to the discussion, in the spirit of "release early". :-)

cheers

Christian



Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device driver

2007-05-11 Thread ron minnich
Let me ask what may seem to be a naive question to the linux world. I
see you are doing a lot of solid work on adding block and network
devices. The code for block and network devices is implemented in
different ways. I've also seen this difference of
interface/implementation on Xen.

Hence my question:
Why are the INTERFACES to the block and network devices different? I
can understand that the implementation -- what goes on inside the
box -- would be different. But, again, why is the interface to the
resource different in each case? Will every distinct type of I/O
device end up with a different interface?

These questions doubtless seem naive, I suppose, except I use a system
(Plan 9) in which a common interface is in fact used for the different
resources. I have been hoping that we could bring this model -- same
interface, different resource -- to the inter-vm communications. I
would like to at least raise the idea that it could be used on KVM.

Avoiding too much detail, in the plan 9 world, read and write of data
to a disk is via file read and write system calls. Same for a network.
Same for the mouse, the window system, the serial port, the console,
USB, and so on. Please see this note from IBM on what is possible:
http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

Different resources, same interface. In the hypervisor world, you
build one shared memory queue as a basic abstraction. On top of that
queue, you run 9P. The provider (network, block device, etc.) provides
certain resources to you, the guest domain. The resources have names. A
network can look like this, to a kvm guest (this command from a Plan 9
system):
cpu% ls /net/ether0
/net/ether0/0
/net/ether0/1
/net/ether0/2
/net/ether0/addr
/net/ether0/clone
/net/ether0/ifstats
/net/ether0/stats
To get network stats, or do I/O, one simply gains access to the
appropriate ring buffer, by finding the name, and does the ring buffer
sends and receives via shared memory queues. The I/O operations can be
very efficient.

Disk looks like this:
cpu% ls -l /dev/sdC0
--rw-r- S 0 bootes bootes   104857600 Jan 22 15:49 /dev/sdC0/9fat
--rw-r- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
--rw-r- S 0 bootes bootes   0 Jan 22 15:49 /dev/sdC0/ctl
--rw-r- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
--rw-r- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
--rw-r- S 0 bootes bootes  3268060672 Jan 22 15:49 /dev/sdC0/isect
--rw-r- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
--rw-r- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
-lrw--- S 0 bootes bootes   0 Jan 22 15:49 /dev/sdC0/raw
--rw-r- S 0 bootes bootes   536870912 Jan 22 15:49 /dev/sdC0/swap
cpu%

So the disk partitions are files, with the data file being the
whole disk. Again, on a hypervisor system, to do I/O, software could
create a connection to the file and establish the in-memory ring
buffer, for that partition. This I/O can be very efficient; IBM
research is working on zero-copy mechanisms for moving data between
domains.

The result is a single, consistent mechanism for accessing all
resources from a guest domain. The resources have names, and it is
easy to examine the status -- binary interfaces can be minimized. The
resources can be provided by in-kernel servers -- Linux drivers -- or
out-of-kernel servers -- processes. Same interface, and yet the
implementation of the provider of the resource can be utterly
different.
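
The "same interface, different resource" idea can be sketched in a few lines
of C. This is a toy illustration and not 9P: struct resource,
resource_open() and the in-memory backing store are hypothetical, and the two
names are borrowed from the listings above purely as labels.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: every resource, whatever backs it, is reached by
 * name and exposes the same read and write operations. */
struct resource {
    const char *name;
    long (*read)(struct resource *r, void *buf, size_t len);
    long (*write)(struct resource *r, const void *buf, size_t len);
    char   data[64];            /* toy backing store */
    size_t used;
};

static long mem_read(struct resource *r, void *buf, size_t len)
{
    if (len > r->used)
        len = r->used;
    memcpy(buf, r->data, len);
    return (long)len;
}

static long mem_write(struct resource *r, const void *buf, size_t len)
{
    if (len > sizeof(r->data))
        len = sizeof(r->data);
    memcpy(r->data, buf, len);
    r->used = len;
    return (long)len;
}

/* A "network" and a "disk" resource; a real provider would plug in
 * different read/write implementations behind the same signature. */
static struct resource resources[] = {
    { "/net/ether0/stats", mem_read, mem_write, "", 0 },
    { "/dev/sdC0/data",    mem_read, mem_write, "", 0 },
};

/* Lookup by name is the only discovery step the guest needs. */
static struct resource *resource_open(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(resources) / sizeof(resources[0]); i++)
        if (strcmp(resources[i].name, name) == 0)
            return &resources[i];
    return NULL;
}

static void example_roundtrip(void)
{
    struct resource *disk = resource_open("/dev/sdC0/data");
    char buf[8] = { 0 };

    assert(disk != NULL);
    assert(disk->write(disk, "abc", 3) == 3);
    assert(disk->read(disk, buf, sizeof(buf)) == 3);
    assert(strcmp(buf, "abc") == 0);
}
```

The caller never needs to know whether a name is backed by an in-kernel
driver or by a user space process; only the function pointers behind the
name differ.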

We had hoped to get something like this into Xen. On Xen, for example,
the block device and ethernet device interfaces are as different as
one could imagine. Disk I/O does not steal pages from the guest. The
network does. Disk I/O is in 4k chunks, period, with a bitmap
describing which of the 8 512-byte subunits are being sent. The enet
device, on read, returns a page with your packet, but also potentially
containing bits of other domains' packets too. The interfaces are as
dissimilar as they can be, and I see no reason for such a huge
variance between what are basically read/write devices.

Another issue is that kvm, in its current form (-24) is beautifully
simple. These additions seem to detract from the beauty a bit. Might
it be worth taking a little time to consider these ideas in order to
preserve the basic elegance of KVM?

So, before we go too far down the Xen-like paravirtualized device
route, can we discuss the way this ought to look a bit?

thanks

ron


Re: [kvm-devel] [PATCH/RFC 4/9] Basic guest virtual devices infrastructure

2007-05-11 Thread Arnd Bergmann
On Friday 11 May 2007, Carsten Otte wrote:

> This patch adds support for a new bus type that manages paravirtualized
> devices. The bus uses the s390 diagnose instruction to query devices, and
> match them with the corresponding drivers.

It seems that the diagnose instruction is really the only s390 specific
thing in here, right? I guess this part of your series is the first one
that we should have in an architecture independent way.

There may also be the chance of merging this with existing virtual
buses like the one for the ps3, which also just exists using
hypercalls.

> +int vdev_match(struct device * dev, struct device_driver *drv)
> +{
> +   struct vdev *vdev = to_vdev(dev);
> +   struct vdev_driver *vdrv = to_vdrv(drv);
> +
> +   if (vdev->vdev_type == vdrv->vdev_type)
> +       return 1;
> +
> +   return 0;
> +}

Why invent device type numbers? On open firmware, we just do a string compare,
which is more intuitive, and means you don't need any further 
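
A string-based match along the lines suggested here might look like the
following. It is only a sketch: struct vdev_sketch, struct vdev_driver_sketch
and the type strings are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical types: a `type` string replaces the numeric vdev_type of
 * the quoted patch. */
struct vdev_sketch        { const char *type; };
struct vdev_driver_sketch { const char *type; };

/* Returns 1 if the driver handles this device, 0 otherwise. */
static int vdev_match_by_name(const struct vdev_sketch *dev,
                              const struct vdev_driver_sketch *drv)
{
    return strcmp(dev->type, drv->type) == 0;
}

static void example_match(void)
{
    struct vdev_sketch nic = { "vnet" };
    struct vdev_driver_sketch net_drv  = { "vnet" };
    struct vdev_driver_sketch disk_drv = { "vdisk" };

    assert(vdev_match_by_name(&nic, &net_drv) == 1);
    assert(vdev_match_by_name(&nic, &disk_drv) == 0);
}
```

With names instead of numbers, a new device class needs no central list of
type constants shared between host and guest.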

> +int vdev_probe(struct device * dev)
> +{
> +   struct vdev *vdev = to_vdev(dev);
> +   struct vdev_driver *vdrv = to_vdrv(dev->driver);
> +
> +   return vdrv->probe(vdev);
> +}

This abstraction is unnecessary, just do the to_vdev() conversion inside
of the individual drivers.

> +
> +struct device vdev_bus = {
> +   .bus_id  = "vdev0",
> +   .release = vdev_bus_release
> +};
>
> +static void vdev_bus_release (struct device *device)
> +{
> +   /* noop, static bus object */
> +}

Just make the root of your devices a platform_device, then you don't need
to do dirty tricks like this.

> +static int vdev_scan_coldplug(void)
> +{
> +   int rc;
> +   struct vdev *device;
> +
> +   do {
> +       device = kzalloc(sizeof(struct vdev), GFP_ATOMIC);
> +       if (!device) {
> +           rc = -ENOMEM;
> +           goto out;
> +       }
> +       rc = vdev_diag_hotplug(device->symname, device->hostid);
> +       if (rc == -ENODEV)
> +           break;
> +       if (rc < 0) {
> +           printk (KERN_WARNING "vdev: error %d detecting \
> +                   initial devices\n", rc);
> +           break;
> +       }
> +       device->vdev_type = rc;
> +
> +       //sanity: are strings terminated?
> +       if ((strnlen(device->symname, 128) == 128) ||
> +           (strnlen(device->hostid, 128) == 128)) {
> +           // warn and discard device
> +           printk ("vdev: illegal device entry received\n");
> +           break;
> +       }
> +
> +       rc = vdevice_register(device);
> +       if (rc) {
> +           kfree(device);
> +       } else
> +           switch (device->vdev_type) {
> +           case VDEV_TYPE_DISK:
> +               printk (KERN_INFO "vdev: storage device  \
> +                       detected: %s\n", device->symname);
> +               break;
> +           case VDEV_TYPE_NET:
> +               printk (KERN_INFO "vdev: network device  \
> +                       detected: %s\n", device->symname);
> +               break;
> +           default:
> +               printk (KERN_INFO "vdev: unknown device  \
> +                       detected: %s\n", device->symname);
> +           }
> +   } while(1);
> +   kfree (device);
> + out:
> +   return 0;
> +}

Interesting concept of probing the bus -- so you just ask if there are
any new devices, right?

> +#define VDEV_TYPE_DISK 0
> +#define VDEV_TYPE_NET  1
> +
> +struct vdev {
> +   unsigned int        vdev_type;
> +   char                symname[128];
> +   char                hostid[128];
> +   struct vdev_driver  *driver;
> +   struct device       dev;
> +   void                *drv_private;
> +};

You shouldn't need the driver and drv_private fields -- they are already
present in struct device.

Arnd 



Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device driver

2007-05-11 Thread Anthony Liguori
ron minnich wrote:
> Avoiding too much detail, in the plan 9 world, read and write of data
> to a disk is via file read and write system calls.

For low speed devices, I think paravirtualization doesn't make a lot of 
sense unless it's absolutely required.  I don't know enough about s390 
to know if it supports things like uarts but if so, then emulating a 
uart would in my mind make a lot more sense than a PV console device.

> Same for a network.
> Same for the mouse, the window system, the serial port, the console,
> USB, and so on. Please see this note from IBM on what is possible:
> http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
> or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
> Different resources, same interface. In the hypervisor world, you
> build one shared memory queue as a basic abstraction. On top of that
> queue, you run 9P. The provider (network, block device, etc.) provides
> certain resources to you, the guest domain. The resources have names. A
> network can look like this, to a kvm guest (this command from a Plan 9
> system):
> cpu% ls /net/ether0
> /net/ether0/0
> /net/ether0/1
> /net/ether0/2
> /net/ether0/addr
> /net/ether0/clone
> /net/ether0/ifstats
> /net/ether0/stats

This smells a bit like XenStore which I think most will agree was an 
unmitigated disaster.  This sort of thing gets terribly complicated to 
deal with in the corner cases.  Atomic operation of multiple read/write 
operations is difficult to express.  Moreover, quite a lot of things are 
naturally expressed as a state machine, which is not straightforward to 
do in this sort of model.  This may have been all figured out in 9P but 
it's certainly not a simple thing to get right.

I think a general rule of thumb for a virtualized environment is that 
the closer you stick to the way hardware tends to do things, the less 
likely you are to screw yourself up and the easier it will be for other 
platforms to support your devices.  Implementing a full 9P client just 
to get console access in something like mini-os would be unfortunate.  
At least the posted s390 console driver behaves roughly like a uart so 
it's pretty obvious that it will be easy to implement in any OS that 
supports uarts already.

Regards,

Anthony Liguori

 To get network stats, or do I/O, one simply gains access to the
 appropriate ring buffer, by finding the name, and does the ring buffer
 sends and receives via shared memory queues. The I/O operations can be
 very efficient.

 Disk looks like this:
 cpu% ls -l /dev/sdC0
 --rw-r- S 0 bootes bootes   104857600 Jan 22 15:49 /dev/sdC0/9fat
 --rw-r- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
 --rw-r- S 0 bootes bootes   0 Jan 22 15:49 /dev/sdC0/ctl
 --rw-r- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
 --rw-r- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
 --rw-r- S 0 bootes bootes  3268060672 Jan 22 15:49 /dev/sdC0/isect
 --rw-r- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
 --rw-r- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
 -lrw--- S 0 bootes bootes   0 Jan 22 15:49 /dev/sdC0/raw
 --rw-r- S 0 bootes bootes   536870912 Jan 22 15:49 /dev/sdC0/swap
 cpu%

 So the disk partitions are files, with the data file being the
 whole disk. Again, on a hypervisor system, to do I/O, software could
 create a connection to the file and establish the in-memory ring
 buffer, for that partition. This I/O can be very efficient; IBM
 research is working on zero-copy mechanisms for moving data between
 domains.

 The result is a single, consistent mechanism for accessing all
 resources from a guest domain. The resources have names, and it is
 easy to examine the status -- binary interfaces can be minimized. The
 resources can be provided by in-kernel servers -- Linux drivers -- or
 out-of-kernel servers -- processes. Same interface, and yet the
 implementation of the provider of the resource can be utterly
 different.

 We had hoped to get something like this into Xen. On Xen, for example,
 the block device and ethernet device interfaces are as different as
 one could imagine. Disk I/O does not steal pages from the guest. The
 network does. Disk I/O is in 4k chunks, period, with a bitmap
 describing which of the 8 512-byte subunits are being sent. The enet
 device, on read, returns a page with your packet, but also potentially
 containing bits of other domain's packets too. The interfaces are as
 dissimilar as they can be, and I see no reason for such a huge
 variance between what are basically read/write devices.

 Another issue is that kvm, in its current form (-24) is beautifully
 simple. These additions seem to detract from the beauty a bit. Might
 it be worth taking a little time to consider these ideas in order to
 preserve the basic elegance of KVM?

 So, before we go too far down the Xen-like paravirtualized device
 route, can we discuss 

Re: [kvm-devel] [PATCH/RFC 8/9] Virtual network host switch support

2007-05-11 Thread Anthony Liguori
Carsten Otte wrote:
 From: Christian Borntraeger [EMAIL PROTECTED]

 This is the host counterpart for the virtual network device driver. This driver
 has a char device node where the hypervisor can attach. It also
 has a kind of dumb switch that passes packets between guests. Last but not least
 it contains a host network interface. Patches for attaching other host network
 devices to the switch via raw sockets, extensions to qeth or netfilter are
   

Any feel for the performance relative to the bridging code?  The 
bridging code is a pretty big bottleneck in guest<=>guest communications 
in Xen at least.

 currently tested but not ready yet. We did not use the linux bridging code to
 allow non-root users to create virtual networks between guests. 
   

Is that the primary reason?  If so, that seems like a rather large 
hammer for something that a userspace suid wrapper could have addressed...

Regards,

Anthony Liguori

 Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
 Signed-off-by: Carsten Otte [EMAIL PROTECTED]

 ---
  drivers/s390/guest/Makefile  |3 
  drivers/s390/guest/vnet_port_guest.c |  302 
  drivers/s390/guest/vnet_port_guest.h |   21 
  drivers/s390/guest/vnet_port_host.c  |  418 +
  drivers/s390/guest/vnet_port_host.h  |   18 
  drivers/s390/guest/vnet_switch.c |  828 +++
  drivers/s390/guest/vnet_switch.h |  119 +
  drivers/s390/net/Kconfig |   12 
  8 files changed, 1721 insertions(+)

 Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
 ===
 --- /dev/null
 +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
 @@ -0,0 +1,302 @@
 +/*
 + *  Copyright (C) 2005 IBM Corporation
 + *  Authors: Carsten Otte [EMAIL PROTECTED]
 + *   Christian Borntraeger [EMAIL PROTECTED]
 + *
 + */
 +#include <linux/etherdevice.h>
 +#include <linux/fs.h>
 +#include <linux/kernel.h>
 +#include <linux/list.h>
 +#include <linux/module.h>
 +#include <linux/pagemap.h>
 +#include <linux/poll.h>
 +#include <linux/spinlock.h>
 +
 +#include "vnet.h"
 +#include "vnet_port_guest.h"
 +#include "vnet_switch.h"
 +
 +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data)
 +{
 + int oldval, newval;
 +
 + do {
 + oldval = atomic_read(&zgp->pending_irqs);
 + newval = oldval | data;
 + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval);
 +}
 +
 +static int COFIXME_get_irq(struct vnet_guest_port *zgp)
 +{
 + int oldval;
 +
 + do {
 + oldval = atomic_read(&zgp->pending_irqs);
 + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval);
 +
 + return oldval;
 +}
 +
 +static void
 +vnet_guest_interrupt(struct vnet_port *port, int type)
 +{
 + struct vnet_guest_port *priv;
 +
 + priv = port->priv;
 +
 + if (!priv->fasync) {
 + printk (KERN_WARNING "vnet: cannot send interrupt, "
 + "fd not async\n");
 + return;
 + }
 + switch (type) {
 + case VNET_IRQ_START_RX:
 + COFIXME_add_irq(priv, POLLIN);
 + kill_fasync(&priv->fasync, SIGIO, POLL_IN);
 + break;
 + case VNET_IRQ_START_TX:
 + COFIXME_add_irq(priv, POLLOUT);
 + kill_fasync(&priv->fasync, SIGIO, POLL_OUT);
 + break;
 + default:
 + BUG();
 + }
 +}
 +
 +/* release all pinned user pages*/
 +static void
 +vnet_guest_release_pages(struct vnet_port *port)
 +{
 + int i,j;
 +
 + for (i=0; i<VNET_QUEUE_LEN; i++)
 + for (j=0; j<VNET_BUFFER_PAGES; j++) {
 + if (port->s2p_data[i][j]) {
 + page_cache_release(virt_to_page(port->s2p_data[i][j]));
 + port->s2p_data[i][j] = NULL;
 + }
 + if (port->p2s_data[i][j]) {
 + page_cache_release(virt_to_page(port->p2s_data[i][j]));
 + port->p2s_data[i][j] = NULL;
 + }
 + }
 + if (port->control) {
 + page_cache_release(virt_to_page(port->control));
 + port->control = NULL;
 + }
 +}
 +
 +static int
 +vnet_chr_open(struct inode *ino, struct file *filp)
 +{
 + int minor;
 + struct vnet_port *port;
 + char name[BUS_ID_SIZE];
 +
 + minor = iminor(filp->f_dentry->d_inode);
 + snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid);
 + port = vnet_port_get(minor, name);
 + if (!port)
 + return -ENODEV;
 + port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL);
 + if (!port->priv) {
 + vnet_port_put(port);
 + return -ENOMEM;
 + }
 + port->interrupt = vnet_guest_interrupt;
 + filp->private_data = port;
 + return nonseekable_open(ino, filp);
 +}
 +
 +static int
 +vnet_chr_release (struct inode *ino, struct file *filp)
 +{
 + struct 

[kvm-devel] [PATCH 0/4] in-kernel APIC v3a (usermode side)

2007-05-11 Thread Gregory Haskins
I re-worked the QEMU patches based on feedback from Dor and Anthony.  Here is
the changelog:

1) Got rid of the extern kvm_context from qemu/pc/apic.c.  This function is
   now wrapped by qemu-kvm which assigns proper kvm_context on behalf of the
   caller.  

2) Added support for a command line option: --kvm_apic [0 | 1 | 2].  The
   system defaults to level-1 mode (KVM based LAPIC).  Level-0 (QEMU based
   LAPIC) is also supported.  Level-2 is not supported yet, TBD.

3) Added the idea that Anthony proposed to have kvm_allowed=0 defined, even
if USE_KVM is not.

4) Cleaned up indentation

5) Cleaned up support for level-0 mode.

I have tested this code (in conjunction with the v3 kernel-patch) against

A) 32 bit XP w/ACPI
B) 64 bit SLED-10 (2.6.16 based)

And everything seems to be working great.

Note that the current git-HEAD of the userspace code seems to break pretty
badly for linux right now.  I am testing exclusively on Intel chips (5130
Woodcrest and T7600 Merom), so YMMV.  As such, these patches apply to git  
7b9ee2382b07e955cc62a564406e3d9c4a08de6c.

Any feedback at all would be appreciated (particularly news of successful
testing :).  Thanks!

Regards,
-Greg



[kvm-devel] [PATCH 1/4] KVM: Updates for compiling in-kernel APIC support with external-modules

2007-05-11 Thread Gregory Haskins
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 kernel/Kbuild |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/Kbuild b/kernel/Kbuild
index e9bcda7..103a179 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,5 +1,5 @@
 EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
-kvm-objs := kvm_main.o mmu.o x86_emulate.o
+kvm-objs := kvm_main.o mmu.o x86_emulate.o userint.o kernint.o lapic.o
 kvm-intel-objs := vmx.o vmx-debug.o
 kvm-amd-objs := svm.o




[kvm-devel] [PATCH 2/4] KVM-USER: Make the kvm_allowed flag always defined so we dont need #ifdefs

2007-05-11 Thread Gregory Haskins
Non-performance-critical code is made more awkward by always having to use
both #ifdef USE_KVM and if (kvm_allowed).  Define kvm_allowed = 0 by
default when USE_KVM is not set.  Anthony Liguori is credited with the idea.

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/qemu-kvm.c |9 -
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 212570a..d4419a3 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -3,6 +3,14 @@
 #include "config-host.h"
 
 #ifdef USE_KVM
+ #define KVM_ALLOWED_DEFAULT 1
+#else
+ #define KVM_ALLOWED_DEFAULT 0
+#endif
+
+int kvm_allowed = KVM_ALLOWED_DEFAULT;
+
+#ifdef USE_KVM
 
 #include "exec.h"
 
@@ -14,7 +22,6 @@
 
 extern void perror(const char *s);
 
-int kvm_allowed = 1;
 kvm_context_t kvm_context;
 static struct kvm_msr_list *kvm_msr_list;
 static int kvm_has_msr_star;
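
The pattern in this patch can be shown in isolation. The sketch below is
not QEMU code: pick_backend() and the backend enum are hypothetical names
invented for the example. The point is that the flag exists in every build,
so callers test it with an ordinary if and need no #ifdef of their own.

```c
#include <assert.h>

/* Build with -DUSE_KVM to make the accelerated path the default. */
#ifdef USE_KVM
# define KVM_ALLOWED_DEFAULT 1
#else
# define KVM_ALLOWED_DEFAULT 0
#endif

/* Always defined, whether or not USE_KVM is set. */
int kvm_allowed = KVM_ALLOWED_DEFAULT;

enum backend { BACKEND_EMULATION = 0, BACKEND_KVM = 1 };

/* Hypothetical caller: chooses a code path at run time, with no #ifdef. */
static enum backend pick_backend(void)
{
	assert(kvm_allowed == 0 || kvm_allowed == 1);
	if (kvm_allowed)
		return BACKEND_KVM;
	return BACKEND_EMULATION;
}
```

In a build without USE_KVM the flag is simply 0 unless something sets it,
so the KVM branch is dead code that the compiler is free to drop.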




[kvm-devel] [PATCH 3/4] KVM-USER: Add ability to specify APIC emulation type from the command-line

2007-05-11 Thread Gregory Haskins
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/qemu-kvm.c |1 +
 qemu/vl.c   |5 +
 qemu/vl.h   |1 +
 3 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index d4419a3..faa4684 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -9,6 +9,7 @@
 #endif
 
 int kvm_allowed = KVM_ALLOWED_DEFAULT;
+int kvm_apic_level = 1;
 
 #ifdef USE_KVM
 
diff --git a/qemu/vl.c b/qemu/vl.c
index 7df1c80..88e650e 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -6531,6 +6531,7 @@ enum {
 QEMU_OPTION_vnc,
 QEMU_OPTION_no_acpi,
 QEMU_OPTION_no_kvm,
+QEMU_OPTION_kvm_apic,
 QEMU_OPTION_no_reboot,
 QEMU_OPTION_daemonize,
 QEMU_OPTION_option_rom,
@@ -6600,6 +6601,7 @@ const QEMUOption qemu_options[] = {
 #endif
 #ifdef USE_KVM
 { "no-kvm", 0, QEMU_OPTION_no_kvm },
+{ "kvm_apic", HAS_ARG, QEMU_OPTION_kvm_apic },
 #endif
 #if defined(TARGET_PPC) || defined(TARGET_SPARC)
 { "g", 1, QEMU_OPTION_g },
@@ -7309,6 +7311,9 @@ int main(int argc, char **argv)
case QEMU_OPTION_no_kvm:
kvm_allowed = 0;
break;
+   case QEMU_OPTION_kvm_apic:
+   kvm_apic_level = atoi(optarg);
+   break;
 #endif
 case QEMU_OPTION_usb:
 usb_enabled = 1;
diff --git a/qemu/vl.h b/qemu/vl.h
index debd17c..dec410e 100644
--- a/qemu/vl.h
+++ b/qemu/vl.h
@@ -158,6 +158,7 @@ extern int graphic_depth;
 extern const char *keyboard_layout;
 extern int kqemu_allowed;
 extern int kvm_allowed;
+extern int kvm_apic_level;
 extern int win2k_install_hack;
 extern int usb_enabled;
 extern int smp_cpus;
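
The option handler receives its argument as a string (optarg), while
kvm_apic_level is an int, so the value has to be converted and ideally
range-checked before it is stored. A minimal sketch of such a check is
given here, assuming the accepted levels are 0 to 2 as described in the
cover letter. parse_apic_level() is a hypothetical helper, not part of the
posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse the argument of a "kvm_apic" style option.
 * Returns 0 and stores the level on success, -1 on malformed or
 * out-of-range input (in which case *level is left untouched).
 */
static int parse_apic_level(const char *arg, int *level)
{
	char *end;
	long val;

	assert(level != NULL);
	if (arg == NULL || *arg == '\0')
		return -1;

	errno = 0;
	val = strtol(arg, &end, 10);
	/* Reject overflow, trailing garbage, and anything outside 0..2. */
	if (errno != 0 || *end != '\0' || val < 0 || val > 2)
		return -1;

	*level = (int)val;
	return 0;
}
```

A caller in the option switch could then report an error and exit when the
helper returns -1, instead of silently running with an unintended level.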




[kvm-devel] [PATCH 4/4] KVM: in-kernel-apic modification to QEMU

2007-05-11 Thread Gregory Haskins
Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/hw/apic.c  |   20 +++-
 qemu/hw/pc.c|   30 +-
 qemu/qemu-kvm.c |   49 +++--
 qemu/qemu-kvm.h |2 ++
 qemu/vl.c   |2 +-
 qemu/vl.h   |2 +-
 user/kvmctl.c   |   33 -
 user/kvmctl.h   |   31 ++-
 user/main.c |2 +-
 9 files changed, 138 insertions(+), 33 deletions(-)

diff --git a/qemu/hw/apic.c b/qemu/hw/apic.c
index 0b73233..9ac9ae4 100644
--- a/qemu/hw/apic.c
+++ b/qemu/hw/apic.c
@@ -18,6 +18,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 #include "vl.h"
+#include "qemu-kvm.h"
 
 //#define DEBUG_APIC
 //#define DEBUG_IOAPIC
@@ -87,6 +88,7 @@ typedef struct APICState {
 } APICState;
 
 struct IOAPICState {
+CPUState *cpu_env;
 uint8_t id;
 uint8_t ioregsel;
 
@@ -888,10 +890,17 @@ static void ioapic_service(IOAPICState *s)
 vector = pic_read_irq(isa_pic);
 else
 vector = entry & 0xff;
-
-apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
-apic_bus_deliver(deliver_bitmask, delivery_mode, 
- vector, polarity, trig_mode);
+ 
+   if (kvm_allowed && kvm_apic_level) {
+   ext_apic_bus_deliver(dest, trig_mode, dest_mode,
+delivery_mode, vector);
+   cpu_interrupt(s->cpu_env, CPU_INTERRUPT_HARD);
+   } else {
+   apic_get_delivery_bitmask(deliver_bitmask, dest,
+ dest_mode);
+   apic_bus_deliver(deliver_bitmask, delivery_mode, 
+vector, polarity, trig_mode);
+   }
 }
 }
 }
@@ -1045,7 +1054,7 @@ static CPUWriteMemoryFunc *ioapic_mem_write[3] = {
 ioapic_mem_writel,
 };
 
-IOAPICState *ioapic_init(void)
+IOAPICState *ioapic_init(CPUState *env)
 {
 IOAPICState *s;
 int io_memory;
@@ -1054,6 +1063,7 @@ IOAPICState *ioapic_init(void)
 if (!s)
 return NULL;
 ioapic_reset(s);
+s->cpu_env = env;
 s->id = last_apic_id++;
 
 io_memory = cpu_register_io_memory(0, ioapic_mem_read, 
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index eda49cf..b033637 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -91,16 +91,19 @@ int cpu_get_pic_interrupt(CPUState *env)
 {
 int intno;
 
-intno = apic_get_interrupt(env);
-if (intno >= 0) {
-/* set irq request if a PIC irq is still pending */
-/* XXX: improve that */
-pic_update_irq(isa_pic); 
-return intno;
+if (!kvm_allowed || !kvm_apic_level) {
+   intno = apic_get_interrupt(env);
+   if (intno >= 0) {
+   /* set irq request if a PIC irq is still pending */
+   /* XXX: improve that */
+   pic_update_irq(isa_pic); 
+   return intno;
+   }
+   
+   /* read the irq from the PIC */
+   if (!apic_accept_pic_intr(env))
+   return -1;
 }
-/* read the irq from the PIC */
-if (!apic_accept_pic_intr(env))
-return -1;
 
 intno = pic_read_irq(isa_pic);
 return intno;
@@ -483,9 +486,10 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
 }
 register_savevm("cpu", i, 4, cpu_save, cpu_load, env);
 qemu_register_reset(main_cpu_reset, env);
-if (pci_enabled) {
-apic_init(env);
-}
+   if (!kvm_allowed || !kvm_apic_level)
+   if (pci_enabled) {
+   apic_init(env);
+   }
 }
 
 /* allocate RAM */
@@ -671,7 +675,7 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
 register_ioport_write(0x92, 1, 1, ioport92_write, NULL);
 
 if (pci_enabled) {
-ioapic = ioapic_init();
+ioapic = ioapic_init(env);
 }
 isa_pic = pic_init(pic_irq_request, first_cpu);
 pit = pit_init(0x40, 0);
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index faa4684..03152e1 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -235,9 +235,16 @@ static void load_regs(CPUState *env)
 sregs.cr3 = env->cr[3];
 sregs.cr4 = env->cr[4];
 
-sregs.apic_base = cpu_get_apic_base(env);
+if (!kvm_apic_level) {
+   /* These two are no longer used once the in-kernel APIC is enabled */
+   sregs.apic_base = 0;
+   sregs.cr8 = 0;
+} else {
+   sregs.apic_base = cpu_get_apic_base(env);
+   sregs.cr8 = cpu_get_apic_tpr(env);
+}
+
 sregs.efer = env->efer;
-sregs.cr8 = cpu_get_apic_tpr(env);
 
 kvm_set_sregs(kvm_context, 0, &sregs);
 
@@ -329,10 +336,12 @@ static void save_regs(CPUState *env)
 env->cr[3] = sregs.cr3;
 env->cr[4] = sregs.cr4;
 
-cpu_set_apic_base(env, 

Re: [kvm-devel] [PATCH/RFC 8/9] Virtual network host switch support

2007-05-11 Thread Christian Bornträger
On Friday 11 May 2007 22:21, Anthony Liguori wrote:
 Any feel for the performance relative to the bridging code?  The 
 bridging code is a pretty big bottleneck in guest<=>guest communications 
 in Xen at least.

Last time I checked, we had quite decent guest-to-guest performance, in the 
gigabits/sec range.
On the downside, the switch is quite aggressive about dropping packets, as the 
inbound buffer of the virtual network adapters has space for 80 packets 
(that can be changed).
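
To make the dropping behaviour concrete, here is a minimal sketch of a
fixed-size inbound ring that rejects packets once it is full. It is not the
vnet_switch.c code, which is not shown in this mail: the rx_ring layout,
the drop-newest policy and the function names are assumptions for
illustration. Only the capacity of 80 comes from the text above.

```c
#include <assert.h>
#include <stddef.h>

#define RX_SLOTS 80   /* capacity mentioned in the mail; tunable */

struct rx_ring {
	const void *slot[RX_SLOTS];
	unsigned int head;      /* index of the oldest queued packet */
	unsigned int count;     /* number of slots in use */
	unsigned long dropped;  /* packets rejected because the ring was full */
};

/* Returns 0 if queued, -1 if the ring was full and the packet was dropped. */
static int rx_enqueue(struct rx_ring *r, const void *pkt)
{
	assert(r != NULL);
	if (r->count == RX_SLOTS) {
		r->dropped++;
		return -1;
	}
	r->slot[(r->head + r->count) % RX_SLOTS] = pkt;
	r->count++;
	return 0;
}

/* Returns the oldest packet, or NULL if the ring is empty. */
static const void *rx_dequeue(struct rx_ring *r)
{
	const void *pkt;

	assert(r != NULL);
	if (r->count == 0)
		return NULL;
	pkt = r->slot[r->head];
	r->head = (r->head + 1) % RX_SLOTS;
	r->count--;
	return pkt;
}
```

With a receiver that drains more slowly than a sender fills, every packet
beyond the 80th in flight is lost, which is why a small ring looks
"aggressive" under bursty guest-to-guest traffic.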

 
  currently tested but not ready yet. We did not use the linux bridging code to
  allow non-root users to create virtual networks between guests. 

 
 Is that the primary reason?  If so, that seems like a rather large 
 hammer for something that a userspace suid wrapper could have addressed...

Actually there are several reasons why we did not use the bridging code:

- A lot of OSA network cards do not support promiscuous mode. Many OSA cards 
are also in layer 3 mode (we get IP packets and no ethernet frames), so 
bridging won't work to the host interface.
- non-root switches
- the performance of bridging (we copy directly from one guest buffer to 
another without allocating an skb on the host)
- we considered hooking into the qeth driver (for OSA cards) to deal with 
layer3 mode.

The first shot was actually a point-to-point driver (guest netif <--> host 
netif). We added the switch at a later time. 

Hmm, if we can make bridging work (with decent performance) on s390, that 
would reduce the maintenance work for us, as this network switch is far from 
being complete. 

cheers

Christian



[kvm-devel] [PATCH 4/4] KVM: in-kernel-apic modification to QEMU

2007-05-11 Thread Gregory Haskins
This has the latest feedback from Anthony incorporated

Signed-off-by: Gregory Haskins [EMAIL PROTECTED]
---

 qemu/hw/apic.c  |   20 +++-
 qemu/hw/pc.c|   29 -
 qemu/qemu-kvm.c |   49 +++--
 qemu/qemu-kvm.h |2 ++
 qemu/vl.c   |2 +-
 qemu/vl.h   |7 ++-
 user/kvmctl.c   |   33 -
 user/kvmctl.h   |   31 ++-
 user/main.c |2 +-
 9 files changed, 142 insertions(+), 33 deletions(-)

diff --git a/qemu/hw/apic.c b/qemu/hw/apic.c
index 0b73233..5665057 100644
--- a/qemu/hw/apic.c
+++ b/qemu/hw/apic.c
@@ -18,6 +18,7 @@
  * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
  */
 #include "vl.h"
+#include "qemu-kvm.h"
 
 //#define DEBUG_APIC
 //#define DEBUG_IOAPIC
@@ -87,6 +88,7 @@ typedef struct APICState {
 } APICState;
 
 struct IOAPICState {
+CPUState *cpu_env;
 uint8_t id;
 uint8_t ioregsel;
 
@@ -888,10 +890,17 @@ static void ioapic_service(IOAPICState *s)
 vector = pic_read_irq(isa_pic);
 else
 vector = entry & 0xff;
-
-apic_get_delivery_bitmask(deliver_bitmask, dest, dest_mode);
-apic_bus_deliver(deliver_bitmask, delivery_mode, 
- vector, polarity, trig_mode);
+ 
+   if (kvm_allowed && kvm_apic_level) {
+   ext_apic_bus_deliver(dest, trig_mode, dest_mode,
+delivery_mode, vector);
+   cpu_interrupt(s->cpu_env, CPU_INTERRUPT_HARD);
+   } else {
+   apic_get_delivery_bitmask(deliver_bitmask, dest,
+ dest_mode);
+   apic_bus_deliver(deliver_bitmask, delivery_mode, 
+vector, polarity, trig_mode);
+   }
 }
 }
 }
@@ -1045,7 +1054,7 @@ static CPUWriteMemoryFunc *ioapic_mem_write[3] = {
 ioapic_mem_writel,
 };
 
-IOAPICState *ioapic_init(void)
+IOAPICState *ioapic_init(CPUState *env)
 {
 IOAPICState *s;
 int io_memory;
@@ -1054,6 +1063,7 @@ IOAPICState *ioapic_init(void)
 if (!s)
 return NULL;
 ioapic_reset(s);
+s->cpu_env = env;
 s->id = last_apic_id++;
 
 io_memory = cpu_register_io_memory(0, ioapic_mem_read, 
diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
index eda49cf..618cc32 100644
--- a/qemu/hw/pc.c
+++ b/qemu/hw/pc.c
@@ -91,16 +91,19 @@ int cpu_get_pic_interrupt(CPUState *env)
 {
 int intno;
 
-intno = apic_get_interrupt(env);
-if (intno >= 0) {
-/* set irq request if a PIC irq is still pending */
-/* XXX: improve that */
-pic_update_irq(isa_pic); 
-return intno;
+if (!use_kernel_apic()) {
+   intno = apic_get_interrupt(env);
+   if (intno >= 0) {
+   /* set irq request if a PIC irq is still pending */
+   /* XXX: improve that */
+   pic_update_irq(isa_pic); 
+   return intno;
+   }
+   
+   /* read the irq from the PIC */
+   if (!apic_accept_pic_intr(env))
+   return -1;
 }
-/* read the irq from the PIC */
-if (!apic_accept_pic_intr(env))
-return -1;
 
 intno = pic_read_irq(isa_pic);
 return intno;
@@ -483,9 +486,9 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
 }
 register_savevm("cpu", i, 4, cpu_save, cpu_load, env);
 qemu_register_reset(main_cpu_reset, env);
-if (pci_enabled) {
-apic_init(env);
-}
+   if (!use_kernel_apic() && pci_enabled) {
+   apic_init(env);
+   }
 }
 
 /* allocate RAM */
@@ -671,7 +674,7 @@ static void pc_init1(int ram_size, int vga_ram_size, int boot_device,
 register_ioport_write(0x92, 1, 1, ioport92_write, NULL);
 
 if (pci_enabled) {
-ioapic = ioapic_init();
+ioapic = ioapic_init(env);
 }
 isa_pic = pic_init(pic_irq_request, first_cpu);
 pit = pit_init(0x40, 0);
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index faa4684..59e79bf 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -235,9 +235,16 @@ static void load_regs(CPUState *env)
 sregs.cr3 = env->cr[3];
 sregs.cr4 = env->cr[4];
 
-sregs.apic_base = cpu_get_apic_base(env);
+if (!kvm_apic_level) {
+   /* These two are no longer used once the in-kernel APIC is enabled */
+   sregs.apic_base = 0;
+   sregs.cr8 = 0;
+} else {
+   sregs.apic_base = cpu_get_apic_base(env);
+   sregs.cr8 = cpu_get_apic_tpr(env);
+}
+
 sregs.efer = env->efer;
-sregs.cr8 = cpu_get_apic_tpr(env);
 
 kvm_set_sregs(kvm_context, 0, &sregs);
 
@@ -329,10 +336,12 @@ static void save_regs(CPUState *env)
 env->cr[3] = sregs.cr3;
 env->cr[4] = sregs.cr4;
 
-cpu_set_apic_base(env, 

Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device driver

2007-05-11 Thread Anthony Liguori
Eric Van Hensbergen wrote:
 On 5/11/07, Anthony Liguori [EMAIL PROTECTED] wrote:
   
 cpu% ls /net/ether0
 /net/ether0/0
 /net/ether0/1
 /net/ether0/2
 /net/ether0/addr
 /net/ether0/clone
 /net/ether0/ifstats
 /net/ether0/stats

   
 This smells a bit like XenStore which I think most will agree was an
 unmitigated disaster.

 

 I'd have to disagree with you Anthony.  The Plan 9 interfaces are
 simple and built into the kernel - they don't have the
 multi-layered-stack-python-xmlrpc garbage that made up the Xen
 interfaces.
   

My point isn't that 9p is just like XenStore but rather that turning 
this idea into something that is useful and elegant is non-trivial.

 If it were just console access, I would agree with you, but it's really
 about implementing a single solution for all drivers you are accessing
 across the interface.  A single client versus dozens of different
 driver variants.

There's definitely a conversation to have here.  There are going to be a 
lot of small devices that would benefit from a common transport 
mechanism.  Someone mentioned a PV entropy device on LKML.  A 
host<=>guest filesystem is another consumer of such an interface.

I'm inclined to think though that the abstraction point should be the 
transport and not the actual protocol.  My concern with standardizing on 
a protocol like 9p would be that one would lose some potential 
optimizations (like passing PFNs directly between guest and host).
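
One way to picture "abstract the transport, not the protocol" is an
interface that only moves opaque bytes, with 9p or any other protocol
layered on top by whoever needs it. The sketch below is purely
illustrative: pv_transport, the loopback backend and all function names are
hypothetical, and a real implementation would sit on a shared-memory ring
rather than a private buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical transport interface: moves opaque bytes, knows no protocol. */
struct pv_transport {
	int (*send)(struct pv_transport *t, const void *buf, size_t len);
	int (*recv)(struct pv_transport *t, void *buf, size_t len);
};

/* Toy loopback backend standing in for a shared-memory channel. */
struct loopback {
	struct pv_transport ops;   /* must stay first: the casts below rely on it */
	unsigned char data[256];
	size_t len;
};

/* Copy one message into the backend; returns bytes stored or -1 if too big. */
static int loop_send(struct pv_transport *t, const void *buf, size_t len)
{
	struct loopback *l = (struct loopback *)t;

	if (len > sizeof(l->data))
		return -1;
	memcpy(l->data, buf, len);
	l->len = len;
	return (int)len;
}

/* Hand back the stored message (truncated to len) and mark it consumed. */
static int loop_recv(struct pv_transport *t, void *buf, size_t len)
{
	struct loopback *l = (struct loopback *)t;
	size_t n = l->len < len ? l->len : len;

	memcpy(buf, l->data, n);
	l->len = 0;
	return (int)n;
}

static void loopback_init(struct loopback *l)
{
	assert(l != NULL);
	l->ops.send = loop_send;
	l->ops.recv = loop_recv;
	l->len = 0;
}
```

A 9p client, an entropy device or a console driver would each take a
struct pv_transport pointer and never see how the bytes travel, which
leaves room for a backend that passes page frame numbers instead of
copying.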

   Our existing 9p client for mini-os is ~3000 LOC and
 it is a pretty naive port from the p9p code base so it could probably
 be reduced even further.  It is a very small percentage of our
 existing mini-os kernels and gives us console, disk, network, IP
 stack, file system, and control interfaces.  Of course Linux clients
 could just use v9fs with a hypervisor-shared-memory transport which I
 haven't merged yet.  We'll also be using the same set of interfaces
 for the simulator shortly.
   

So is there any reason to even tie 9p to KVM?  Why not just have a 
common PV transport that 9p can use?  For certain things, it may make 
sense (like v9fs).

Regards,

Anthony Liguori

 Oh yeah, and don't forget the fact that resource access can bridge
 seamlessly over any network and the protocol has provisions to be
 secured with authentication/encryption/digesting if desired.

 Los Alamos will be presenting 9p based control interfaces for KVM at OLS.

 -eric


   

