Re: [PATCH/RFC] kvm: fix refcounting race release vs. module unload

2008-11-24 Thread Christian Borntraeger
> > --- kvm.orig/virt/kvm/kvm_main.c
> > +++ kvm/virt/kvm/kvm_main.c
> > @@ -1303,7 +1303,7 @@ static int kvm_vcpu_release(struct inode
> >     return 0;
> >  }
> >
> > -static const struct file_operations kvm_vcpu_fops = {
> > +static struct file_operations kvm_vcpu_fops = {
> >     .release        = kvm_vcpu_release,
> >     .unlocked_ioctl = kvm_vcpu_ioctl,
> >     .compat_ioctl   = kvm_vcpu_ioctl,
> > @@ -1318,6 +1318,7 @@ static int create_vcpu_fd(struct kvm_vcp
> >     int fd = anon_inode_getfd("kvm-vcpu", &kvm_vcpu_fops, vcpu, 0);
> >     if (fd < 0)
> >         kvm_put_kvm(vcpu->kvm);
> > +   __module_get(kvm_vcpu_fops.owner);
> >     return fd;
> >  }
> >
> > @@ -2061,6 +2062,7 @@ int kvm_init(void *opaque, unsigned int
> >     }
> >
> >     kvm_chardev_ops.owner = module;
> > +   kvm_vcpu_fops.owner = module;
> >
> >     r = misc_register(&kvm_dev);
> >     if (r) {
>
>
> Messing with module counts is slightly ugly. How about having a vm fd
> fget() the /dev/kvm fd() instead?

I personally find fget (and fput) slightly more ugly than handling the module 
reference count. Especially if the problem is module unloading...the module 
refcount looks so natural.

I am also a bit worried by fget/fput, since we would call fput in the release
function, which is part of the module. Wouldn't that open another very small
race?

In addition, we would need variables containing the fd and the file pointer
for /dev/kvm, since fget/fput need some parameters, no? (Is there an easy way
to get the fd from the struct file *filp? Searching current->files seems to
be the only method I know.)

To me, the fget approach looks more complicated and less safe.
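
As a rough illustration of the refcount argument above, here is a minimal
userspace model, not kernel code. The names toy_module, toy_module_get,
toy_module_put and toy_module_unload are hypothetical stand-ins for the module
use count, __module_get(), the matching put, and rmmod. It only shows the
invariant the patch relies on: the module cannot go away while an fd still
holds a reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel module: only the use count matters. */
struct toy_module {
    int  refcount;  /* outstanding users (open vcpu fds in this model) */
    bool loaded;
};

/* Models __module_get(): taken when a new vcpu fd is handed out. */
static void toy_module_get(struct toy_module *m)
{
    assert(m->loaded);
    m->refcount++;
}

/* Models the matching put, dropped once the release callback has returned. */
static void toy_module_put(struct toy_module *m)
{
    assert(m->refcount > 0);
    m->refcount--;
}

/* Models rmmod: refuses to unload while any fd still pins the module. */
static bool toy_module_unload(struct toy_module *m)
{
    if (m->refcount > 0)
        return false;
    m->loaded = false;
    return true;
}
```

In this model the put happens outside the module's own code, which is the
property that avoids running module text after the last reference is gone.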

Christian
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: VMX: Fix race between pending IRQ and NMI

2008-11-24 Thread Jan Kiszka
Avi Kivity wrote:
> Jan Kiszka wrote:
>>> But I think I see a bigger issue - if we inject a regular interrupt
>>> while another is pending, then we will encounter this problem.  Looks
>>> like we have to enable the interrupt window after injecting an interrupt
>>> if there are still pending interrupts.
>>>
>>
>> Yeah, probably. I'm just wondering now if we can set
>> exit-on-interrupt-window while the vcpu state is interruptible (i.e.
>> _before_ the injection). There is some entry check like this for NMIs,
>> but maybe not for interrupts. Need to check.
>>
>
> Turns out it's not necessary, since the guest eoi will cause an exit and
> allow the code to request an interrupt window.

But you added explicit handling now nevertheless?

>
> I've added an apic test program so we can track these issues
> (user/test/x86/apic.c).
>

That's good. BTW, your NMI race fix is still lacking support for the
-no-kvm-irqchip case. I will post a corresponding patch later today.

Jan

-- 
Siemens AG, Corporate Technology, CT SE 2 ES-OS
Corporate Competence Center Embedded Linux


[PATCH] kvm-testsuite: Fix halt callback

2008-11-24 Thread Jan Kiszka
Change halt callback in testsuite to conform with latest refactorings.

Signed-off-by: Jan Kiszka [EMAIL PROTECTED]
---

 user/main.c |5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/user/main.c b/user/main.c
index a00b073..55639b5 100644
--- a/user/main.c
+++ b/user/main.c
@@ -304,13 +304,12 @@ static int test_debug(void *opaque, void *vcpu)
    return 0;
 }
 
-static int test_halt(void *opaque, void *_vcpu)
+static int test_halt(void *opaque, int vcpu)
 {
-   struct vcpu_info *vcpu = _vcpu;
    int n;
 
    sigwait(&ipi_sigmask, &n);
-   kvm_inject_irq(kvm, vcpu->id, apic_ipi_vector);
+   kvm_inject_irq(kvm, vcpus[vcpu].id, apic_ipi_vector);
    return 0;
 }
 
-- 
Siemens AG, Corporate Technology, CT SE 2 ES-OS
Corporate Competence Center Embedded Linux


[PATCH] KVM: x86: Cleanup user space NMI injection

2008-11-24 Thread Jan Kiszka
There is no point in doing the ready_for_nmi_injection/
request_nmi_window dance with user space. First, we don't do this for
in-kernel irqchip anyway, while the code path is the same as for user
space irqchip mode. And second, there is nothing to lose if a pending
NMI is overwritten by another one (in contrast to IRQs where we have to
save the number). Actually, there is even the risk of raising spurious
NMIs this way because the reason for the held-back NMI might already be
handled while processing the first one.
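
The contrast drawn above between NMIs and IRQs can be sketched in a few lines.
This is a toy model under stated assumptions, not the KVM data structures:
toy_vcpu, toy_queue_nmi, toy_queue_irq and toy_count_irqs are hypothetical
names, and only vectors 0..63 are modelled. An NMI carries no vector, so a
single flag suffices and a second NMI collapses into the first, whereas every
pending IRQ vector has to be remembered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of per-vcpu pending event state. */
struct toy_vcpu {
    bool     nmi_pending;  /* an NMI has no vector: one flag is enough */
    uint64_t irq_pending;  /* one bit per vector (only 0..63 modelled) */
};

/* Queuing an NMI twice is indistinguishable from queuing it once. */
static void toy_queue_nmi(struct toy_vcpu *v)
{
    v->nmi_pending = true;
}

/* Each IRQ vector must be recorded, otherwise the number is lost. */
static void toy_queue_irq(struct toy_vcpu *v, unsigned vector)
{
    assert(vector < 64);
    v->irq_pending |= UINT64_C(1) << vector;
}

/* Counts distinct pending vectors by clearing the lowest set bit each round. */
static int toy_count_irqs(const struct toy_vcpu *v)
{
    int n = 0;
    for (uint64_t bits = v->irq_pending; bits != 0; bits &= bits - 1)
        n++;
    return n;
}
```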

[ Avi, how to deal with the fields in struct kvm_run and the exit
reason? They are not mainline yet, neither in linux nor in qemu, and I
don't think they should ever be pushed in their current form. Simply
revert them? ]

Signed-off-by: Jan Kiszka [EMAIL PROTECTED]
---

 arch/x86/kvm/vmx.c  |   24 ++--
 arch/x86/kvm/x86.c  |   34 --
 include/linux/kvm.h |6 +++---
 3 files changed, 13 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 775a140..6fbff55 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2498,15 +2498,13 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
    }
    if (vcpu->arch.nmi_injected) {
        vmx_inject_nmi(vcpu);
-       if (vcpu->arch.nmi_pending || kvm_run->request_nmi_window)
+       if (vcpu->arch.nmi_pending)
            enable_nmi_window(vcpu);
        else if (vcpu->arch.irq_summary
             || kvm_run->request_interrupt_window)
            enable_irq_window(vcpu);
        return;
    }
-   if (!vcpu->arch.nmi_window_open || kvm_run->request_nmi_window)
-       enable_nmi_window(vcpu);
 
    if (vcpu->arch.interrupt_window_open) {
        if (vcpu->arch.irq_summary && !vcpu->arch.interrupt.pending)
@@ -3040,14 +3038,6 @@ static int handle_nmi_window(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
    vmcs_write32(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control);
    ++vcpu->stat.nmi_window_exits;
 
-   /*
-    * If the user space waits to inject a NMI, exit as soon as possible
-    */
-   if (kvm_run->request_nmi_window && !vcpu->arch.nmi_pending) {
-       kvm_run->exit_reason = KVM_EXIT_NMI_WINDOW_OPEN;
-       return 0;
-   }
-
    return 1;
 }
 
@@ -3162,7 +3152,7 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
            vmx->soft_vnmi_blocked = 0;
            vcpu->arch.nmi_window_open = 1;
        } else if (vmx->vnmi_blocked_time > 10LL &&
-           (kvm_run->request_nmi_window || vcpu->arch.nmi_pending)) {
+          vcpu->arch.nmi_pending) {
            /*
             * This CPU don't support us in finding the end of an
             * NMI-blocked window if the guest runs with IRQs
@@ -3175,16 +3165,6 @@ static int kvm_handle_exit(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
            vmx->soft_vnmi_blocked = 0;
            vmx->vcpu.arch.nmi_window_open = 1;
        }
-
-       /*
-        * If the user space waits to inject an NNI, exit ASAP
-        */
-       if (vcpu->arch.nmi_window_open && kvm_run->request_nmi_window
-           && !vcpu->arch.nmi_pending) {
-           kvm_run->exit_reason = KVM_EXIT_NMI_WINDOW_OPEN;
-           ++vcpu->stat.nmi_window_exits;
-           return 0;
-       }
    }
 
    if (exit_reason < kvm_vmx_max_exit_handlers
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7a2aeba..a5da129 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2885,37 +2885,18 @@ static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu,
        (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_IF));
 }
 
-/*
- * Check if userspace requested a NMI window, and that the NMI window
- * is open.
- *
- * No need to exit to userspace if we already have a NMI queued.
- */
-static int dm_request_for_nmi_injection(struct kvm_vcpu *vcpu,
-                   struct kvm_run *kvm_run)
-{
-   return (!vcpu->arch.nmi_pending &&
-       kvm_run->request_nmi_window &&
-       vcpu->arch.nmi_window_open);
-}
-
 static void post_kvm_run_save(struct kvm_vcpu *vcpu,
                  struct kvm_run *kvm_run)
 {
    kvm_run->if_flag = (kvm_x86_ops->get_rflags(vcpu) & X86_EFLAGS_IF) != 0;
    kvm_run->cr8 = kvm_get_cr8(vcpu);
    kvm_run->apic_base = kvm_get_apic_base(vcpu);
-   if (irqchip_in_kernel(vcpu->kvm)) {
+   if (irqchip_in_kernel(vcpu->kvm))
        kvm_run->ready_for_interrupt_injection = 1;
-       kvm_run->ready_for_nmi_injection = 1;
-   } else {
+   else
        kvm_run->ready_for_interrupt_injection =

[PATCH] KVM: VMX: Fix pending NMI-vs.-IRQ race for user space irqchip

2008-11-24 Thread Jan Kiszka
Push b55a50582030cf294a675492d7ab2e235b965cc8 and
d3a2c20c9b850d92dae383fd6a64840de2687cd6 also to the user space irqchip
path.

Signed-off-by: Jan Kiszka [EMAIL PROTECTED]
---

 arch/x86/kvm/vmx.c |4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 7ea4855..775a140 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2486,7 +2486,9 @@ static void do_interrupt_requests(struct kvm_vcpu *vcpu,
    vmx_update_window_states(vcpu);
 
    if (vcpu->arch.nmi_pending && !vcpu->arch.nmi_injected) {
-       if (vcpu->arch.nmi_window_open) {
+       if (vcpu->arch.interrupt.pending) {
+           enable_nmi_window(vcpu);
+       } else if (vcpu->arch.nmi_window_open) {
            vcpu->arch.nmi_pending = false;
            vcpu->arch.nmi_injected = true;
        } else {


[PATCH 1/5] kvm: Replace force type convert with container_of()

2008-11-24 Thread Sheng Yang

Signed-off-by: Sheng Yang [EMAIL PROTECTED]
---
 qemu/hw/device-assignment.c |   20 
 1 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 9a790c6..786b2f0 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -144,7 +144,7 @@ static uint32_t assigned_dev_ioport_readl(void *opaque, uint32_t addr)
 static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num,
                                    uint32_t e_phys, uint32_t e_size, int type)
 {
-    AssignedDevice *r_dev = (AssignedDevice *) pci_dev;
+    AssignedDevice *r_dev = container_of(pci_dev, AssignedDevice, dev);
     AssignedDevRegion *region = &r_dev->v_addrs[region_num];
     uint32_t old_ephys = region->e_physbase;
     uint32_t old_esize = region->e_size;
@@ -172,7 +172,7 @@ static void assigned_dev_iomem_map(PCIDevice *pci_dev, int region_num,
 static void assigned_dev_ioport_map(PCIDevice *pci_dev, int region_num,
                                     uint32_t addr, uint32_t size, int type)
 {
-    AssignedDevice *r_dev = (AssignedDevice *) pci_dev;
+    AssignedDevice *r_dev = container_of(pci_dev, AssignedDevice, dev);
     AssignedDevRegion *region = &r_dev->v_addrs[region_num];
     int first_map = (region->e_size == 0);
     CPUState *env;
@@ -221,6 +221,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
 {
     int fd;
     ssize_t ret;
+    AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
 
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
@@ -242,7 +243,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
           ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
           (uint16_t) address, val, len);
 
-    fd = ((AssignedDevice *)d)->real_device.config_fd;
+    fd = pci_dev->real_device.config_fd;
 
 again:
     ret = pwrite(fd, &val, len, address);
@@ -263,6 +264,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     uint32_t val = 0;
     int fd;
     ssize_t ret;
+    AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
 
     if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
         address == 0x3c || address == 0x3d) {
@@ -276,7 +278,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     if (address == 0xFC)
         goto do_log;
 
-    fd = ((AssignedDevice *)d)->real_device.config_fd;
+    fd = pci_dev->real_device.config_fd;
 
 again:
     ret = pread(fd, &val, len, address);
@@ -489,16 +491,18 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus)
 {
     int r;
     AssignedDevice *dev;
+    PCIDevice *pci_dev;
     uint8_t e_device, e_intx;
     struct kvm_assigned_pci_dev assigned_dev_data;
 
     DEBUG("Registering real physical device %s (bus=%x dev=%x func=%x)\n",
           adev->name, adev->bus, adev->dev, adev->func);
 
-    dev = (AssignedDevice *)
-        pci_register_device(bus, adev->name, sizeof(AssignedDevice),
-                            -1, assigned_dev_pci_read_config,
-                            assigned_dev_pci_write_config);
+    pci_dev = pci_register_device(bus, adev->name,
+          sizeof(AssignedDevice), -1, assigned_dev_pci_read_config,
+          assigned_dev_pci_write_config);
+    dev = container_of(pci_dev, AssignedDevice, dev);
+
     if (NULL == dev) {
         fprintf(stderr, "%s: Error: Couldn't register real device %s\n",
                 __func__, adev->name);
-- 
1.5.4.5
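
For readers unfamiliar with the idiom this patch switches to, here is a small
self-contained sketch of container_of() in plain C. The struct and function
names (toy_pci_device, toy_assigned_device, toy_from_pci, toy_config_fd) are
hypothetical and are not the QEMU types. The embedded member is deliberately
not the first field, which is the case where a plain cast would silently give
the wrong address. The sketch also checks the inner pointer for NULL before
converting, since a NULL inner pointer does not in general map to a NULL outer
pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Recover the enclosing struct from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_pci_device {
    int devfn;
};

struct toy_assigned_device {
    int config_fd;
    struct toy_pci_device dev;  /* embedded, deliberately not first */
};

/* NULL-safe conversion: test the inner pointer before doing arithmetic. */
static struct toy_assigned_device *toy_from_pci(struct toy_pci_device *d)
{
    if (d == NULL)
        return NULL;
    return container_of(d, struct toy_assigned_device, dev);
}

/* Typical use: reach a field of the outer struct from the inner pointer. */
static int toy_config_fd(struct toy_pci_device *d)
{
    assert(d != NULL);
    return toy_from_pci(d)->config_fd;
}
```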



[PATCH 4/5] Support for device capability

2008-11-24 Thread Sheng Yang
This framework can be easily extended to support device capability, like
MSI/MSI-x.

Signed-off-by: Sheng Yang [EMAIL PROTECTED]
---
 qemu/hw/pci.c |   85 +
 qemu/hw/pci.h |   30 
 2 files changed, 115 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index 75bc9a9..73f73da 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -339,11 +339,65 @@ static void pci_update_mappings(PCIDevice *d)
     }
 }
 
+int pci_access_cap_config(PCIDevice *pci_dev, uint32_t address, int len)
+{
+    if (pci_dev->cap.supported && address >= pci_dev->cap.start &&
+            (address + len) < pci_dev->cap.start + pci_dev->cap.length)
+        return 1;
+    return 0;
+}
+
+uint32_t pci_default_cap_read_config(PCIDevice *pci_dev,
+                                     uint32_t address, int len)
+{
+    uint32_t val = 0;
+
+    if (pci_access_cap_config(pci_dev, address, len)) {
+        switch(len) {
+        default:
+        case 4:
+            if (address < pci_dev->cap.start + pci_dev->cap.length - 4) {
+                val = le32_to_cpu(*(uint32_t *)(pci_dev->cap.config
+                            + address - pci_dev->cap.start));
+                break;
+            }
+            /* fall through */
+        case 2:
+            if (address < pci_dev->cap.start + pci_dev->cap.length - 2) {
+                val = le16_to_cpu(*(uint16_t *)(pci_dev->cap.config
+                            + address - pci_dev->cap.start));
+                break;
+            }
+            /* fall through */
+        case 1:
+            val = pci_dev->cap.config[address - pci_dev->cap.start];
+            break;
+        }
+    }
+    return val;
+}
+
+void pci_default_cap_write_config(PCIDevice *pci_dev,
+                                  uint32_t address, uint32_t val, int len)
+{
+    if (pci_access_cap_config(pci_dev, address, len)) {
+        int i;
+        for (i = 0; i < len; i++) {
+            pci_dev->cap.config[address + i - pci_dev->cap.start] = val;
+            val >>= 8;
+        }
+        return;
+    }
+}
+
 uint32_t pci_default_read_config(PCIDevice *d,
                                  uint32_t address, int len)
 {
     uint32_t val;
 
+    if (pci_access_cap_config(d, address, len))
+        return d->cap.config_read(d, address, len);
+
     switch(len) {
     default:
     case 4:
@@ -397,6 +451,11 @@ void pci_default_write_config(PCIDevice *d,
         return;
     }
  default_config:
+    if (pci_access_cap_config(d, address, len)) {
+        d->cap.config_write(d, address, val, len);
+        return;
+    }
+
     /* not efficient, but simple */
     addr = address;
     for(i = 0; i < len; i++) {
@@ -802,3 +861,29 @@ PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint32_t id,
     s->bus = pci_register_secondary_bus(&s->dev, map_irq);
     return s->bus;
 }
+
+void pci_enable_capability_support(PCIDevice *pci_dev,
+                                   uint32_t config_start,
+                                   PCICapConfigReadFunc *config_read,
+                                   PCICapConfigWriteFunc *config_write,
+                                   PCICapConfigInitFunc *config_init)
+{
+    if (!pci_dev)
+        return;
+
+    if (config_start >= 0x40 && config_start < 0xff)
+        pci_dev->cap.start = config_start;
+    else
+        pci_dev->cap.start = PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR;
+    if (config_read)
+        pci_dev->cap.config_read = config_read;
+    else
+        pci_dev->cap.config_read = pci_default_cap_read_config;
+    if (config_write)
+        pci_dev->cap.config_write = config_write;
+    else
+        pci_dev->cap.config_write = pci_default_cap_write_config;
+    pci_dev->cap.supported = 1;
+    pci_dev->config[0x34] = pci_dev->cap.start;
+    config_init(pci_dev);
+}
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index e11fbbf..86b4ae5 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -19,6 +19,12 @@ typedef void PCIMapIORegionFunc(PCIDevice *pci_dev, int region_num,
                                 uint32_t addr, uint32_t size, int type);
 typedef int PCIUnregisterFunc(PCIDevice *pci_dev);
 
+typedef void PCICapConfigWriteFunc(PCIDevice *pci_dev,
+                                   uint32_t address, uint32_t val, int len);
+typedef uint32_t PCICapConfigReadFunc(PCIDevice *pci_dev,
+                                      uint32_t address, int len);
+typedef void PCICapConfigInitFunc(PCIDevice *pci_dev);
+
 #define PCI_ADDRESS_SPACE_MEM  0x00
 #define PCI_ADDRESS_SPACE_IO   0x01
 #define PCI_ADDRESS_SPACE_MEM_PREFETCH 0x08
@@ -46,6 +52,10 @@ typedef struct PCIIORegion {
 #define PCI_MIN_GNT     0x3e    /* 8 bits */
 #define PCI_MAX_LAT     0x3f    /* 8 bits */
 
+#define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
+#define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
+#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
+
 struct PCIDevice {
     /* PCI config space */
     uint8_t config[256];
@@ -68,6 +78,15 @@ 
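
The capability window handling in this patch (a range check, then byte-wise
little-endian access into a side buffer) can be tried out in isolation with
the sketch below. It is a simplified model, not the QEMU code: toy_cap,
toy_cap_contains, toy_cap_write and toy_cap_read are hypothetical names, and
the range check uses an inclusive upper bound, which is an assumption on my
part about the intended behaviour (the posted patch compares with a strict
less-than).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a capability window inside PCI config space. */
struct toy_cap {
    uint32_t start;         /* first config-space offset of the window */
    uint32_t length;        /* number of valid bytes in the window */
    uint8_t  config[0x60];  /* backing store, indexed from 'start' */
};

/* Returns 1 when [address, address + len) lies entirely inside the window. */
static int toy_cap_contains(const struct toy_cap *c, uint32_t address, int len)
{
    return len > 0 && address >= c->start &&
           address + (uint32_t)len <= c->start + c->length;
}

/* Stores 'val' least significant byte first, as PCI config space expects. */
static void toy_cap_write(struct toy_cap *c, uint32_t address,
                          uint32_t val, int len)
{
    assert(c->length <= sizeof c->config);
    if (!toy_cap_contains(c, address, len))
        return;
    for (int i = 0; i < len; i++) {
        c->config[address - c->start + i] = (uint8_t)val;
        val >>= 8;
    }
}

/* Reassembles a little-endian value; reads outside the window return 0. */
static uint32_t toy_cap_read(const struct toy_cap *c, uint32_t address, int len)
{
    uint32_t val = 0;

    if (!toy_cap_contains(c, address, len))
        return 0;
    for (int i = len - 1; i >= 0; i--)
        val = (val << 8) | c->config[address - c->start + i];
    return val;
}
```

Reading byte by byte avoids the unaligned pointer casts and the
le32_to_cpu()/le16_to_cpu() conversions, at the cost of a short loop.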

[PATCH 0/5][v2] Userspace for MSI support of KVM

2008-11-24 Thread Sheng Yang
Hi Avi & Anthony,

Here is the userspace for MSI support of KVM.

Main changes from v1:
Make device assignment depend on libpci.
Move the capability framework to pci.c (this patch may be acceptable to QEMU).

Thanks!
--
regards
Yang, Sheng


[PATCH 2/5] Make device assignment depend on libpci

2008-11-24 Thread Sheng Yang
Which is used later for capability detection.

Signed-off-by: Sheng Yang [EMAIL PROTECTED]
---
 qemu/Makefile.target |1 +
 qemu/configure   |   20 
 2 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/qemu/Makefile.target b/qemu/Makefile.target
index 05ace8e..59653ba 100644
--- a/qemu/Makefile.target
+++ b/qemu/Makefile.target
@@ -735,6 +735,7 @@ OBJS += device-hotplug.o
 
 ifeq ($(USE_KVM_DEVICE_ASSIGNMENT), 1)
 OBJS+= device-assignment.o
+LIBS+=-lpci
 endif
 
 ifeq ($(TARGET_BASE_ARCH), i386)
diff --git a/qemu/configure b/qemu/configure
index 18ef980..bdde5ed 100755
--- a/qemu/configure
+++ b/qemu/configure
@@ -808,6 +808,26 @@ EOF
 fi
 fi
 
+# libpci probe for kvm_cap_device_assignment
+if test $kvm_cap_device_assignment = yes ; then
+cat > $TMPC << EOF
+#include <pci/pci.h>
+#ifndef PCI_VENDOR_ID
+#error NO LIBPCI
+#endif
+int main(void) { return 0; }
+EOF
+    if $cc $ARCH_CFLAGS -o $TMPE ${OS_CFLAGS} $TMPC 2>/dev/null ; then
+        :
+    else
+        echo
+        echo "Error: libpci check failed"
+        echo "Disable KVM Device Assignment capability."
+        echo
+        kvm_cap_device_assignment=no
+    fi
+fi
+
+
 ##
 # zlib check
 
-- 
1.5.4.5



[PATCH 5/5] kvm: expose MSI capability to guest

2008-11-24 Thread Sheng Yang

Signed-off-by: Sheng Yang [EMAIL PROTECTED]
---
 qemu/hw/device-assignment.c |   90 +++---
 qemu/hw/device-assignment.h |2 +
 2 files changed, 85 insertions(+), 7 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index d3105bc..67bd6b3 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -262,7 +262,8 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
     }
 
     if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
-        address == 0x3c || address == 0x3d) {
+        address == 0x3c || address == 0x3d ||
+        pci_access_cap_config(d, address, len)) {
         /* used for update-mappings (BAR emulation) */
         pci_default_write_config(d, address, val, len);
         return;
@@ -296,7 +297,8 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
 
     if ((address >= 0x10 && address <= 0x24) || address == 0x34 ||
-        address == 0x3c || address == 0x3d) {
+        address == 0x3c || address == 0x3d ||
+        pci_access_cap_config(d, address, len)) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
@@ -325,11 +327,13 @@ do_log:
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
 
-    /* kill the special capabilities */
-    if (address == 4 && len == 4)
-        val &= ~0x10;
-    else if (address == 6)
-        val &= ~0x10;
+    if (!pci_dev->cap.available) {
+        /* kill the special capabilities */
+        if (address == 4 && len == 4)
+            val &= ~0x10;
+        else if (address == 6)
+            val &= ~0x10;
+    }
 
     return val;
 }
@@ -537,6 +541,73 @@ void assigned_dev_update_irq(PCIDevice *d)
     }
 }
 
+#ifdef KVM_CAP_DEVICE_MSI
+static void assigned_dev_enable_msi(PCIDevice *pci_dev)
+{
+    int r;
+    struct kvm_assigned_irq assigned_irq_data;
+    AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev);
+
+    memset(&assigned_irq_data, 0, sizeof assigned_irq_data);
+    assigned_irq_data.assigned_dev_id  =
+        calc_assigned_dev_id(assigned_dev->h_busnr,
+                (uint8_t)assigned_dev->h_devfn);
+    assigned_irq_data.guest_msi.addr_lo = *(uint32_t *)
+                                          (pci_dev->cap.config + 4);
+    assigned_irq_data.guest_msi.data = *(uint16_t *)
+                                       (pci_dev->cap.config + 8);
+    assigned_irq_data.flags |= KVM_DEV_IRQ_ASSIGN_ENABLE_MSI;
+    r = kvm_assign_irq(kvm_context, &assigned_irq_data);
+    if (r < 0) {
+        perror("assigned_dev_enable_msi");
+        assigned_dev->cap.enabled &= ~ASSIGNED_DEVICE_MSI_ENABLED;
+        /* Fail to enable MSI, enable INTx instead */
+        assigned_dev_update_irq(pci_dev);
+    }
+}
+#endif
+
+void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address,
+                                          uint32_t val, int len)
+{
+    AssignedDevice *assigned_dev = container_of(pci_dev, AssignedDevice, dev);
+    uint32_t pos = pci_dev->cap.start;
+    uint8_t target_byte, target_position;
+
+    pci_default_cap_write_config(pci_dev, address, val, len);
+#ifdef KVM_CAP_DEVICE_MSI
+    /* Check if guest want to enable MSI */
+    if (assigned_dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) {
+        target_position = pos + 2;
+        if (address <= target_position && address + len > target_position) {
+            target_byte = (uint8_t)(val >> (target_position - address));
+            if (target_byte == 1) {
+                assigned_dev->cap.enabled |= ASSIGNED_DEVICE_MSI_ENABLED;
+                assigned_dev_enable_msi(pci_dev);
+                if (!assigned_dev->cap.enabled & ASSIGNED_DEVICE_MSI_ENABLED)
+                    pci_dev->cap.config[target_position - pos] = 0;
+            }
+        }
+        pos += PCI_CAPABILITY_CONFIG_MSI_LENGTH;
+    }
+#endif
+    return;
+}
+
+void assigned_device_pci_cap_init(PCIDevice *pci_dev)
+{
+    AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev);
+
+#ifdef KVM_CAP_DEVICE_MSI
+    /* Expose MSI capability
+     * MSI capability is the 1st capability in cap.config */
+    if (dev->cap.available & ASSIGNED_DEVICE_CAP_MSI) {
+        pci_dev->cap.config[0] = 0x5;
+        pci_dev->cap.length += PCI_CAPABILITY_CONFIG_MSI_LENGTH;
+    }
+#endif
+}
+
 struct PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus)
 {
     int r;
@@ -580,6 +651,11 @@ struct PCIDevice *init_assigned_device(AssignedDevInfo *adev, PCIBus *bus)
     dev->h_busnr = adev->bus;
     dev->h_devfn = PCI_DEVFN(adev->dev, adev->func);
 
+    if (dev->cap.available)
+        pci_enable_capability_support(pci_dev, 0, NULL,
+                                      assigned_device_pci_cap_write_config,
+   

[PATCH 3/5] Figure out device capability

2008-11-24 Thread Sheng Yang
Try to figure out device capabilities in update_dev_cap(). For now we only
care about the MSI capability.

The function pci_find_cap_offset() was originally written by Allen for Xen.
Note that it needs root privileges to work, and that it depends on libpci.

Signed-off-by: Allen Kay [EMAIL PROTECTED]
Signed-off-by: Sheng Yang [EMAIL PROTECTED]
---
 qemu/hw/device-assignment.c |   50 +++
 qemu/hw/device-assignment.h |5 
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 786b2f0..d3105bc 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -216,6 +216,35 @@ static void assigned_dev_ioport_map(PCIDevice *pci_dev, int region_num,
           (r_dev->v_addrs + region_num));
 }
 
+uint8_t pci_find_cap_offset(struct pci_dev *pci_dev, uint8_t cap)
+{
+    int id;
+    int max_cap = 48;
+    int pos = PCI_CAPABILITY_LIST;
+    int status;
+
+    status = pci_read_byte(pci_dev, PCI_STATUS);
+    if ((status & PCI_STATUS_CAP_LIST) == 0)
+        return 0;
+
+    while (max_cap--) {
+        pos = pci_read_byte(pci_dev, pos);
+        if (pos < 0x40)
+            break;
+
+        pos &= ~3;
+        id = pci_read_byte(pci_dev, pos + PCI_CAP_LIST_ID);
+
+        if (id == 0xff)
+            break;
+        if (id == cap)
+            return pos;
+
+        pos += PCI_CAP_LIST_NEXT;
+    }
+    return 0;
+}
+
 static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
                                           uint32_t val, int len)
 {
@@ -367,6 +396,25 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
     return 0;
 }
 
+static void update_dev_cap(AssignedDevice *pci_dev, uint8_t r_bus,
+                           uint8_t r_dev, uint8_t r_func)
+{
+#ifdef KVM_CAP_DEVICE_MSI
+    struct pci_access *pacc;
+    struct pci_dev *pdev;
+    int r;
+
+    pacc = pci_alloc();
+    pci_init(pacc);
+    pdev = pci_get_dev(pacc, 0, r_bus, r_dev, r_func);
+    pci_cleanup(pacc);
+    r = pci_find_cap_offset(pdev, PCI_CAP_ID_MSI);
+    if (r)
+        pci_dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
+    pci_free_dev(pdev);
+#endif
+}
+
 static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus,
                            uint8_t r_dev, uint8_t r_func)
 {
@@ -436,6 +484,8 @@ again:
     fclose(f);
 
     dev->region_number = r;
+
+    update_dev_cap(pci_dev, r_bus, r_dev, r_func);
     return 0;
 }
 
diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h
index d6caa67..de60988 100644
--- a/qemu/hw/device-assignment.h
+++ b/qemu/hw/device-assignment.h
@@ -29,6 +29,7 @@
 #define __DEVICE_ASSIGNMENT_H__
 
 #include <sys/mman.h>
+#include <pci/pci.h>
 #include "qemu-common.h"
 #include "sys-queue.h"
 #include "pci.h"
@@ -80,6 +81,10 @@ typedef struct {
     unsigned char h_busnr;
     unsigned int h_devfn;
     int bound;
+    struct {
+#define ASSIGNED_DEVICE_CAP_MSI (1 << 0)
+        int available;
+    } cap;
 } AssignedDevice;
 
 typedef struct AssignedDevInfo AssignedDevInfo;
 
 typedef struct AssignedDevInfo AssignedDevInfo;
-- 
1.5.4.5
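
The capability list walk that pci_find_cap_offset() performs can be exercised
without root privileges or libpci by running it over an in-memory copy of
config space. The sketch below does that. toy_find_cap and the TOY_* constants
are hypothetical names; the offsets used (status byte at 0x06 with the
capability-list bit 0x10, capabilities pointer at 0x34, ID at +0 and next
pointer at +1, MSI ID 0x05) follow the standard PCI layout as I understand it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PCI_STATUS          0x06  /* low byte of the status register */
#define TOY_PCI_STATUS_CAP_LIST 0x10  /* device implements a capability list */
#define TOY_PCI_CAPABILITY_LIST 0x34  /* offset of the first list entry */
#define TOY_PCI_CAP_ID_MSI      0x05

/* Returns the config-space offset of capability 'cap', or 0 if absent. */
static uint8_t toy_find_cap(const uint8_t cfg[256], uint8_t cap)
{
    int ttl = 48;  /* bound the walk so a looping list cannot hang us */
    uint8_t pos;

    assert(cfg != NULL);
    if ((cfg[TOY_PCI_STATUS] & TOY_PCI_STATUS_CAP_LIST) == 0)
        return 0;

    pos = cfg[TOY_PCI_CAPABILITY_LIST];
    while (ttl-- && pos >= 0x40) {
        pos &= 0xfc;            /* entries are dword aligned */
        if (cfg[pos] == 0xff)   /* all-ones ID: treat as end of list */
            break;
        if (cfg[pos] == cap)
            return pos;
        pos = cfg[pos + 1];     /* follow the next pointer */
    }
    return 0;
}
```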



Re: [PATCH] always assign userspace_addr

2008-11-24 Thread Glauber Costa

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b1953ee..f605bba 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -735,11 +735,17 @@ int __kvm_set_memory_region(struct kvm *kvm,
>     base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
>     npages = mem->memory_size >> PAGE_SHIFT;
>
> -   if (!npages)
> -       mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
> -
>     new = old = *memslot;
>
> +   if (!npages) {
> +       mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
> +       kvm_arch_flush_shadow(kvm);
> +       kvm_free_physmem_slot(memslot, NULL);
> +       kvm_arch_set_memory_region(kvm, mem, old, user_alloc);
> +       goto out;
> +   }
> +
> +
>     new.base_gfn = base_gfn;

Any comments about this version? In the absence of any, I'll submit a
version with a SoB for inclusion.

>     new.npages = npages;
>     new.flags = mem->flags;
> @@ -812,9 +818,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
>     }
>  #endif /* not defined CONFIG_S390 */
>
> -   if (!npages)
> -       kvm_arch_flush_shadow(kvm);
> -
>     spin_lock(&kvm->mmu_lock);
>     if (mem->slot >= kvm->nmemslots)
>         kvm->nmemslots = mem->slot + 1;





-- 
Glauber  Costa.
Free as in Freedom
http://glommer.net

The less confident you are, the more serious you have to act.


[PATCH] Prevent trace call into unloaded module text

2008-11-24 Thread Wu Fengguang
Add marker_synchronize_unregister() before module unloading.
This prevents possible trace calls into unloaded module text.

Signed-off-by: Wu Fengguang [EMAIL PROTECTED]
---
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a87f45e..64f38b3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2102,5 +2102,6 @@ void kvm_exit(void)
kvm_arch_exit();
kvm_exit_debug();
__free_page(bad_page);
+   marker_synchronize_unregister();
 }
 EXPORT_SYMBOL_GPL(kvm_exit);


Re: Is this a bug in qemu-img?

2008-11-24 Thread walt

walt wrote:


...
BTW, I've been through the same steps twice and
get the same results, so I don't think it's flakey
hardware. OTOH today is a new day, so I'll try it
again to triple check.


Tried again all the way from the beginning and got
the same result.  The commit step is where things go
wrong every time.

I know qcow2 is not considered quite ready for prime
time, but having that commit feature is important to
me so I'd love to see it work correctly.

Any chance that 'commit' could be added to raw as
well as qcow2?

Thanks.



Re: 2.6.27.5 guest boot failure using in-kernel PIT

2008-11-24 Thread Glauber Costa
On Fri, Nov 21, 2008 at 3:10 PM, Marcelo Tosatti [EMAIL PROTECTED] wrote:
> Hi Jan,
>
> On Fri, Nov 21, 2008 at 08:54:56AM +0100, Jan Kiszka wrote:
>> Eduardo Habkost wrote:
>>> On Thu, Nov 20, 2008 at 12:22:53PM -0200, Eduardo Habkost wrote:
>>>> Hi,
>>>>
>>>> When using a kvm.git kernel as host, I am getting guest boot failures
>>>> when booting Fedora Rawhide kernel (2.6.27.5-117.fc10.x86_64). Guest
>>>> stops booting at:
>>>>
>>>> ENABLING IO-APIC IRQs
>>>> ..TIMER: vector=0x30 apic1=0 pin1=0 apic2=-1 pin2=-1
>>>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>>>> ...trying to set up timer (IRQ0) through the 8259A ...
>>>> . (found apic 0 pin 0) ...
>>>> ... failed.
>>>> ...trying to set up timer as Virtual Wire IRQ...
>>>> . failed.
>>>> ...trying to set up timer as ExtINT IRQ...
>>>
>>> I've just found out this problem happens because the guest has HZ=1000
>>> and the host had HZ=250 and no CONFIG_HIGH_RES_TIMERS.
>>>
>>> With this setup, the host is not managing to inject enough timer
>>> interrupts during the mdelay() loop on timer_irq_works().
>>>
>>
>> Interesting, and plausible.
>>
>> My observation so far is a sporadic test failure, often correlating with
>> some raised host OS load. I'm running a high-res kernel, but that cannot
>> prevent that this only 10 ticks long loop of the guest may obtain too
>> few CPU cycles to handle enough of them once in a while (IIRC, it needs
>> 4 out of the 10 ticks to declare the timer routing functional).
>
> Using in-kernel PIT?
>
> This is a potential problem which can be worked around by disabling the
> whole thing either via no_timer_check or paravirt equivalent (Glauber?)
> but for the non-paravirt case it seems it's not the culprit. Possible
> failure scenarios:

For the KVM_CLOCK case, I believe there's absolutely no reason to be more
complicated than that:

+extern int no_timer_check;
+
 void __init kvmclock_init(void)
 {
if (!kvm_para_available())
@@ -178,6 +180,8 @@ void __init kvmclock_init(void)
if (kvmclock && kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE)) {
if (kvm_register_clock("boot clock"))
return;
+
+   no_timer_check = 1; 
pv_time_ops.get_wallclock = kvm_get_wallclock;
pv_time_ops.set_wallclock = kvm_set_wallclock;
pv_time_ops.sched_clock = kvm_clock_read;
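
The HZ mismatch described in the quoted report can be checked with simple
arithmetic. The sketch below is a back-of-the-envelope model only, under these
assumptions: the host delivers at most one guest timer interrupt per host
tick, no missed ticks are reinjected inside the window, and the check needs 4
of 10 guest ticks (the figure quoted from memory in the thread). The function
names toy_host_ticks_in_window and toy_timer_check_passes are hypothetical.

```c
#include <assert.h>

/* Host timer ticks that fit into a window of 'guest_ticks' guest ticks. */
static long toy_host_ticks_in_window(long guest_hz, long host_hz,
                                     long guest_ticks)
{
    assert(guest_hz > 0 && host_hz > 0);
    long window_us = guest_ticks * 1000000L / guest_hz;
    return window_us * host_hz / 1000000L;
}

/* Would the guest's timer check pass, given only tick-driven injection? */
static int toy_timer_check_passes(long guest_hz, long host_hz)
{
    const long window_ticks = 10;  /* loop length quoted in the thread */
    const long needed = 4;         /* threshold quoted in the thread (IIRC) */

    return toy_host_ticks_in_window(guest_hz, host_hz, window_ticks) >= needed;
}
```

With a guest at HZ=1000 the window is 10 ms, and a HZ=250 host without
high-resolution timers fires only about twice in that time, which is below the
threshold. That matches the failure in the report.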


Re: [PATCH 00/12] x86: disable virt on kdump and emergency_restart (v4)

2008-11-24 Thread Eduardo Habkost
On Fri, Nov 21, 2008 at 06:07:36PM +0200, Avi Kivity wrote:
<snip>

> Eduardo, please check the merge (there was a small conflict in reboot.c
> which I fixed) once I push it.  Also, when generating patches that move
> files, use the -M switch: this makes it easier to review, and also
> handles files that change better.

The merge looks ok. I didn't know about -M, I will use it next time.

Thanks!

-- 
Eduardo


Re: KVM: MMU: optimize set_spte for page sync

2008-11-24 Thread Marcelo Tosatti
On Sun, Nov 23, 2008 at 12:36:29PM +0200, Avi Kivity wrote:
 Marcelo Tosatti wrote:

 The cost of hash table and memslot lookups are quite significant if the
 workload is pagetable write intensive resulting in increased mmu_lock
 contention.

 @@ -1593,7 +1593,16 @@ static int set_spte(struct kvm_vcpu *vcp
  spte |= PT_WRITABLE_MASK;
  -   if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
 +/*
 + * Optimization: for pte sync, if spte was writable the hash
 + * lookup is unnecessary (and expensive). Write protection
 + * is responsibility of mmu_get_page / kvm_sync_page.
 + * Same reasoning can be applied to dirty page accounting.
 + */
 +if (sync_page && is_writeble_pte(*shadow_pte))
 +goto set_pte;
   

 What if *shadow_pte points at a different page?  Is that possible?

To a different gfn? Then sync_page will have nuked the spte:

if (gpte_to_gfn(gpte) != gfn || !is_present_pte(gpte) ||
    !(gpte & PT_ACCESSED_MASK)) {
	u64 nonpresent;
	..
	set_shadow_pte(&sp->spt[i], nonpresent);
}

Otherwise:

/*
 * Using the cached information from sp->gfns is safe because:
 * - The spte has a reference to the struct page, so the pfn for a given
 * gfn can't change unless all sptes pointing to it are nuked first.



[RFT] Rebased gdb/debug register patches

2008-11-24 Thread Jan Kiszka
Hi,

this is not yet the official submission, but a request for testing:

I'm happy to announce the availability of a rebased patch series to
enhance KVM's guest debugging support as well as to add debug register
emulation. It was rebased because QEMU mainline recently accepted the
core of my corresponding bits and KVM has merged them over. A few
patches are still awaiting QEMU merge, and two of them are mandatory to
provide a clean foundation for the KVM changes - therefore this
intermediate step.

To test the series, check out the kernel bits from

git://git.kiszka.org/linux-kvm.git gdb-queue

and the user space part from

git://git.kiszka.org/kvm-userspace.git gdb-queue

Early feedback welcome, also before the final submission. And if someone
could look into AMD/SVM implementation, this would also be great
(unfortunately, there is no customer need for it ATM, thus no resources).

Enjoy,
Jan

-- 
Siemens AG, Corporate Technology, CT SE 2 ES-OS
Corporate Competence Center Embedded Linux


[PATCH] kvm-userspace: Cleanup user space NMI injection

2008-11-24 Thread Jan Kiszka
Clean up the redundant check for an open NMI window before injecting. This
will no longer be supported by the kernel, and it was broken by design
anyway.

This change still allows running the user space against older kernel
modules.

Signed-off-by: Jan Kiszka [EMAIL PROTECTED]
---

 libkvm/libkvm.c |   20 +++-
 libkvm/libkvm.h |   13 +
 qemu/qemu-kvm-x86.c |   16 ++--
 qemu/qemu-kvm.c |6 +++---
 qemu/qemu-kvm.h |2 +-
 user/main.c |5 ++---
 6 files changed, 16 insertions(+), 46 deletions(-)

diff --git a/libkvm/libkvm.c b/libkvm/libkvm.c
index f6948f5..40c95ce 100644
--- a/libkvm/libkvm.c
+++ b/libkvm/libkvm.c
@@ -832,9 +832,9 @@ int try_push_interrupts(kvm_context_t kvm)
 	return kvm->callbacks->try_push_interrupts(kvm->opaque);
 }
 
-int try_push_nmi(kvm_context_t kvm)
+void push_nmi(kvm_context_t kvm)
 {
-	return kvm->callbacks->try_push_nmi(kvm->opaque);
+	kvm->callbacks->push_nmi(kvm->opaque);
 }
 
 void post_kvm_run(kvm_context_t kvm, void *env)
@@ -861,17 +861,6 @@ int kvm_is_ready_for_interrupt_injection(kvm_context_t kvm, int vcpu)
 	return run->ready_for_interrupt_injection;
 }
 
-int kvm_is_ready_for_nmi_injection(kvm_context_t kvm, int vcpu)
-{
-#ifdef KVM_CAP_NMI
-	struct kvm_run *run = kvm->run[vcpu];
-
-	return run->ready_for_nmi_injection;
-#else
-	return 0;
-#endif
-}
-
 int kvm_run(kvm_context_t kvm, int vcpu, void *env)
 {
 	int r;
@@ -880,7 +869,7 @@ int kvm_run(kvm_context_t kvm, int vcpu, void *env)
 
 again:
 #ifdef KVM_CAP_NMI
-	run->request_nmi_window = try_push_nmi(kvm);
+	push_nmi(kvm);
 #endif
 #if !defined(__s390__)
 	if (!kvm->irqchip_in_kernel)
@@ -957,9 +946,6 @@ again:
r = handle_halt(kvm, vcpu);
break;
case KVM_EXIT_IRQ_WINDOW_OPEN:
-#ifdef KVM_CAP_NMI
-   case KVM_EXIT_NMI_WINDOW_OPEN:
-#endif
break;
case KVM_EXIT_SHUTDOWN:
r = handle_shutdown(kvm, env);
diff --git a/libkvm/libkvm.h b/libkvm/libkvm.h
index aae9f03..aaad4fb 100644
--- a/libkvm/libkvm.h
+++ b/libkvm/libkvm.h
@@ -66,7 +66,7 @@ struct kvm_callbacks {
 int (*shutdown)(void *opaque, void *env);
 int (*io_window)(void *opaque);
 int (*try_push_interrupts)(void *opaque);
-int (*try_push_nmi)(void *opaque);
+void (*push_nmi)(void *opaque);
 void (*post_kvm_run)(void *opaque, void *env);
 int (*pre_kvm_run)(void *opaque, void *env);
 int (*tpr_access)(void *opaque, int vcpu, uint64_t rip, int is_write);
@@ -217,17 +217,6 @@ uint64_t kvm_get_apic_base(kvm_context_t kvm, int vcpu);
 int kvm_is_ready_for_interrupt_injection(kvm_context_t kvm, int vcpu);
 
 /*!
- * \brief Check if a vcpu is ready for NMI injection
- *
- * This checks if vcpu is not already running in NMI context.
- *
- * \param kvm Pointer to the current kvm_context
- * \param vcpu Which virtual CPU should get dumped
- * \return boolean indicating NMI injection readiness
- */
-int kvm_is_ready_for_nmi_injection(kvm_context_t kvm, int vcpu);
-
-/*!
  * \brief Read VCPU registers
  *
  * This gets the GP registers from the VCPU and outputs them
diff --git a/qemu/qemu-kvm-x86.c b/qemu/qemu-kvm-x86.c
index a4ae7ed..671b5b3 100644
--- a/qemu/qemu-kvm-x86.c
+++ b/qemu/qemu-kvm-x86.c
@@ -667,22 +667,18 @@ int kvm_arch_try_push_interrupts(void *opaque)
     return (env->interrupt_request & CPU_INTERRUPT_HARD) != 0;
 }
 
-int kvm_arch_try_push_nmi(void *opaque)
+void kvm_arch_push_nmi(void *opaque)
 {
     CPUState *env = cpu_single_env;
     int r;
 
     if (likely(!(env->interrupt_request & CPU_INTERRUPT_NMI)))
-        return 0;
-
-    if (kvm_is_ready_for_nmi_injection(kvm_context, env->cpu_index)) {
-        env->interrupt_request &= ~CPU_INTERRUPT_NMI;
-        r = kvm_inject_nmi(kvm_context, env->cpu_index);
-        if (r < 0)
-            printf("cpu %d fail inject NMI\n", env->cpu_index);
-    }
+        return;
 
-    return (env->interrupt_request & CPU_INTERRUPT_NMI) != 0;
+    env->interrupt_request &= ~CPU_INTERRUPT_NMI;
+    r = kvm_inject_nmi(kvm_context, env->cpu_index);
+    if (r < 0)
+        printf("cpu %d fail inject NMI\n", env->cpu_index);
 }
 
 
 void kvm_arch_update_regs_for_sipi(CPUState *env)
diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
index 8b4cdd6..cf0e85d 100644
--- a/qemu/qemu-kvm.c
+++ b/qemu/qemu-kvm.c
@@ -154,9 +154,9 @@ static int try_push_interrupts(void *opaque)
 return kvm_arch_try_push_interrupts(opaque);
 }
 
-static int try_push_nmi(void *opaque)
+static void push_nmi(void *opaque)
 {
-return kvm_arch_try_push_nmi(opaque);
+kvm_arch_push_nmi(opaque);
 }
 
 static void post_kvm_run(void *opaque, void *data)
@@ -742,7 +742,7 @@ static struct kvm_callbacks qemu_kvm_ops = {
 .shutdown = kvm_shutdown,
 .io_window = kvm_io_window,
 .try_push_interrupts = try_push_interrupts,
-.try_push_nmi = try_push_nmi,
+.push_nmi = 

[ kvm-Bugs-2327497 ] NFS copy makes guest network unstable

2008-11-24 Thread SourceForge.net
Bugs item #2327497, was opened at 2008-11-22 17:53
Message generated for change (Comment added) made by avik
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: Jiajun Xu (jiajun)
Assigned to: Nobody/Anonymous (nobody)
Summary: NFS copy makes guest network unstable

Initial Comment:
NFS networking in a KVM guest is very unstable. When we copy a 600 MB file to
the guest over an NFS mount, the guest's network goes down after about 500 MB
has been transferred.
The host cannot use ping or scp either, and sometimes the host also complains:
"ping: sendmsg: No buffer space available". According to 'free', only 69 MB of
memory is free (out of 8 GB total on the machine!).

Copying the file with scp does not reproduce the problem. The issue is easy to
reproduce (about 50% of attempts).


Steps to reproduce:

1. Create a guest and configure an NFS shared folder on it
2. Mount the NFS folder on a local folder (/media)
3. cp xxx /media
4. After some time, the guest network is down


--

Comment By: Avi Kivity (avik)
Date: 2008-11-24 17:31

Message:
Seems to be a bug in the 8139too driver.  Please try with the 8139cp driver
(which has much better performance).

--

Comment By: Jiajun Xu (jiajun)
Date: 2008-11-24 09:01

Message:
We did not test this case before.
I think the issue already existed in earlier versions.

--

Comment By: Avi Kivity (avik)
Date: 2008-11-23 23:13

Message:
It's almost certainly a problem with the qemu process, not the bridge.

--

Comment By: Fabio Coatti (cova)
Date: 2008-11-23 22:15

Message:
I can't easily find out which kvm version worked (nor be sure that the kvm
executable itself is at fault), as quite a lot of subsystems are involved
and some time passed before we spotted the problem (kvm itself, the network
bridge and the host kernel may all be involved, of course). I'm now trying
to find the combination that worked, but at the same time I'm willing to
run tests on the current non-working setup to gather hints, as bisection
can be a very daunting task (this issue was noticed after several
upgrades).


--

Comment By: Avi Kivity (avik)
Date: 2008-11-23 20:40

Message:
Is this a regression, or a new test?

If it is a regression, what was the last version that worked?

--

Comment By: Fabio Coatti (cova)
Date: 2008-11-23 17:12

Message:
I can confirm similar behaviour: a kvm machine gets large amounts of
data via HTTP and saves those files over NFS (file sizes are in
the range of 4-20 MB approx., and the machine downloads several of those
files). After some time (I don't have a precise figure, but some hundreds
of MB) the guest network goes down. There is no answer even to pings
coming from outside.
The guest uses virtio network drivers (as normal drivers are way too
slow).
Host machine: 64-bit AMD dual quad-core, 16 GB; tried with several kernels
ranging from 2.6.27.4 to 2.6.25.19.
Guest: 32-bit kvm machines (tried 76/77/78), both UP and SMP
configurations. Kernels: same as host machine.
Network setup:
bridged network with a br0 device on the host machine. We are using 2 VLANs
for the guest and we have tried all the configurations (a single tap with
VLANs resolved on the guest side, then two taps and so two interfaces on the
guest machine, and so on) without any improvement. I can exclude MTU issues,
as we have seen and solved those; this issue is completely different.
At some point, sniffing traffic on the host interfaces, we are able to see
only ARP requests coming from the guest, nothing more.

I understand that this data is in no way complete, but I'm willing to do any
debugging if someone gives me a hint on how to do so correctly. Thanks.


--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599


[ kvm-Bugs-2327497 ] NFS copy makes guest network unstable

2008-11-24 Thread SourceForge.net
Bugs item #2327497, was opened at 2008-11-22 17:53
Message generated for change (Settings changed) made by avik
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Pending
Resolution: None
Priority: 5
Private: No
Submitted By: Jiajun Xu (jiajun)
Assigned to: Nobody/Anonymous (nobody)
Summary: NFS copy makes guest network unstable

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599


[ kvm-Bugs-2327497 ] NFS copy makes guest network unstable

2008-11-24 Thread SourceForge.net
Bugs item #2327497, was opened at 2008-11-22 16:53
Message generated for change (Comment added) made by cova
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Pending
Resolution: None
Priority: 5
Private: No
Submitted By: Jiajun Xu (jiajun)
Assigned to: Nobody/Anonymous (nobody)
Summary: NFS copy makes guest network unstable

--

Comment By: Fabio Coatti (cova)
Date: 2008-11-24 16:47

Message:
I wouldn't be so sure that the 8139 driver is the culprit. We are seeing
this with the e1000 and virtio drivers...

--

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=2327497&group_id=180599


[PATCH] really remove a slot when a user asks us to

2008-11-24 Thread Glauber Costa
Right now, KVM does not remove a slot when we do a
register ioctl for size 0 (which would be the expected behaviour).

Instead, we only mark it as empty, but keep all bitmaps
and allocated data structures present. This completely
nullifies our chances of reusing that same slot again
for mapping a different piece of memory.

In this patch, we destroy rmaps, and vfree() the
pointers that used to hold the dirty bitmap, rmap
and lpage_info structures.

Signed-off-by: Glauber Costa [EMAIL PROTECTED]
---
 virt/kvm/kvm_main.c |   15 +--
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b1953ee..f605bba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -735,11 +735,17 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	base_gfn = mem->guest_phys_addr >> PAGE_SHIFT;
 	npages = mem->memory_size >> PAGE_SHIFT;
 
-	if (!npages)
-		mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
-
 	new = old = *memslot;
 
+	if (!npages) {
+		mem->flags &= ~KVM_MEM_LOG_DIRTY_PAGES;
+		kvm_arch_flush_shadow(kvm);
+		kvm_free_physmem_slot(memslot, NULL);
+		kvm_arch_set_memory_region(kvm, mem, old, user_alloc);
+		goto out;
+	}
+
+
 	new.base_gfn = base_gfn;
 	new.npages = npages;
 	new.flags = mem->flags;
@@ -812,9 +818,6 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	}
 #endif /* not defined CONFIG_S390 */
 
-	if (!npages)
-		kvm_arch_flush_shadow(kvm);
-
 	spin_lock(&kvm->mmu_lock);
 	if (mem->slot >= kvm->nmemslots)
 		kvm->nmemslots = mem->slot + 1;
-- 
1.5.6.5
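
To illustrate the behavioural difference described in the commit message, here is a small user-space model of "free on size 0" versus "mark empty". It is a sketch under loud assumptions: struct slot_model, slot_free and slot_set_region are hypothetical names, malloc-family calls stand in for the kernel allocators, and none of this is the code in kvm_main.c.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for a KVM memory slot. */
struct slot_model {
    unsigned long npages;
    unsigned long *dirty_bitmap;  /* one bit per page */
    unsigned long *rmap;          /* one entry per page */
};

/* Release everything the slot owns and return it to the unused state. */
static void slot_free(struct slot_model *s)
{
    free(s->dirty_bitmap);
    free(s->rmap);
    s->dirty_bitmap = NULL;
    s->rmap = NULL;
    s->npages = 0;
}

/*
 * Register npages for the slot. A size of 0 removes the slot outright
 * instead of merely marking it empty, so it can be reused later for a
 * different mapping. Returns 0 on success, -1 on allocation failure.
 */
static int slot_set_region(struct slot_model *s, unsigned long npages)
{
    const unsigned long bits = sizeof(unsigned long) * CHAR_BIT;

    slot_free(s);                 /* drop any previous mapping first */
    if (npages == 0)
        return 0;                 /* removal requested, nothing to allocate */

    s->rmap = calloc(npages, sizeof(*s->rmap));
    s->dirty_bitmap = calloc((npages + bits - 1) / bits,
                             sizeof(*s->dirty_bitmap));
    if (!s->rmap || !s->dirty_bitmap) {
        slot_free(s);
        return -1;
    }
    s->npages = npages;
    return 0;
}
```

In this model, registering size 0 leaves no stale bitmap or rmap behind, so a later registration of the same slot starts from a clean state.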



[PATCH] sign kvmclock as paravirt

2008-11-24 Thread Glauber Costa
Currently, we only set the KVM paravirt signature in case
of CONFIG_KVM_GUEST. However, it is possible to have it turned
off, while CONFIG_KVM_CLOCK is turned on. This is also a paravirt
case, and should be shown accordingly.

Signed-off-by: Glauber Costa [EMAIL PROTECTED]
---
 arch/x86/kernel/kvmclock.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 1c9cc43..4a1ee5a 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -194,5 +194,7 @@ void __init kvmclock_init(void)
 #endif
 		kvm_get_preset_lpj();
 		clocksource_register(&kvm_clock);
+		pv_info.paravirt_enabled = 1;
+		pv_info.name = "KVM";
 	}
 }
-- 
1.5.6.5



Re: KVM: MMU: optimize set_spte for page sync

2008-11-24 Thread Marcelo Tosatti
On Mon, Nov 24, 2008 at 01:04:23PM +0100, Marcelo Tosatti wrote:
 On Sun, Nov 23, 2008 at 12:36:29PM +0200, Avi Kivity wrote:
  Marcelo Tosatti wrote:
 
  The cost of hash table and memslot lookups are quite significant if the
  workload is pagetable write intensive resulting in increased mmu_lock
  contention.
 
  @@ -1593,7 +1593,16 @@ static int set_spte(struct kvm_vcpu *vcp
 spte |= PT_WRITABLE_MASK;
   - if (mmu_need_write_protect(vcpu, gfn, can_unsync)) {
  +  /*
  +   * Optimization: for pte sync, if spte was writable the hash
  +   * lookup is unnecessary (and expensive). Write protection
  +   * is responsibility of mmu_get_page / kvm_sync_page.
  +   * Same reasoning can be applied to dirty page accounting.
  +   */
  +  if (sync_page && is_writeble_pte(*shadow_pte))
  +  goto set_pte;

 
  What if *shadow_pte points at a different page?  Is that possible?

 To a different gfn? Then sync_page will have nuked the spte:
 
 if (gpte_to_gfn(gpte) != gfn || !is_present_pte(gpte) ||
     !(gpte & PT_ACCESSED_MASK)) {
 	u64 nonpresent;
 	..
 	set_shadow_pte(&sp->spt[i], nonpresent);
 }
 
 Otherwise:
 
 /*
  * Using the cached information from sp->gfns is safe because:
  * - The spte has a reference to the struct page, so the pfn for a given
  * gfn can't change unless all sptes pointing to it are nuked first.

*shadow_pte can point to a different page if the guest updates the
pagetable and there is a fault before resync: the fault updates the
spte with the new gfn (and pfn) via mmu_set_spte. In that case the gfn
cache is updated, since:

	} else if (pfn != spte_to_pfn(*shadow_pte)) {
		printk("hfn old %lx new %lx\n",
		       spte_to_pfn(*shadow_pte), pfn);
		rmap_remove(vcpu->kvm, shadow_pte);




hrtimer_forward() semantics when using non-high-res timers

2008-11-24 Thread Eduardo Habkost
Hi, Thomas,

I've been looking at a timer problem on KVM recently[1] and I've got a
question about the expected semantics of hrtimer_forward().

The problem I am looking at is related to having proper accounting of
missed ticks in the KVM timer code when the host has lost timer
ticks because of high CPU load, or because it doesn't have hrtimers
enabled. hrtimer_forward_now() overrun accounting looked perfect for
the task of checking how many ticks we have lost.

However hrtimer_forward() limits the interval parameter to the timer
resolution, making it useless for calculating how many timer periods we've
lost because of too-low timer resolution. I am even a bit surprised no
other code needs a hrtimer_forward-like function for that, yet.

For example: if we want to account for a tick every 1 ms and the host
has HZ=250 and no high-resolution timers, calling hrtimer_forward_now()
on every timer tick will normally return 1 because it will count how
many 4 ms periods were added to the timer expiration time. However,
I would like to calculate how many 1 ms periods I've lost, no matter
what the real timer resolution is.
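
The example above can be worked through with a small user-space model of the overrun accounting. This is a sketch, not the kernel function: forward_model and its parameters are hypothetical names, times are plain nanosecond integers, and min_interval_ns stands in for the resolution-based lower limit (pass 0 to model the variant without the limit).

```c
#include <stdint.h>

/*
 * Advance *expires_ns past now_ns in steps of interval_ns and return how
 * many periods were consumed (0 if the timer has not expired yet).
 * If interval_ns is below min_interval_ns it is raised to that value,
 * which models the clamp to the timer resolution.
 */
static uint64_t forward_model(int64_t *expires_ns, int64_t now_ns,
                              int64_t interval_ns, int64_t min_interval_ns)
{
    uint64_t overruns = 1;
    int64_t delta = now_ns - *expires_ns;

    if (delta < 0)
        return 0;                       /* not expired yet */

    if (interval_ns < min_interval_ns)
        interval_ns = min_interval_ns;  /* the lower limit in question */

    if (delta >= interval_ns) {
        /* Skip all whole periods that have already elapsed. */
        int64_t whole = delta / interval_ns;

        *expires_ns += interval_ns * whole;
        overruns += (uint64_t)whole;
    }
    *expires_ns += interval_ns;         /* move to the next future expiry */
    return overruns;
}
```

With a 1 ms period, a 4 ms minimum (HZ=250) and a tick arriving 0.1 ms late, the clamped call returns 1 per host tick; with the minimum set to 0 and the expiry last left at 1 ms, the same tick at 4.1 ms returns 4, which is the count of lost 1 ms periods being asked for.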

I could do my own missed-ticks calculation, but the hrtimer_forward()
logic would be perfect for my needs if it didn't have the resolution check
code, and I don't feel like duplicating part of hrtimer_forward(). Do you
think it would make sense to have in the timers API a hrtimer_forward-like
function that doesn't have the interval lower limit?


[1] http://marc.info/?l=kvm&m=122728725028262&w=2

-- 
Eduardo


[PATCH] kvm: ppc: stop leaking host memory on VM exit

2008-11-24 Thread Hollis Blanchard
When the VM exits, we must call put_page() for every page referenced in the
shadow TLB.

Without this patch, we usually leak 30-50 host pages (120 - 200 KiB with 4 KiB
pages). The maximum number of pages leaked is the size of our shadow TLB, 64
pages.

Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]
---
The obvious question is why didn't we see this before? Basically, we'd never
looked for it, and since most of our work was in the kernel we always ended up
rebooting before exhausting host memory.

Since it's such a large leak, and a simple fix, please commit this for 2.6.28.
This patch does apply to kvm.git with fuzz, but if you prefer I can send a
separate patch for that later.

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -104,4 +104,6 @@ static inline void kvmppc_set_pid(struct
}
 }
 
+extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
+
 #endif /* __POWERPC_KVM_PPC_H__ */
diff --git a/arch/powerpc/kvm/44x_tlb.c b/arch/powerpc/kvm/44x_tlb.c
--- a/arch/powerpc/kvm/44x_tlb.c
+++ b/arch/powerpc/kvm/44x_tlb.c
@@ -124,6 +124,14 @@ static void kvmppc_44x_shadow_release(st
}
 }
 
+void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu)
+{
+	int i;
+
+	for (i = 0; i <= tlb_44x_hwater; i++)
+		kvmppc_44x_shadow_release(vcpu, i);
+}
+
 void kvmppc_tlbe_set_modified(struct kvm_vcpu *vcpu, unsigned int i)
 {
 	vcpu->arch.shadow_tlb_mod[i] = 1;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -238,6 +238,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *
 
 void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+   kvmppc_core_destroy_mmu(vcpu);
 }
 
 /* Note: clearing MSR[DE] just means that the debug interrupt will not be

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: Direct kernel boot without harddrive image

2008-11-24 Thread Daire Byrne

- Daire Byrne [EMAIL PROTECTED] wrote:

 I tried with -no-kvm and I get the same crash when I reboot the VM. I
 suppose it's a qemu bug then. I tried with the latest kvm-qemu (78)
 but perhaps I should try the latest Qemu and if it still breaks report
 the bug on the Qemu mailing list? It is like it forgets to boot the
 kernel and initrd again after a reboot and tries to boot from the
 harddrive instead.

More weirdness with direct booting - using more than 2048MB causes the BIOS to 
repeatedly crash out. This only happens using -kernel and -initrd.

Daire


gettimeofday slow in RHEL4 guests

2008-11-24 Thread David S. Ahern

I noticed that gettimeofday in RHEL4.6 guests is taking much longer than
with RHEL3.8 guests. I wrote a simple program (see below) to call
gettimeofday in a loop 1,000,000 times and then used time to measure how
long it took.


For the RHEL3.8 guest:
time -p ./timeofday_bench
real 0.99
user 0.12
sys 0.24

For the RHEL4.6 guest with the default clock source (pmtmr):
time -p ./timeofday_bench
real 15.65
user 0.18
sys 15.46

and RHEL4.6 guest with PIT as the clock source (clock=pit kernel parameter):
time -p ./timeofday_bench
real 13.67
user 0.21
sys 13.45

So, basically gettimeofday() takes about 50 times as long on a RHEL4 guest.

Host is a DL380G5, 2 dual-core Xeon 5140 processors, 4 GB of RAM. It's
running kvm.git tree as of 11/18/08 with kvm-75 userspace. Guest in both
RHEL3 and RHEL4 cases has 4 vcpus, 3.5GB of RAM.

david

--

timeofday_bench.c:

#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int rc = 0, n;
	struct timeval tv;
	int iter = 100;  /* number of times to call gettimeofday */

	if (argc > 1)
		iter = atoi(argv[1]);

	if (iter == 0) {
		fprintf(stderr, "invalid number of iterations\n");
		return 1;
	}

	printf("starting ");
	for (n = 0; n < iter; ++n) {
		if (gettimeofday(&tv, NULL) != 0) {
			fprintf(stderr, "\ngettimeofday failed\n");
			rc = 1;
			break;
		}
	}

	if (!rc)
		printf("done\n");

	return rc;
}


Re: Is this a bug in qemu-img?

2008-11-24 Thread Charles Duffy

walt wrote:

Any chance that 'commit' could be added to raw as
well as qcow2?


Raw images by their nature can't contain metadata -- they have only the 
exact contents of the virtual drive, which is what makes them raw -- 
so they by definition can't support copy-on-write (and thus commit) or 
other functionality requiring metadata.




[PATCH 1/2] Enable Pass Through Feature in Intel IOMMU

2008-11-24 Thread Fenghua Yu

The patch set adds the kernel parameter intel_iommu=pt to set up pass through
mode in the context mapping entry. This disables DMAR in the Linux kernel, but
KVM still runs on VT-d. In this mode the kernel uses swiotlb for the DMA API
functions, while the other VT-d functionality stays enabled for KVM. KVM always
uses a multi-level translation page table in VT-d. By default, pass through
mode is disabled in the kernel.

This is useful when people don't want to enable VT-d DMAR in the kernel, for
reasons such as kernel IOMMU performance concerns or debugging, but still want
to use KVM.
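
As a rough illustration of the swiotlb decision this patch set changes, the sketch below mirrors the condition from the x86 pci_swiotlb_init() hunk as a standalone userspace function. The function name want_swiotlb() and the fixed PAGE_SHIFT of 12 are assumptions made for the example; the kernel code uses global variables rather than parameters.

```c
#include <stdbool.h>

#define PAGE_SHIFT    12   /* assumed 4 KiB pages */
#define MAX_DMA32_PFN ((4ULL * 1024 * 1024 * 1024) >> PAGE_SHIFT)

/* Hypothetical stand-in for the decision made in pci_swiotlb_init(). */
static bool want_swiotlb(bool iommu_detected, bool no_iommu,
                         unsigned long long max_pfn,
                         bool iommu_pass_through, bool swiotlb_force)
{
    bool swiotlb = false;

    /* Pass through forces swiotlb even though an IOMMU was detected. */
    if ((!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN) ||
        iommu_pass_through)
        swiotlb = true;

    if (swiotlb_force)
        swiotlb = true;

    return swiotlb;
}
```

With an IOMMU detected and no pass through, want_swiotlb() stays false; setting iommu_pass_through flips it to true regardless of memory size.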

Thanks.

-Fenghua


Signed-off-by: Fenghua Yu [EMAIL PROTECTED]
Signed-off-by: Weidong Han [EMAIL PROTECTED]
Signed-off-by: Allen Kay [EMAIL PROTECTED]
Signed-off-by: David Woodhouse [EMAIL PROTECTED]

---

 Documentation/kernel-parameters.txt |5 +++
 arch/ia64/include/asm/iommu.h   |1 
 arch/ia64/kernel/pci-swiotlb.c  |2 -
 arch/x86/include/asm/iommu.h|1 
 arch/x86/kernel/pci-swiotlb_64.c|4 ++-
 drivers/pci/intel-iommu.c   |   47 ++--
 include/linux/dma_remapping.h   |3 ++
 include/linux/intel-iommu.h |3 +-
 8 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e0f346d..b966185 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -931,6 +931,11 @@ and is between 256 and 4096 characters. It is defined in the file
With this option on every unmap_single operation will
result in a hardware IOTLB flush operation as opposed
to batching them for performance.
+   pt  [Default no Pass Through]
+   This option enables Pass Through in context mapping if
+   Pass Through is supported in hardware. With this option
+   DMAR is disabled in kernel and kernel uses swiotlb, but
+   KVM still uses VT-d hardware.
 
io_delay=   [X86-32,X86-64] I/O delay method
0x80
diff --git a/arch/ia64/include/asm/iommu.h b/arch/ia64/include/asm/iommu.h
index 0490794..37d41ca 100644
--- a/arch/ia64/include/asm/iommu.h
+++ b/arch/ia64/include/asm/iommu.h
@@ -9,6 +9,7 @@ extern void pci_iommu_shutdown(void);
 extern void no_iommu_init(void);
 extern int force_iommu, no_iommu;
 extern int iommu_detected;
+extern int iommu_pass_through;
 extern void iommu_dma_init(void);
 extern void machvec_init(const char *name);
 
diff --git a/arch/ia64/kernel/pci-swiotlb.c b/arch/ia64/kernel/pci-swiotlb.c
index 16c5051..69135b0 100644
--- a/arch/ia64/kernel/pci-swiotlb.c
+++ b/arch/ia64/kernel/pci-swiotlb.c
@@ -32,7 +32,7 @@ struct dma_mapping_ops swiotlb_dma_ops = {
 
 void __init pci_swiotlb_init(void)
 {
-   if (!iommu_detected) {
+   if (!iommu_detected || iommu_pass_through) {
 #ifdef CONFIG_IA64_GENERIC
swiotlb = 1;
printk(KERN_INFO "PCI-DMA: Re-initialize machine vector.\n");
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
index 0b500c5..014e94f 100644
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -6,6 +6,7 @@ extern void no_iommu_init(void);
 extern struct dma_mapping_ops nommu_dma_ops;
 extern int force_iommu, no_iommu;
 extern int iommu_detected;
+extern int iommu_pass_through;
 
 extern unsigned long iommu_nr_pages(unsigned long addr, unsigned long len);
 
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
index 3c539d1..4af2425 100644
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -50,8 +50,10 @@ struct dma_mapping_ops swiotlb_dma_ops = {
 void __init pci_swiotlb_init(void)
 {
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
-   if (!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN)
+   if ((!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN) ||
+   iommu_pass_through)
   swiotlb = 1;
+
if (swiotlb_force)
swiotlb = 1;
if (swiotlb) {
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index aec60ad..f164a3c 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -120,7 +120,6 @@ struct context_entry {
(c).lo &= (((u64)-1) << 4) | 3; \
(c).lo |= ((val) & 3) << 2; \
} while (0)
-#define CONTEXT_TT_MULTI_LEVEL 0
 #define context_set_address_root(c, val) \
do {(c).lo |= (val) & VTD_PAGE_MASK; } while (0)
 #define context_set_address_width(c, val) do {(c).hi |= (val) & 7;} while (0)
@@ -203,6 +202,7 @@ static long list_size;
 static void domain_remove_dev_info(struct dmar_domain *domain);
 
 int dmar_disabled;
+int iommu_pass_through;
 static int __initdata dmar_map_gfx = 1;
 static int dmar_forcedac;
 static int intel_iommu_strict;
@@ -231,6 +231,9 @@ static int __init 

[PATCH 2/2] Enable Pass Through Feature in Intel IOMMU

2008-11-24 Thread Fenghua Yu
The patch set adds the kernel parameter intel_iommu=pt to set up pass through
mode in the context mapping entry. This disables DMAR in the Linux kernel, but
KVM still runs on VT-d. In this mode the kernel uses swiotlb for the DMA API
functions, while the other VT-d functionality stays enabled for KVM. By
default, pass through mode is disabled in the kernel.

This second patch changes the context mapping interface called in KVM's vtd.c.
KVM always uses a multi-level translation page table in VT-d.


Signed-off-by: Fenghua Yu [EMAIL PROTECTED]
Signed-off-by: Weidong Han [EMAIL PROTECTED]
Signed-off-by: Allen Kay [EMAIL PROTECTED]
Signed-off-by: David Woodhouse [EMAIL PROTECTED]

---

 vtd.c |2 +-
 1 files changed, 1 insertion(+), 1 deletion(-)


diff --git a/virt/kvm/vtd.c b/virt/kvm/vtd.c
index a770874..7b753d7 100644
--- a/virt/kvm/vtd.c
+++ b/virt/kvm/vtd.c
@@ -124,7 +124,7 @@ int kvm_iommu_map_guest(struct kvm *kvm,
   pdev->bus->number, pdev->devfn);
 
r = intel_iommu_context_mapping(kvm->arch.intel_iommu_domain,
-   pdev);
+   pdev, CONTEXT_TT_MULTI_LEVEL);
if (r) {
printk(KERN_ERR "Domain context map for %s failed",
   pci_name(pdev));


[PATCH] kvm: external module: fix kernel header rsync

2008-11-24 Thread Hollis Blanchard
When the shell encounters a glob it can't expand (like
arch/powerpc/include/asm/vmx*.h), it leaves the raw pattern behind. rsync then
looks for a file named arch/powerpc/include/asm/vmx*.h (without trying to do
its own globbing) and fails.

Fix by using make's $(wildcard) function for the expansion, which does not
leave unexpanded patterns behind.
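
The difference between the two expansion behaviours can be reproduced from C with POSIX glob(3). This is only an illustration of the behaviour described above, not part of the patch: GLOB_NOCHECK makes glob() hand back the raw pattern when nothing matches (as the shell does by default), while a plain glob() call reports no results (as $(wildcard) does). The helper names are hypothetical.

```c
#include <glob.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if expansion yields the unexpanded pattern itself. */
static int leaves_raw_pattern(const char *pattern)
{
    glob_t g;
    int raw = 0;

    /* GLOB_NOCHECK: on no match, return the pattern as the only result. */
    if (glob(pattern, GLOB_NOCHECK, NULL, &g) == 0) {
        raw = (g.gl_pathc == 1 && strcmp(g.gl_pathv[0], pattern) == 0);
        globfree(&g);
    }
    return raw;
}

/* Returns the number of matches, 0 when nothing matches. */
static size_t count_matches(const char *pattern)
{
    glob_t g;
    size_t n = 0;

    if (glob(pattern, 0, NULL, &g) == 0) {
        n = g.gl_pathc;
        globfree(&g);
    }
    return n;
}
```

A caller that passes the first kind of result to rsync hands it a literal "vmx*.h" file name; the second kind simply yields an empty list.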

Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]

diff --git a/kernel/Makefile b/kernel/Makefile
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -57,22 +57,22 @@ header-link:
 
 T = $(subst -sync,,$@)-tmp
 
-headers-old = $(LINUX)/./include/asm-$(ARCH_DIR)/kvm*.h
-headers-new = $(LINUX)/arch/$(ARCH_DIR)/include/asm/./kvm*.h \
+headers-old = $(wildcard $(LINUX)/./include/asm-$(ARCH_DIR)/kvm*.h)
+headers-new = $(wildcard $(LINUX)/arch/$(ARCH_DIR)/include/asm/./kvm*.h \
$(LINUX)/arch/$(ARCH_DIR)/include/asm/./vmx*.h \
$(LINUX)/arch/$(ARCH_DIR)/include/asm/./svm*.h \
-   $(LINUX)/arch/$(ARCH_DIR)/include/asm/./virtext*.h
+   $(LINUX)/arch/$(ARCH_DIR)/include/asm/./virtext*.h)
 
 header-sync:
rm -rf $T
rsync -R \
 $(LINUX)/./include/linux/kvm*.h \
-$(if $(wildcard $(headers-old)), $(headers-old)) \
- $T/
-   $(if $(wildcard $(headers-new)), \
+$(headers-old) \
+$T/
+   $(if $(headers-new), \
rsync -R \
 $(headers-new) \
- $T/include/asm-$(ARCH_DIR)/)
+$T/include/asm-$(ARCH_DIR)/)
 
for i in $$(find $T -name '*.h'); do \
$(call unifdef,$$i); done


RE: [PATCH] kvm: ppc: stop leaking host memory on VM exit

2008-11-24 Thread Liu Yu

Good catch.
 
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of Hollis Blanchard
> Sent: Tuesday, November 25, 2008 1:38 AM
> To: Avi Kivity
> Cc: kvm-ppc; kvm
> Subject: [PATCH] kvm: ppc: stop leaking host memory on VM exit
>
> When the VM exits, we must call put_page() for every page referenced in the
> shadow TLB.
>
> Without this patch, we usually leak 30-50 host pages (120 - 200 KiB with
> 4 KiB pages). The maximum number of pages leaked is the size of our shadow
> TLB, 64 pages.
>
> Signed-off-by: Hollis Blanchard [EMAIL PROTECTED]
> ---
> The obvious question is why didn't we see this before? Basically, we'd never
> looked for it, and since most of our work was in the kernel we always ended
> up rebooting before exhausting host memory.
>
> Since it's such a large leak, and a simple fix, please commit this for
> 2.6.28. This patch does apply to kvm.git with fuzz, but if you prefer I can
> send a separate patch for that later.
>


Re: gettimeofday slow in RHEL4 guests

2008-11-24 Thread David S. Ahern
Some more data on this overhead.

RHEL3 (which is based on the 2.4.21 kernel) gets microsecond resolution
by reading the TSC. Reading the TSC from within a guest is very fast on kvm.

RHEL4 (which is based on the 2.6.9 kernel) allows multiple time sources:
pmtmr (the ACPI power management timer, which is the default), pit, hpet
and tsc.

The pmtmr and pit both do ioport reads to get microsecond resolution
(see read_pmtmr and get_offset_pit, respectively). With the tsc as the
timer source gettimeofday is *very* lightweight, but time drifts very
badly and ntpd cannot acquire a sync. I believe someone is working on
the HPET for guests, and I know from bare metal performance that it is a
much lighter weight time source, but with RHEL4 the HPET breaks the
ability to use the RTC. So, I'm running out of options for reliable and
lightweight time sources.

Any chance the pit or pmtmr options can be optimized a bit?

thanks,

david

PS. yes, I did try the userspace pit and its performance is worse than
the in-kernel PIT.




RE: Weekly KVM Test report, kernel 30d95f ... userspace fc94d1 ...

2008-11-24 Thread Xu, Jiajun
On Monday, November 24, 2008 12:57 AM Avi Kivity wrote:

> Xu, Jiajun wrote:
> > 2. failure to migrate guests with more than 4GB of RAM
> >
> > https://sourceforge.net/tracker/index.php?func=detail&aid=1971512&group_id=180599&atid=893831
> >
>
> Can you retest this?  I successfully migrated a 5G guest (from a 4G
> host to itself; slo...)/

I tried the latest commit, userspace.git 6e63ba19476753595e508713eb9daf559dc50bf6,
with a 64-bit RHEL5.1 guest. My host kernel is 2.6.26.2, and the host has 8GB
of memory and 4GB of swap.
The guest can be live migrated, but afterwards it hits a kernel BUG and prints
a call trace.

Maybe we can compare our environments.

My steps were as follows:
1. qemu-system-x86_64 -incoming tcp:localhost: -m 4096 -net nic,macaddr=00:16:3e:44:1a:a6,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup -hda /share/xvs/var/rhel5u1.img
2. qemu-system-x86_64 -m 4096 -net nic,macaddr=00:16:3e:44:1a:a6,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup -hda /share/xvs/var/rhel5u1.img
3. In the qemu console, type: migrate tcp:localhost:

The call trace messages in guest:
###
Kernel BUG at block/elevator.c:560
invalid opcode:  [1] SMP 
last sysfs file: /block/hda/removable
CPU 0 
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc iscsi_tcp
ib_iser libiscsi scsi_transport_iscsi rdma_ucm ib_ucm ib_srp ib_sdp rdma_cm
ib_cm iw_cm ib_addr ib_local_sa ib_ipoib ib_sa ib_uverbs ib_umad ib_mad ib_core
dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec i2c_core button
battery asus_acpi acpi_memhotplug ac lp floppy pcspkr serio_raw 8139cp 8139too
parport_pc parport mii ide_cd cdrom ata_piix libata sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-53.el5 #1
RIP: 0010:[80134673]  [80134673]
elv_dequeue_request+0x8/0x3c
RSP: 0018:8040ddc0  EFLAGS: 00010046
RAX: 0001 RBX: 81011381b398 RCX: 
RDX: 81011381b398 RSI: 81011381b398 RDI: 81011fb912c0
RBP: 804abe18 R08: 80304108 R09: 0012
R10: 0022 R11:  R12: 
R13: 0001 R14: 0086 R15: 8040deb8
FS:  () GS:80396000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 2ad6f4d0 CR3: 0001126cc000 CR4: 06e0
Process swapper (pid: 0, threadinfo 803c6000, task 802dcae0)
Stack:  8000ae3c 804abe18 804abe50 
 804abd00 0246 8003ba73 8003ba0c
 804abe18 81011fbe5800 8000d2a5 81011fb8c5c0
Call Trace:
 IRQ  [8000ae3c] ide_end_request+0xc6/0xfc
 [8003ba73] ide_dma_intr+0x67/0xab
 [8003ba0c] ide_dma_intr+0x0/0xab
 [8000d2a5] ide_intr+0x16f/0x1df
 [800107a0] handle_IRQ_event+0x29/0x58
 [800b5482] __do_IRQ+0xa4/0x105
 [8006a3bd] do_IRQ+0xe7/0xf5
 [8005b615] ret_from_intr+0x0/0xa
 [80011ca9] __do_softirq+0x53/0xd5
 [8005c2fc] call_softirq+0x1c/0x28
 [8006a53a] do_softirq+0x2c/0x85
 [80068d0e] default_idle+0x0/0x50
 [8005bc8e] apic_timer_interrupt+0x66/0x6c
 EOI  [80068d37] default_idle+0x29/0x50
 [80046f8d] cpu_idle+0x95/0xb8
 [803d1806] start_kernel+0x220/0x225
 [803d1237] _sinittext+0x237/0x23e


Code: 0f 0b 68 25 50 29 80 c2 30 02 48 8b 46 08 48 89 42 08 48 89 
RIP  [80134673] elv_dequeue_request+0x8/0x3c
 RSP 8040ddc0
 0Kernel panic - not syncing: Fatal exception
 BUG: warning at kernel/panic.c:137/panic() (Not tainted)

Call Trace:
 IRQ  [8008ccca] panic+0x1e3/0x1f4
 [80196ae8] do_unblank_screen+0x1b/0x132
 [800631aa] oops_end+0x51/0x53
 [80069689] die+0x3a/0x44
 [80069c37] do_invalid_op+0xad/0xb7
 [80134673] elv_dequeue_request+0x8/0x3c
 [80092dd4] do_timer+0x2e8/0x53c
 [8006c0cc] main_timer_handler+0x23d/0x3f4
 [8005bde9] error_exit+0x0/0x84
 [80134673] elv_dequeue_request+0x8/0x3c
 [8000ae3c] ide_end_request+0xc6/0xfc
 [8003ba73] ide_dma_intr+0x67/0xab
 [8003ba0c] ide_dma_intr+0x0/0xab
 [8000d2a5] ide_intr+0x16f/0x1df
 [800107a0] handle_IRQ_event+0x29/0x58
 [800b5482] __do_IRQ+0xa4/0x105
 [8006a3bd] do_IRQ+0xe7/0xf5
 [8005b615] ret_from_intr+0x0/0xa
 [80011ca9] __do_softirq+0x53/0xd5
 [8005c2fc] call_softirq+0x1c/0x28
 [8006a53a] do_softirq+0x2c/0x85
 [80068d0e] default_idle+0x0/0x50
 [8005bc8e] apic_timer_interrupt+0x66/0x6c
 EOI  [80068d37] default_idle+0x29/0x50
 [80046f8d] cpu_idle+0x95/0xb8
 [803d1806] start_kernel+0x220/0x225
 [803d1237] _sinittext+0x237/0x23e

BUG: warning at drivers/input/serio/i8042.c:846/i8042_panic_blink() (Not
tainted)

Call Trace:
 IRQ  [801ee9b8] 

[ kvm-Bugs-1971512 ] failure to migrate guests with more than 4GB of RAM

2008-11-24 Thread SourceForge.net
Bugs item #1971512, was opened at 2008-05-24 14:45
Message generated for change (Comment added) made by jiajun
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1971512&group_id=180599

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: None
Group: None
Status: Pending
Resolution: Fixed
Priority: 3
Private: No
Submitted By: Marcelo Tosatti (mtosatti)
Assigned to: Anthony Liguori (aliguori)
Summary: failure to migrate guests with more than 4GB of RAM

Initial Comment:

The migration code assumes linear phys_ram_base:

[EMAIL PROTECTED] kvm-userspace.tip]# qemu/x86_64-softmmu/qemu-system-x86_64 
-hda /root/images/marcelo5-io-test.img -m 4097 -net nic,model=rtl8139 -net 
tap,script=/root/iptables/ifup -incoming tcp://0:/
audit_log_user_command(): Connection refused
audit_log_user_command(): Connection refused
migration: memory size mismatch: recv 22032384 mine 4316999680
migrate_incoming_fd failed (rc=232)
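
One observation about the two sizes in that log, offered as arithmetic rather than as a confirmed diagnosis: 4316999680 - 2^32 = 22032384, so the received value equals the local value with everything above the low 32 bits dropped. The helper below is hypothetical and only demonstrates that relationship.

```c
#include <stdint.h>

/* Keep only the low 32 bits of a 64-bit size. */
static uint32_t truncate_to_32(uint64_t size)
{
    return (uint32_t)size;
}
```

Calling truncate_to_32(4316999680ULL) yields 22032384, the "recv" value printed in the mismatch message.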


--

Comment By: Jiajun Xu (jiajun)
Date: 2008-11-24 21:52

Message:
I tried the latest commit, userspace.git
6e63ba19476753595e508713eb9daf559dc50bf6, with a 64-bit RHEL5.1 guest. My
host kernel is 2.6.26.2, and the host has 8GB of memory and 4GB of swap.
The guest can be live migrated, but afterwards it hits a kernel BUG and
prints a call trace.

Maybe we can compare our environments.

My steps were as follows:
1. qemu-system-x86_64 -incoming tcp:localhost: -m 4096 -net nic,macaddr=00:16:3e:44:1a:a6,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup -hda /share/xvs/var/rhel5u1.img
2. qemu-system-x86_64 -m 4096 -net nic,macaddr=00:16:3e:44:1a:a6,model=rtl8139 -net tap,script=/etc/kvm/qemu-ifup -hda /share/xvs/var/rhel5u1.img
3. In the qemu console, type: migrate tcp:localhost:

The call trace messages in guest:
###
Kernel BUG at block/elevator.c:560
invalid opcode:  [1] SMP 
last sysfs file: /block/hda/removable
CPU 0 
Modules linked in: ipv6 autofs4 hidp rfcomm l2cap bluetooth sunrpc iscsi_tcp
ib_iser libiscsi scsi_transport_iscsi rdma_ucm ib_ucm ib_srp ib_sdp rdma_cm
ib_cm iw_cm ib_addr ib_local_sa ib_ipoib ib_sa ib_uverbs ib_umad ib_mad ib_core
dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec i2c_core button
battery asus_acpi acpi_memhotplug ac lp floppy pcspkr serio_raw 8139cp 8139too
parport_pc parport mii ide_cd cdrom ata_piix libata sd_mod scsi_mod ext3 jbd
ehci_hcd ohci_hcd uhci_hcd
Pid: 0, comm: swapper Not tainted 2.6.18-53.el5 #1
RIP: 0010:[80134673]  [80134673]
elv_dequeue_request+0x8/0x3c
RSP: 0018:8040ddc0  EFLAGS: 00010046
RAX: 0001 RBX: 81011381b398 RCX: 
RDX: 81011381b398 RSI: 81011381b398 RDI: 81011fb912c0
RBP: 804abe18 R08: 80304108 R09: 0012
R10: 0022 R11:  R12: 
R13: 0001 R14: 0086 R15: 8040deb8
FS:  () GS:80396000() knlGS:
CS:  0010 DS: 0018 ES: 0018 CR0: 8005003b
CR2: 2ad6f4d0 CR3: 0001126cc000 CR4: 06e0
Process swapper (pid: 0, threadinfo 803c6000, task 802dcae0)
Stack:  8000ae3c 804abe18 804abe50
 804abd00 0246 8003ba73 8003ba0c
 804abe18 81011fbe5800 8000d2a5 81011fb8c5c0
Call Trace:
 IRQ  [8000ae3c] ide_end_request+0xc6/0xfc
 [8003ba73] ide_dma_intr+0x67/0xab
 [8003ba0c] ide_dma_intr+0x0/0xab
 [8000d2a5] ide_intr+0x16f/0x1df
 [800107a0] handle_IRQ_event+0x29/0x58
 [800b5482] __do_IRQ+0xa4/0x105
 [8006a3bd] do_IRQ+0xe7/0xf5
 [8005b615] ret_from_intr+0x0/0xa
 [80011ca9] __do_softirq+0x53/0xd5
 [8005c2fc] call_softirq+0x1c/0x28
 [8006a53a] do_softirq+0x2c/0x85
 [80068d0e] default_idle+0x0/0x50
 [8005bc8e] apic_timer_interrupt+0x66/0x6c
 EOI  [80068d37] default_idle+0x29/0x50
 [80046f8d] cpu_idle+0x95/0xb8
 [803d1806] start_kernel+0x220/0x225
 [803d1237] _sinittext+0x237/0x23e


Code: 0f 0b 68 25 50 29 80 c2 30 02 48 8b 46 08 48 89 42 08 48 89 
RIP  [80134673] elv_dequeue_request+0x8/0x3c
 RSP 8040ddc0
 0Kernel panic - not syncing: Fatal exception
 BUG: warning at kernel/panic.c:137/panic() (Not tainted)

Call Trace:
 IRQ  [8008ccca] panic+0x1e3/0x1f4
 [80196ae8] do_unblank_screen+0x1b/0x132
 [800631aa] oops_end+0x51/0x53
 [80069689] die+0x3a/0x44
 [80069c37] do_invalid_op+0xad/0xb7
 [80134673] elv_dequeue_request+0x8/0x3c
 [80092dd4] do_timer+0x2e8/0x53c
 [8006c0cc] main_timer_handler+0x23d/0x3f4
 [8005bde9] error_exit+0x0/0x84
 [80134673] 

[PATCH 3/5] Figure out device capability

2008-11-24 Thread Sheng Yang
Try to figure out the device capabilities in update_dev_cap(). For now we only
care about the MSI capability.

The function pci_find_cap_offset() was originally written by Allen for Xen.
Note that the function needs root privilege to work, and that it depends on
libpci.

(Update: Make update_dev_cap() more generic.)

Signed-off-by: Allen Kay [EMAIL PROTECTED]
Signed-off-by: Sheng Yang [EMAIL PROTECTED]
---
 qemu/hw/device-assignment.c |   50 +++
 qemu/hw/device-assignment.h |5 
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/qemu/hw/device-assignment.c b/qemu/hw/device-assignment.c
index 786b2f0..f79cc67 100644
--- a/qemu/hw/device-assignment.c
+++ b/qemu/hw/device-assignment.c
@@ -216,6 +216,35 @@ static void assigned_dev_ioport_map(PCIDevice *pci_dev, int region_num,
   (r_dev->v_addrs + region_num));
 }
 
+uint8_t pci_find_cap_offset(struct pci_dev *pci_dev, uint8_t cap)
+{
+    int id;
+    int max_cap = 48;
+    int pos = PCI_CAPABILITY_LIST;
+    int status;
+
+    status = pci_read_byte(pci_dev, PCI_STATUS);
+    if ((status & PCI_STATUS_CAP_LIST) == 0)
+        return 0;
+
+    while (max_cap--) {
+        pos = pci_read_byte(pci_dev, pos);
+        if (pos < 0x40)
+            break;
+
+        pos &= ~3;
+        id = pci_read_byte(pci_dev, pos + PCI_CAP_LIST_ID);
+
+        if (id == 0xff)
+            break;
+        if (id == cap)
+            return pos;
+
+        pos += PCI_CAP_LIST_NEXT;
+    }
+    return 0;
+}
+
 static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
   uint32_t val, int len)
 {
@@ -367,6 +396,25 @@ static int assigned_dev_register_regions(PCIRegion *io_regions,
     return 0;
 }
 
+static void update_dev_cap(AssignedDevice *pci_dev, uint8_t r_bus,
+                           uint8_t r_dev, uint8_t r_func)
+{
+    struct pci_access *pacc;
+    struct pci_dev *pdev;
+    int r;
+
+    pacc = pci_alloc();
+    pci_init(pacc);
+    pdev = pci_get_dev(pacc, 0, r_bus, r_dev, r_func);
+    pci_cleanup(pacc);
+#ifdef KVM_CAP_DEVICE_MSI
+    r = pci_find_cap_offset(pdev, PCI_CAP_ID_MSI);
+    if (r)
+        pci_dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
+#endif
+    pci_free_dev(pdev);
+}
+
 static int get_real_device(AssignedDevice *pci_dev, uint8_t r_bus,
uint8_t r_dev, uint8_t r_func)
 {
@@ -436,6 +484,8 @@ again:
 fclose(f);
 
     dev->region_number = r;
+
+update_dev_cap(pci_dev, r_bus, r_dev, r_func);
 return 0;
 }
 
diff --git a/qemu/hw/device-assignment.h b/qemu/hw/device-assignment.h
index d6caa67..de60988 100644
--- a/qemu/hw/device-assignment.h
+++ b/qemu/hw/device-assignment.h
@@ -29,6 +29,7 @@
 #define __DEVICE_ASSIGNMENT_H__
 
 #include <sys/mman.h>
+#include <pci/pci.h>
 #include "qemu-common.h"
 #include "sys-queue.h"
 #include "pci.h"
@@ -80,6 +81,10 @@ typedef struct {
 unsigned char h_busnr;
 unsigned int h_devfn;
 int bound;
+    struct {
+#define ASSIGNED_DEVICE_CAP_MSI (1 << 0)
+        int available;
+    } cap;
 } AssignedDevice;
 
 typedef struct AssignedDevInfo AssignedDevInfo;
-- 
1.5.4.5
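
The capability walk done by pci_find_cap_offset() in the patch above can be tried without libpci or root privilege by running the same steps over an in-memory copy of a device's 256-byte configuration space. This is a sketch under that assumption, not the patch's code: find_cap_offset() and the CFG_* names are hypothetical, while the offsets (status at 0x06, capability pointer at 0x34, MSI capability ID 0x05) are the standard PCI values.

```c
#include <stdint.h>

#define CFG_STATUS          0x06  /* low byte of the PCI status register */
#define CFG_STATUS_CAP_LIST 0x10  /* "capabilities list present" bit */
#define CFG_CAP_PTR         0x34  /* offset of the first capability pointer */
#define CAP_ID_MSI          0x05  /* MSI capability ID */

/* Returns the config-space offset of capability "cap", or 0 if absent. */
static uint8_t find_cap_offset(const uint8_t cfg[256], uint8_t cap)
{
    int max_cap = 48;  /* bound the walk in case the list loops */
    uint8_t pos;

    if ((cfg[CFG_STATUS] & CFG_STATUS_CAP_LIST) == 0)
        return 0;

    pos = (uint8_t)(cfg[CFG_CAP_PTR] & 0xfc);
    while (max_cap-- && pos >= 0x40) {
        uint8_t id = cfg[pos];            /* capability ID at offset +0 */

        if (id == 0xff)
            break;
        if (id == cap)
            return pos;
        pos = (uint8_t)(cfg[pos + 1] & 0xfc);  /* next pointer at +1 */
    }
    return 0;
}
```

A nonzero result for CAP_ID_MSI corresponds to the case where the patch sets ASSIGNED_DEVICE_CAP_MSI.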



Re: [Patch 4/5] x86_emulator: add the assembler code for three operands

2008-11-24 Thread Guillaume Thouvenin
On Tue, 04 Nov 2008 12:21:30 +0200
Avi Kivity [EMAIL PROTECTED] wrote:

> Guillaume Thouvenin wrote:
> > Add the assembler code for three operands
> >
> > +/* Instruction has three operands */
> > +/* In the switch we only implement case 4 because we know that for shld instruction
> > + * bytes are equal to 4. When eveything will be fine, we will add others cases.
> >
>
> No, shld is defined for 16, 32, and 64 bit operands.  Need to implement
> those too.

I tried something like:

+/* Instruction has three operands */
+/* In the switch we only implement case 4 because we know that for shld instruction
+ * bytes are equal to 4. When eveything will be fine, we will add others cases.
+ */
+#define __emulate_2op_cl(_op,_src,_src2,_dst,_eflags,_by,_bx,_wx,_wy,_lx,_ly,_qx,_qy) \
+    do { \
+        unsigned long _tmp; \
+ \
+        switch((_dst).bytes) { \
+        case 2: \
+            __asm__ __volatile__ ( \
+                _PRE_EFLAGS("0", "5", "2") \
+                "mov %4, %%rcx \n\t" \
+                _op"w %%cl,%3,%1; \n\t" \
+                _POST_EFLAGS("0", "5", "2") \
+                : "=m" (_eflags), "=m" ((_dst).val), \
+                  "=&r" (_tmp) \
+                : _wy ((_src).val) , _wy ((_src2).val), "i" (EFLAGS_MASK) \
+                : "%rcx" ); \
+            break; \
+        case 4: \
+            __asm__ __volatile__ ( \
+                _PRE_EFLAGS("0", "5", "2") \
+                "mov %4, %%rcx \n\t" \
+                _op"l %%cl,%3,%1; \n\t" \
+                _POST_EFLAGS("0", "5", "2") \
+                : "=m" (_eflags), "=m" ((_dst).val), \
+                  "=&r" (_tmp) \
+                : _ly ((_src).val) , _ly ((_src2).val), "i" (EFLAGS_MASK) \
+                : "%rcx" ); \
+            break; \
+        case 8: \
+            __asm__ __volatile__ ( \
+                _PRE_EFLAGS("0", "5", "2") \
+                "mov %4, %%rcx \n\t" \
+                _op"q %%cl,%3,%1; \n\t" \
+                _POST_EFLAGS("0", "5", "2") \
+                : "=m" (_eflags), "=m" ((_dst).val), \
+                  "=&r" (_tmp) \
+                : _ly ((_src).val) , _ly ((_src2).val), "i" (EFLAGS_MASK) \
+                : "%rcx" ); \
+            break; \
+        } \
+    } while (0)
+
+#define emulate_2op_cl(_op, _src, _src2, _dst, _eflags) \
+    __emulate_2op_cl(_op, _src, _src2, _dst, _eflags, \
+                     "b", "r", "b", "r", "b", "r", "b", "r")
+
+

but it doesn't work because shld cannot be used with a size suffix such as
'l' or 'w'. Is the solution to have a single case for all operand sizes, like:

+#define __emulate_2op_cl(_op,_src,_src2,_dst,_eflags,_wx,_wy) \
+    do { \
+        unsigned long _tmp; \
+ \
+        __asm__ __volatile__ ( \
+            _PRE_EFLAGS("0", "5", "2") \
+ 

Re: [PATCH 0 of 2] libcflat test for PowerPC

2008-11-24 Thread Hollis Blanchard
On Sat, 2008-11-22 at 13:17 -0600, Deepa Srinivasan wrote:
> Add Hello world test for libcflat. Also, fix CFLAGS issue in
> config-powerpc.mak.

These look good, except "int main()" should be "int main(void)". I'll
fix that myself and commit.

Thanks.

-- 
Hollis Blanchard
IBM Linux Technology Center

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

