[COMMIT master] device-assignment: Fix off-by-one in header check
From: Alex Williamson alex.william...@redhat.com

Include the first byte at 40h or else access might go to the hardware instead of the emulated config space, resulting in capability loops, since the ordering is different.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 832c236..6d6e657 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -410,7 +410,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
           ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
           (uint16_t) address, val, len);
 
-    if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
+    if (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
         return assigned_device_pci_cap_write_config(d, address, val, len);
     }
 
@@ -456,7 +456,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
         (address >= 0x10 && address <= 0x24) || address == 0x30 ||
         address == 0x34 || address == 0x3c || address == 0x3d ||
-        (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
+        (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
--
To unsubscribe from this list: send the line "unsubscribe kvm-commits" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
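The boundary condition fixed by this commit can be illustrated in isolation. The following is only a sketch, not code from the patch: the helper names and the SKETCH_ constant are made up for illustration, and the 0x40 header size is taken from the "40h" mentioned in the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: the commit message says capabilities start at 40h. */
#define SKETCH_PCI_CONFIG_HEADER_SIZE 0x40

/* Hypothetical helper (not in the patch): true when the access at
 * `address` should be served by the emulated capability space.
 * config_map[i] is nonzero when byte i belongs to an emulated capability. */
static bool sketch_is_emulated_cap(const uint8_t *config_map, uint32_t address)
{
    return address >= SKETCH_PCI_CONFIG_HEADER_SIZE && config_map[address] != 0;
}

/* The pre-fix predicate, kept only to show which byte it misses. */
static bool sketch_is_emulated_cap_old(const uint8_t *config_map, uint32_t address)
{
    return address > SKETCH_PCI_CONFIG_HEADER_SIZE && config_map[address] != 0;
}
```

With a capability mapped at 0x40, the old predicate rejects exactly that first byte, so the access would fall through to the hardware path, while every later byte of the capability is treated the same by both versions.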
[COMMIT master] pci: Remove PCI_CAPABILITY_CONFIG_*
From: Alex Williamson alex.william...@redhat.com

Half of these aren't used anywhere, the other half are wrong. Now that device assignment is trying to match physical hardware offsets for PCI capabilities, we can't round up the MSI and MSI-X length. MSI-X is always 12 bytes. MSI is variable length depending on features, but for the current device assignment implementation, it's always the minimum length of 10 bytes.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 6d6e657..1a90a89 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1302,10 +1302,9 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
      * MSI capability is the 1st capability in capability config */
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) {
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos,
-                           PCI_CAPABILITY_CONFIG_MSI_LENGTH);
-
         /* Only 32-bit/no-mask currently supported */
+        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10);
+
         pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) &
                      PCI_MSI_FLAGS_QMASK);
@@ -1326,8 +1325,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         uint32_t msix_table_entry;
 
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos,
-                           PCI_CAPABILITY_CONFIG_MSIX_LENGTH);
+        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12);
 
         pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) &
diff --git a/hw/pci.h b/hw/pci.h
index 34955d8..d579738 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -122,11 +122,6 @@ enum {
     QEMU_PCI_CAP_MULTIFUNCTION = (1 << QEMU_PCI_CAP_MULTIFUNCTION_BITNR),
 };
 
-#define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
-#define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
-#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
-#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
-
 typedef int (*msix_mask_notifier_func)(PCIDevice *, unsigned vector,
                                        int masked);
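The commit message notes that the MSI capability length depends on features. As a rough sketch of that arithmetic (not part of the patch, and the function name is hypothetical), the length can be derived from the Message Control word. The two flag values below are assumed to match PCI_MSI_FLAGS_64BIT and PCI_MSI_FLAGS_MASKBIT in Linux's pci_regs.h.

```c
#include <stdint.h>

/* Assumed to mirror PCI_MSI_FLAGS_64BIT / PCI_MSI_FLAGS_MASKBIT. */
#define SKETCH_MSI_FLAGS_64BIT   0x0080
#define SKETCH_MSI_FLAGS_MASKBIT 0x0100

/* Hypothetical helper: size in bytes of an MSI capability structure
 * given its Message Control flags. */
static unsigned sketch_msi_cap_len(uint16_t flags)
{
    /* cap ID (1) + next pointer (1) + control (2) + address (4) + data (2) */
    unsigned len = 10;

    if (flags & SKETCH_MSI_FLAGS_64BIT) {
        len += 4;   /* upper 32 bits of the message address */
    }
    if (flags & SKETCH_MSI_FLAGS_MASKBIT) {
        len += 10;  /* 2 reserved bytes + mask bits (4) + pending bits (4) */
    }
    return len;
}
```

Under these assumptions the 32-bit/no-mask case used by device assignment comes out at the 10 bytes hardcoded in the patch, which is why rounding up to 0x10 could claim bytes belonging to a neighbouring capability.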
[COMMIT master] pci: Error on PCI capability collisions
From: Alex Williamson alex.william...@redhat.com

Nothing good can happen when we overlap capabilities.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/pci.c b/hw/pci.c
index b08113d..288d6fd 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1845,6 +1845,20 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
         if (!offset) {
             return -ENOSPC;
         }
+    } else {
+        int i;
+
+        for (i = offset; i < offset + size; i++) {
+            if (pdev->config_map[i]) {
+                fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
+                        "Attempt to add PCI capability %x at offset "
+                        "%x overlaps existing capability %x at offset %x\n",
+                        pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
+                        PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+                        cap_id, offset, pdev->config_map[i], i);
+                return -EFAULT;
+            }
+        }
     }
 
     config = pdev->config + offset;
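The collision check can be tried outside QEMU with a small stand-in. This is a sketch under assumptions: the struct, the function name, and the bounds check against a 256-byte config space are illustrative additions, and recording the capability ID in the map after a successful add is assumed rather than shown in the hunk above.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_CONFIG_SIZE 256

/* Hypothetical stand-in for PCIDevice: one map byte per config byte,
 * holding the ID of the capability that owns that byte (0 = free). */
struct sketch_dev {
    uint8_t config_map[SKETCH_CONFIG_SIZE];
};

/* Returns 0 on success, -EFAULT on overlap, -ENOSPC if the range does
 * not fit (the -ENOSPC bounds check is an assumption of this sketch). */
static int sketch_add_capability(struct sketch_dev *dev, uint8_t cap_id,
                                 unsigned offset, unsigned size)
{
    unsigned i;

    if (offset + size > SKETCH_CONFIG_SIZE) {
        return -ENOSPC;
    }
    for (i = offset; i < offset + size; i++) {
        if (dev->config_map[i]) {
            return -EFAULT;     /* byte already owned by a capability */
        }
    }
    memset(dev->config_map + offset, cap_id, size);
    return 0;
}
```

For example, after adding a 10-byte MSI capability at 0x40, a 12-byte MSI-X capability at 0x48 is rejected, while the same capability at 0x4a is accepted.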
[COMMIT master] device-assignment: Error checking when adding capabilities
From: Alex Williamson alex.william...@redhat.com

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 1a90a89..0ae04de 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1288,7 +1288,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
 {
     AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev);
     PCIRegion *pci_region = dev->real_device.regions;
-    int pos;
+    int ret, pos;
 
     /* Clear initial capabilities pointer and status copied from hw */
     pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0);
@@ -1303,7 +1303,9 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) {
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
         /* Only 32-bit/no-mask currently supported */
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10);
+        if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10)) < 0) {
+            return ret;
+        }
 
         pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) &
@@ -1325,7 +1327,9 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         uint32_t msix_table_entry;
 
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12);
+        if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12)) < 0) {
+            return ret;
+        }
 
         pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) &
[COMMIT master] device-assignment: pass through and stub more PCI caps
From: Alex Williamson alex.william...@redhat.com

Some drivers depend on finding capabilities like power management, PCI express/X, vital product data, or vendor specific fields. Now that we have better capability support, we can pass more of these tables through to the guest. Note that VPD and VNDR are direct pass through capabilities, the rest are mostly empty shells with a few writable bits where necessary.

It may be possible to consolidate dummy capabilities into common files for other drivers to use, but I prefer to leave them here for now as we figure out what bits to handle directly with hardware and what bits are purely emulated.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 0ae04de..50c6408 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -67,6 +67,9 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev,
                                                  uint32_t address,
                                                  uint32_t val, int len);
 
+static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev,
+                                                    uint32_t address, int len);
+
 static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region,
                                        uint32_t addr, int len, uint32_t *val)
 {
@@ -370,11 +373,32 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos)
     return (uint8_t)assigned_dev_pci_read(d, pos, 1);
 }
 
-static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap)
+static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len)
+{
+    AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
+    ssize_t ret;
+    int fd = pci_dev->real_device.config_fd;
+
+again:
+    ret = pwrite(fd, &val, len, pos);
+    if (ret != len) {
+        if ((ret < 0) && (errno == EINTR || errno == EAGAIN))
+            goto again;
+
+        fprintf(stderr, "%s: pwrite failed, ret = %zd errno = %d\n",
+                __func__, ret, errno);
+
+        exit(1);
+    }
+
+    return;
+}
+
+static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
 {
     int id;
     int max_cap = 48;
-    int pos = PCI_CAPABILITY_LIST;
+    int pos = start ? start : PCI_CAPABILITY_LIST;
     int status;
 
     status = assigned_dev_pci_read_byte(d, PCI_STATUS);
@@ -453,10 +477,16 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     ssize_t ret;
     AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
 
+    if (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
+        val = assigned_device_pci_cap_read_config(d, address, len);
+        DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
+              (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
+        return val;
+    }
+
     if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
         (address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d ||
-        (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
+        address == 0x34 || address == 0x3c || address == 0x3d) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
@@ -1251,7 +1281,70 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
 #endif
 #endif
 
-static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address,
+/* There can be multiple VNDR capabilities per device, we need to find the
+ * one that starts closet to the given address without going over. */
+static uint8_t find_vndr_start(PCIDevice *pci_dev, uint32_t address)
+{
+    uint8_t cap, pos;
+
+    for (cap = pos = 0;
+         (pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_VNDR, pos));
+         pos += PCI_CAP_LIST_NEXT) {
+        if (pos <= address) {
+            cap = MAX(pos, cap);
+        }
+    }
+    return cap;
+}
+
+/* Merge the bits set in mask from mval into val.  Both val and mval are
+ * at the same addr offset, pos is the starting offset of the mask. */
+static uint32_t merge_bits(uint32_t val, uint32_t mval, uint8_t addr,
+                           int len, uint8_t pos, uint32_t mask)
+{
+    if (!ranges_overlap(addr, len, pos, 4)) {
+        return val;
+    }
+
+    if (addr >= pos) {
+        mask >>= (addr - pos) * 8;
+    } else {
+        mask <<= (pos - addr) * 8;
+    }
+    mask &= 0xffffffffU >> (4 - len) * 8;
+
+    val &= ~mask;
+    val |= (mval & mask);
+
+    return val;
+}
+
+static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev,
+                                                    uint32_t address, int len)
+{
+    uint8_t cap, cap_id = pci_dev->config_map[address];
+    uint32_t val;
+
+    switch (cap_id) {
+
[COMMIT master] KVM: SVM: Add clean-bit for intercepts, tsc-offset and pause filter count
From: Joerg Roedel joerg.roe...@amd.com

This patch adds the clean-bit for intercepts-vectors, the TSC offset and the pause-filter count to the appropriate places. The IO and MSR permission bitmaps are not subject to this bit.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0904c11..609f661 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -186,6 +186,8 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
                                       bool has_error_code, u32 error_code);
 
 enum {
+        VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
+                            pause filter count */
         VMCB_DIRTY_MAX,
 };
 
@@ -217,6 +219,8 @@ static void recalc_intercepts(struct vcpu_svm *svm)
         struct vmcb_control_area *c, *h;
         struct nested_state *g;
 
+        mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+
         if (!is_guest_mode(&svm->vcpu))
                 return;
 
@@ -854,6 +858,8 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
         }
 
         svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
+
+        mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
 static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -863,6 +869,7 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
         svm->vmcb->control.tsc_offset += adjustment;
         if (is_guest_mode(vcpu))
                 svm->nested.hsave->control.tsc_offset += adjustment;
+        mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
 static void init_vmcb(struct vcpu_svm *svm)
[COMMIT master] KVM: SVM: Add clean-bit for IOPM_BASE and MSRPM_BASE
From: Joerg Roedel joerg.roe...@amd.com

This patch adds the clean bit for the physical addresses of the MSRPM and the IOPM. It does not need to be set in the code because the only place where these values are changed is the nested-svm vmrun and vmexit path. These functions already mark the complete VMCB as dirty.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 609f661..1802f7c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -188,6 +188,7 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
 enum {
         VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
                             pause filter count */
+        VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
         VMCB_DIRTY_MAX,
 };
 
[COMMIT master] KVM: SVM: Add clean-bits infrastructure code
From: Roedel, Joerg joerg.roe...@amd.com

This patch adds the infrastructure for the implementation of the individual clean-bits.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..235dd73 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -79,7 +79,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
         u32 event_inj_err;
         u64 nested_cr3;
         u64 lbr_ctl;
-        u64 reserved_5;
+        u32 clean;
+        u32 reserved_5;
         u64 next_rip;
         u8 reserved_6[816];
 };
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ae943bb..0904c11 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -185,6 +185,28 @@ static int nested_svm_vmexit(struct vcpu_svm *svm);
 static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
                                       bool has_error_code, u32 error_code);
 
+enum {
+        VMCB_DIRTY_MAX,
+};
+
+#define VMCB_ALWAYS_DIRTY_MASK  0U
+
+static inline void mark_all_dirty(struct vmcb *vmcb)
+{
+        vmcb->control.clean = 0;
+}
+
+static inline void mark_all_clean(struct vmcb *vmcb)
+{
+        vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1)
+                               & ~VMCB_ALWAYS_DIRTY_MASK;
+}
+
+static inline void mark_dirty(struct vmcb *vmcb, int bit)
+{
+        vmcb->control.clean &= ~(1 << bit);
+}
+
 static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 {
         return container_of(vcpu, struct vcpu_svm, vcpu);
@@ -973,6 +995,8 @@ static void init_vmcb(struct vcpu_svm *svm)
                 set_intercept(svm, INTERCEPT_PAUSE);
         }
 
+        mark_all_dirty(svm->vmcb);
+
         enable_gif(svm);
 }
 
@@ -1089,6 +1113,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
         if (unlikely(cpu != vcpu->cpu)) {
                 svm->asid_generation = 0;
+                mark_all_dirty(svm->vmcb);
         }
 
 #ifdef CONFIG_X86_64
@@ -2140,6 +2165,8 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
         svm->vmcb->save.cpl = 0;
         svm->vmcb->control.exit_int_info = 0;
 
+        mark_all_dirty(svm->vmcb);
+
         nested_svm_unmap(page);
 
         nested_svm_uninit_mmu_context(&svm->vcpu);
@@ -2351,6 +2378,8 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
 
         enable_gif(svm);
 
+        mark_all_dirty(svm->vmcb);
+
         return true;
 }
 
@@ -3488,6 +3517,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
         if (unlikely(svm->vmcb->control.exit_code ==
                      SVM_EXIT_EXCP_BASE + MC_VECTOR))
                 svm_handle_mce(svm);
+
+        mark_all_clean(svm->vmcb);
 }
 
 #undef R
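The three helpers introduced by this patch amount to a small bitmask protocol: clear a bit when a VMCB area is modified, set all bits after VMRUN. The sketch below reproduces that protocol in a standalone form so it can be exercised outside the kernel. The sk_ and SK_ names and the three-entry enum are illustrative assumptions, not the kernel's definitions.

```c
#include <stdint.h>

/* Illustrative subset of the clean-bit enum; SK_DIRTY_MAX counts the bits. */
enum { SK_INTERCEPTS, SK_PERM_MAP, SK_ASID, SK_DIRTY_MAX };

/* No area is forced dirty in this sketch. */
#define SK_ALWAYS_DIRTY_MASK 0U

/* Hypothetical stand-in for the VMCB control area's clean field. */
struct sk_vmcb {
    uint32_t clean;
};

/* All areas must be reloaded by hardware on the next VMRUN. */
static void sk_mark_all_dirty(struct sk_vmcb *v)
{
    v->clean = 0;
}

/* After VMRUN: every tracked area is clean except the always-dirty ones. */
static void sk_mark_all_clean(struct sk_vmcb *v)
{
    v->clean = ((1U << SK_DIRTY_MAX) - 1) & ~SK_ALWAYS_DIRTY_MASK;
}

/* One area was modified by software, so drop its clean bit. */
static void sk_mark_dirty(struct sk_vmcb *v, int bit)
{
    v->clean &= ~(1U << bit);
}
```

In this model, calling sk_mark_all_clean() and then sk_mark_dirty(&v, SK_ASID) leaves only the intercepts and permission-map bits set, which corresponds to a run where only the ASID changed.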
[COMMIT master] KVM: SVM: Add clean-bit for the ASID
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for the asid in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1802f7c..a3fd9ba 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -189,6 +189,7 @@ enum {
         VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
                             pause filter count */
         VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
+        VMCB_ASID,       /* ASID */
         VMCB_DIRTY_MAX,
 };
 
@@ -1488,6 +1489,8 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
 
         svm->asid_generation = sd->asid_generation;
         svm->vmcb->control.asid = sd->next_asid++;
+
+        mark_dirty(svm->vmcb, VMCB_ASID);
 }
 
 static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)
[COMMIT master] KVM: SVM: Add clean-bit for interrupt state
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for all interrupt related state in the vmcb. This corresponds to vmcb offset 0x60-0x67.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index a3fd9ba..b98092d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -190,10 +190,12 @@ enum {
                             pause filter count */
         VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
         VMCB_ASID,       /* ASID */
+        VMCB_INTR,       /* int_ctl, int_vector */
         VMCB_DIRTY_MAX,
 };
 
-#define VMCB_ALWAYS_DIRTY_MASK  0U
+/* TPR is always written before VMRUN */
+#define VMCB_ALWAYS_DIRTY_MASK  (1U << VMCB_INTR)
 
 static inline void mark_all_dirty(struct vmcb *vmcb)
 {
@@ -2508,6 +2510,8 @@ static int clgi_interception(struct vcpu_svm *svm)
         svm_clear_vintr(svm);
         svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
 
+        mark_dirty(svm->vmcb, VMCB_INTR);
+
         return 1;
 }
 
@@ -2878,6 +2882,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
         kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
         svm_clear_vintr(svm);
         svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
+        mark_dirty(svm->vmcb, VMCB_INTR);
         /*
          * If the user space waits to inject interrupts, exit as soon as
          * possible
@@ -3169,6 +3174,7 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, int irq)
         control->int_ctl &= ~V_INTR_PRIO_MASK;
         control->int_ctl |= V_IRQ_MASK |
                 ((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
+        mark_dirty(svm->vmcb, VMCB_INTR);
 }
 
 static void svm_set_irq(struct kvm_vcpu *vcpu)
[COMMIT master] KVM: SVM: Add clean-bit for control registers
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the CRx clean-bit for the vmcb. This bit covers cr0, cr3, cr4, and efer.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2a63dfa..135727c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -192,6 +192,7 @@ enum {
         VMCB_ASID,       /* ASID */
         VMCB_INTR,       /* int_ctl, int_vector */
         VMCB_NPT,        /* npt_en, nCR3, gPAT */
+        VMCB_CR,         /* CR0, CR3, CR4, EFER */
         VMCB_DIRTY_MAX,
 };
 
@@ -441,6 +442,7 @@ static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
                 efer &= ~EFER_LME;
 
         to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME;
+        mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 }
 
 static int is_external_interrupt(u32 info)
@@ -1338,6 +1340,7 @@ static void update_cr0_intercept(struct vcpu_svm *svm)
         *hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
                 | (gcr0 & SVM_CR0_SELECTIVE_MASK);
 
+        mark_dirty(svm->vmcb, VMCB_CR);
 
         if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
                 clr_cr_intercept(svm, INTERCEPT_CR0_READ);
@@ -1404,6 +1407,7 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
          */
         cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
         svm->vmcb->save.cr0 = cr0;
+        mark_dirty(svm->vmcb, VMCB_CR);
         update_cr0_intercept(svm);
 }
 
@@ -1420,6 +1424,7 @@ static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
                 cr4 |= X86_CR4_PAE;
         cr4 |= host_cr4_mce;
         to_svm(vcpu)->vmcb->save.cr4 = cr4;
+        mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
@@ -3547,6 +3552,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root)
         struct vcpu_svm *svm = to_svm(vcpu);
 
         svm->vmcb->save.cr3 = root;
+        mark_dirty(svm->vmcb, VMCB_CR);
         force_new_asid(vcpu);
 }
 
@@ -3559,6 +3565,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 
         /* Also sync guest cr3 here in case we live migrate */
         svm->vmcb->save.cr3 = vcpu->arch.cr3;
+        mark_dirty(svm->vmcb, VMCB_CR);
 
         force_new_asid(vcpu);
 }
[COMMIT master] KVM: SVM: Add clean-bit for Segments and CPL
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit defined for the cs, ds, ss, and es segments and the current cpl saved in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index bb640ae..85d3350 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -195,6 +195,7 @@ enum {
         VMCB_CR,         /* CR0, CR3, CR4, EFER */
         VMCB_DR,         /* DR6, DR7 */
         VMCB_DT,         /* GDT, IDT */
+        VMCB_SEG,        /* CS, DS, SS, ES, CPL */
         VMCB_DIRTY_MAX,
 };
 
@@ -1457,6 +1458,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
                         = (svm->vmcb->save.cs.attrib
                            >> SVM_SELECTOR_DPL_SHIFT) & 3;
 
+        mark_dirty(svm->vmcb, VMCB_SEG);
 }
 
 static void update_db_intercept(struct kvm_vcpu *vcpu)
[COMMIT master] KVM: SVM: Add clean-bit for LBR state
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for all LBR related state. This includes the debugctl, br_from, br_to, last_excp_from, and last_excp_to msrs.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e5db339..05ae90a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -197,6 +197,7 @@ enum {
         VMCB_DT,         /* GDT, IDT */
         VMCB_SEG,        /* CS, DS, SS, ES, CPL */
         VMCB_CR2,        /* CR2 only */
+        VMCB_LBR,        /* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */
         VMCB_DIRTY_MAX,
 };
 
@@ -2847,6 +2848,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
                         return 1;
 
                 svm->vmcb->save.dbgctl = data;
+                mark_dirty(svm->vmcb, VMCB_LBR);
                 if (data & (1ULL<<0))
                         svm_enable_lbrv(svm);
                 else
[COMMIT master] KVM: SVM: Add clean-bit for CR2 register
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for the cr2 register in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 85d3350..e5db339 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -196,11 +196,12 @@ enum {
         VMCB_DR,         /* DR6, DR7 */
         VMCB_DT,         /* GDT, IDT */
         VMCB_SEG,        /* CS, DS, SS, ES, CPL */
+        VMCB_CR2,        /* CR2 only */
         VMCB_DIRTY_MAX,
 };
 
-/* TPR is always written before VMRUN */
-#define VMCB_ALWAYS_DIRTY_MASK  (1U << VMCB_INTR)
+/* TPR and CR2 are always written before VMRUN */
+#define VMCB_ALWAYS_DIRTY_MASK  ((1U << VMCB_INTR) | (1U << VMCB_CR2))
 
 static inline void mark_all_dirty(struct vmcb *vmcb)
 {
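As a worked example of what the always-dirty mask does to the value written by mark_all_clean(), the sketch below recomputes it for the enum as it stands after this patch. The enum order is taken from the context lines of the patches in this series; the CB_ names and the helper function are illustrative assumptions, not kernel code.

```c
#include <stdint.h>

/* Enum order as visible in the diffs up to and including this patch. */
enum {
    CB_INTERCEPTS, CB_PERM_MAP, CB_ASID, CB_INTR, CB_NPT,
    CB_CR, CB_DR, CB_DT, CB_SEG, CB_CR2,
    CB_DIRTY_MAX,
};

/* TPR (covered by the INTR bit) and CR2 are rewritten before every VMRUN. */
#define CB_ALWAYS_DIRTY_MASK ((1U << CB_INTR) | (1U << CB_CR2))

/* Hypothetical helper: the value mark_all_clean() would store. */
static uint32_t cb_all_clean_value(void)
{
    /* (1 << 10) - 1 = 0x3ff; the mask is 0x008 | 0x200 = 0x208. */
    return ((1U << CB_DIRTY_MAX) - 1) & ~CB_ALWAYS_DIRTY_MASK;
}
```

With ten tracked areas the result is 0x3ff & ~0x208 = 0x1f7, so the INTR and CR2 bits are never reported as clean to the hardware even right after a run.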
[COMMIT master] KVM: MMU: rename 'no_apf' to 'prefault'
From: Xiao Guangrong xiaoguangr...@cn.fujitsu.com

The path taken when 'no_apf = 1' is the speculative path, and a later patch handles this path specially, so 'prefault' describes the parameter better.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cfbcbfa..f7e5066 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -241,7 +241,8 @@ struct kvm_mmu {
         void (*new_cr3)(struct kvm_vcpu *vcpu);
         void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
         unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
-        int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
+        int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err,
+                          bool prefault);
         void (*inject_page_fault)(struct kvm_vcpu *vcpu,
                                   struct x86_exception *fault);
         void (*free)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d75ba1e..4954de9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2284,11 +2284,11 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
         return 1;
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
                          gva_t gva, pfn_t *pfn, bool write, bool *writable);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
-                         bool no_apf)
+                         bool prefault)
 {
         int r;
         int level;
@@ -2310,7 +2310,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
         mmu_seq = vcpu->kvm->mmu_notifier_seq;
         smp_rmb();
 
-        if (try_async_pf(vcpu, no_apf, gfn, v, &pfn, write, &map_writable))
+        if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
                 return 0;
 
         /* mmio */
@@ -2583,7 +2583,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-                                u32 error_code, bool no_apf)
+                                u32 error_code, bool prefault)
 {
         gfn_t gfn;
         int r;
@@ -2599,7 +2599,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
         gfn = gva >> PAGE_SHIFT;
 
         return nonpaging_map(vcpu, gva & PAGE_MASK,
-                             error_code & PFERR_WRITE_MASK, gfn, no_apf);
+                             error_code & PFERR_WRITE_MASK, gfn, prefault);
 }
 
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
@@ -2621,7 +2621,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
         return kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
                          gva_t gva, pfn_t *pfn, bool write, bool *writable)
 {
         bool async;
@@ -2633,7 +2633,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
 
         put_page(pfn_to_page(*pfn));
 
-        if (!no_apf && can_do_async_pf(vcpu)) {
+        if (!prefault && can_do_async_pf(vcpu)) {
                 trace_kvm_try_async_get_page(gva, gfn);
                 if (kvm_find_async_pf_gfn(vcpu, gfn)) {
                         trace_kvm_async_pf_doublefault(gva, gfn);
@@ -2649,7 +2649,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
 }
 
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
-                          bool no_apf)
+                          bool prefault)
 {
         pfn_t pfn;
         int r;
@@ -2673,7 +2673,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
         mmu_seq = vcpu->kvm->mmu_notifier_seq;
         smp_rmb();
 
-        if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn, write, &map_writable))
+        if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
                 return 0;
 
         /* mmio */
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d5a0a11..52b3e91 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -539,7 +539,7 @@ out_gpte_changed:
  * a negative value on error.
  */
 static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
-                             bool no_apf)
+                             bool prefault)
 {
         int write_fault = error_code & PFERR_WRITE_MASK;
         int user_fault = error_code & PFERR_USER_MASK;
@@ -581,7 +581,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
         mmu_seq = vcpu->kvm->mmu_notifier_seq;
         smp_rmb();
 
-        if
[COMMIT master] KVM: SVM: Remove flush_guest_tlb function
From: Joerg Roedel joerg.roe...@amd.com

This function is unused and there is svm_flush_tlb which does the same. So this function can be removed.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 05ae90a..16334bb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -426,11 +426,6 @@ static inline void force_new_asid(struct kvm_vcpu *vcpu)
         to_svm(vcpu)->asid_generation--;
 }
 
-static inline void flush_guest_tlb(struct kvm_vcpu *vcpu)
-{
-        force_new_asid(vcpu);
-}
-
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
[COMMIT master] KVM: MMU: fix accessed bit set on prefault path
From: Xiao Guangrong xiaoguangr...@cn.fujitsu.com

A retried #PF is a speculative path, so don't set the accessed bit.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4954de9..04f9033 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2214,7 +2214,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, u64 *sptep)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-                        int map_writable, int level, gfn_t gfn, pfn_t pfn)
+                        int map_writable, int level, gfn_t gfn, pfn_t pfn,
+                        bool prefault)
 {
         struct kvm_shadow_walk_iterator iterator;
         struct kvm_mmu_page *sp;
@@ -2229,7 +2230,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
                                 pte_access &= ~ACC_WRITE_MASK;
                         mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
                                      0, write, 1, &pt_write,
-                                     level, gfn, pfn, false, map_writable);
+                                     level, gfn, pfn, prefault, map_writable);
                         direct_pte_prefetch(vcpu, iterator.sptep);
                         ++vcpu->stat.pf_fixed;
                         break;
@@ -2321,7 +2322,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
         if (mmu_notifier_retry(vcpu, mmu_seq))
                 goto out_unlock;
         kvm_mmu_free_some_pages(vcpu);
-        r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn);
+        r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
+                         prefault);
         spin_unlock(&vcpu->kvm->mmu_lock);
 
 
@@ -2684,7 +2686,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
                 goto out_unlock;
         kvm_mmu_free_some_pages(vcpu);
         r = __direct_map(vcpu, gpa, write, map_writable,
-                         level, gfn, pfn);
+                         level, gfn, pfn, prefault);
         spin_unlock(&vcpu->kvm->mmu_lock);
 
         return r;
[COMMIT master] KVM: MMU: retry #PF for softmmu
From: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>

Retry #PF for softmmu only when the current vcpu has the same cr3 as the time
when #PF occurs

Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f7e5066..b55d789 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -593,6 +593,7 @@ struct kvm_x86_ops {
 struct kvm_arch_async_pf {
 	u32 token;
 	gfn_t gfn;
+	unsigned long cr3;
 	bool direct_map;
 };
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 04f9033..1a953ac 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2607,9 +2607,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
 {
 	struct kvm_arch_async_pf arch;
+
 	arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
 	arch.gfn = gfn;
 	arch.direct_map = vcpu->arch.mmu.direct_map;
+	arch.cr3 = vcpu->arch.mmu.get_cr3(vcpu);
 
 	return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
 }
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 52b3e91..146b681 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -438,7 +438,8 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			 struct guest_walker *gw,
 			 int user_fault, int write_fault, int hlevel,
-			 int *ptwrite, pfn_t pfn, bool map_writable)
+			 int *ptwrite, pfn_t pfn, bool map_writable,
+			 bool prefault)
 {
 	unsigned access = gw->pt_access;
 	struct kvm_mmu_page *sp = NULL;
@@ -512,7 +513,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
 	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
 		     user_fault, write_fault, dirty, ptwrite, it.level,
-		     gw->gfn, pfn, false, map_writable);
+		     gw->gfn, pfn, prefault, map_writable);
 	FNAME(pte_prefetch)(vcpu, gw, it.sptep);
 
 	return it.sptep;
@@ -568,8 +569,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	 */
 	if (!r) {
 		pgprintk("%s: guest page fault\n", __func__);
-		inject_page_fault(vcpu, &walker.fault);
-		vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
+		if (!prefault) {
+			inject_page_fault(vcpu, &walker.fault);
+			/* reset fork detector */
+			vcpu->arch.last_pt_write_count = 0;
+		}
 		return 0;
 	}
 
@@ -599,7 +603,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 	kvm_mmu_free_some_pages(vcpu);
 	sptep = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
-			     level, &write_pt, pfn, map_writable);
+			     level, &write_pt, pfn, map_writable, prefault);
 	(void)sptep;
 	pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __func__,
 		 sptep, *sptep, write_pt);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed373ba..018bb70 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6183,7 +6183,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 {
 	int r;
 
-	if (!vcpu->arch.mmu.direct_map || !work->arch.direct_map ||
+	if ((vcpu->arch.mmu.direct_map != work->arch.direct_map) ||
 	      is_error_page(work->page))
 		return;
 
@@ -6191,6 +6191,10 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 	if (unlikely(r))
 		return;
 
+	if (!vcpu->arch.mmu.direct_map &&
+	      work->arch.cr3 != vcpu->arch.mmu.get_cr3(vcpu))
+		return;
+
 	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
 }
 
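The decision made in kvm_arch_async_page_ready() can be looked at in isolation. The following is only a minimal sketch, assuming simplified stand-in types (struct vcpu_state, struct apf_work and should_prefault() are hypothetical names, not kernel API), and it leaves out the MMU reload step that sits between the two checks in the real function:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct vcpu_state {
	bool direct_map;	/* true for TDP (NPT/EPT), false for softmmu */
	unsigned long cr3;	/* current guest page-table root */
};

struct apf_work {
	bool direct_map;	/* MMU mode when the async #PF was queued */
	unsigned long cr3;	/* guest cr3 when the async #PF was queued */
	bool error_page;	/* the host could not bring the page in */
};

/* Returns true when the fault may be replayed as a prefault. */
static bool should_prefault(const struct vcpu_state *vcpu,
			    const struct apf_work *work)
{
	/* The MMU mode changed in between, or the page is bad. */
	if (vcpu->direct_map != work->direct_map || work->error_page)
		return false;

	/* softmmu: the recorded gva is only meaningful under the same cr3. */
	if (!vcpu->direct_map && work->cr3 != vcpu->cr3)
		return false;

	return true;
}
```

With a direct map the faulting address is a guest physical address, so the cr3 comparison is skipped there; under shadow paging a different cr3 means the guest may have switched address spaces and the retry is dropped.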
[COMMIT master] KVM: SVM: Use svm_flush_tlb instead of force_new_asid
From: Joerg Roedel <joerg.roe...@amd.com>

This patch replaces all calls to force_new_asid which are intended to flush
the guest-tlb by the more appropriate function svm_flush_tlb. As a side-effect
the force_new_asid function is removed.

Signed-off-by: Joerg Roedel <joerg.roe...@amd.com>
Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 16334bb..b4aad21 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -421,11 +421,6 @@ static inline void invlpga(unsigned long addr, u32 asid)
 	asm volatile (__ex(SVM_INVLPGA) : : "a"(addr), "c"(asid));
 }
 
-static inline void force_new_asid(struct kvm_vcpu *vcpu)
-{
-	to_svm(vcpu)->asid_generation--;
-}
-
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
@@ -999,7 +994,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 		save->cr3 = 0;
 		save->cr4 = 0;
 	}
-	force_new_asid(&svm->vcpu);
+	svm->asid_generation = 0;
 
 	svm->nested.vmcb = 0;
 	svm->vcpu.arch.hflags = 0;
@@ -1419,7 +1414,7 @@ static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 	unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
 	if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
-		force_new_asid(vcpu);
+		svm_flush_tlb(vcpu);
 
 	vcpu->arch.cr4 = cr4;
 	if (!npt_enabled)
@@ -1762,7 +1757,7 @@ static void nested_svm_set_tdp_cr3(struct kvm_vcpu *vcpu,
 
 	svm->vmcb->control.nested_cr3 = root;
 	mark_dirty(svm->vmcb, VMCB_NPT);
-	force_new_asid(vcpu);
+	svm_flush_tlb(vcpu);
 }
 
 static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
@@ -2366,7 +2361,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
 	svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
 	svm->nested.intercept            = nested_vmcb->control.intercept;
 
-	force_new_asid(&svm->vcpu);
+	svm_flush_tlb(&svm->vcpu);
 	svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
 	if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
 		svm->vcpu.arch.hflags |= HF_VINTR_MASK;
@@ -3308,7 +3303,7 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
 
 static void svm_flush_tlb(struct kvm_vcpu *vcpu)
 {
-	force_new_asid(vcpu);
+	to_svm(vcpu)->asid_generation--;
 }
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
@@ -3560,7 +3555,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 
 	svm->vmcb->save.cr3 = root;
 	mark_dirty(svm->vmcb, VMCB_CR);
-	force_new_asid(vcpu);
+	svm_flush_tlb(vcpu);
 }
 
 static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
@@ -3574,7 +3569,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 	svm->vmcb->save.cr3 = vcpu->arch.cr3;
 	mark_dirty(svm->vmcb, VMCB_CR);
 
-	force_new_asid(vcpu);
+	svm_flush_tlb(vcpu);
 }
 
 static int is_disabled(void)
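Why decrementing asid_generation amounts to a TLB flush may not be obvious from the diff alone. The sketch below models the generation scheme in user space; the struct layouts and the flush_all field are simplified assumptions for illustration and do not match the kernel's vcpu_svm or svm_cpu_data definitions:

```c
#include <stdint.h>

/* Hypothetical per-CPU and per-vcpu state, reduced to what the trick needs. */
struct cpu_data {
	uint64_t asid_generation;
	uint32_t next_asid;
	uint32_t max_asid;
};

struct vcpu {
	uint64_t asid_generation;
	uint32_t asid;
	int flush_all;		/* models a "flush all ASIDs" request */
};

/* Hand out a fresh ASID; start a new generation when the pool runs dry. */
static void new_asid(struct vcpu *v, struct cpu_data *sd)
{
	if (sd->next_asid > sd->max_asid) {
		++sd->asid_generation;
		sd->next_asid = 1;
		v->flush_all = 1;
	}
	v->asid_generation = sd->asid_generation;
	v->asid = sd->next_asid++;
}

/* A mismatching generation is all it takes to get a new ASID next time. */
static void flush_tlb(struct vcpu *v)
{
	v->asid_generation--;
}

/* Run before every guest entry. */
static void pre_run(struct vcpu *v, struct cpu_data *sd)
{
	if (v->asid_generation != sd->asid_generation)
		new_asid(v, sd);
}
```

Since TLB entries are tagged with the ASID, entries created under the old ASID are never hit again once the vcpu runs with a new one, which is what makes the cheap decrement equivalent to a flush from the guest's point of view.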
[COMMIT master] KVM: SVM: Implement Flush-By-Asid feature
From: Joerg Roedel <joerg.roe...@amd.com>

This patch adds the new flush-by-asid of upcoming AMD processors to the
KVM-AMD module.

Signed-off-by: Joerg Roedel <joerg.roe...@amd.com>
Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 235dd73..82ecaa3 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -88,6 +88,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 
 #define TLB_CONTROL_DO_NOTHING 0
 #define TLB_CONTROL_FLUSH_ALL_ASID 1
+#define TLB_CONTROL_FLUSH_ASID 3
+#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
 #define V_TPR_MASK 0x0f
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b4aad21..740884b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3158,7 +3158,6 @@ static void pre_svm_run(struct vcpu_svm *svm)
 
 	struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 
-	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
 	/* FIXME: handle wraparound of asid_generation */
 	if (svm->asid_generation != sd->asid_generation)
 		new_asid(svm, sd);
@@ -3303,7 +3302,12 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
 
 static void svm_flush_tlb(struct kvm_vcpu *vcpu)
 {
-	to_svm(vcpu)->asid_generation--;
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
+		svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+	else
+		svm->asid_generation--;
 }
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
@@ -3527,6 +3531,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	svm->next_rip = 0;
 
+	svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+
 	/* if exit due to PF check for async PF */
 	if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR)
 		svm->apf_reason = kvm_read_and_reset_pf_reason();
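The patch also moves the reset of tlb_ctl from before the guest entry to after the exit. A rough user-space sketch of why that ordering matters, assuming a hypothetical struct vcpu and helper names (flush_tlb, after_run) that only mimic the shape of the real code:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_CONTROL_DO_NOTHING     0
#define TLB_CONTROL_FLUSH_ALL_ASID 1
#define TLB_CONTROL_FLUSH_ASID     3

/* Hypothetical vcpu state, reduced to the two fields involved. */
struct vcpu {
	uint8_t tlb_ctl;
	uint64_t asid_generation;
};

/* Request a flush: by ASID if the CPU can do it, else via a new ASID. */
static void flush_tlb(struct vcpu *v, bool has_flush_by_asid)
{
	if (has_flush_by_asid)
		v->tlb_ctl = TLB_CONTROL_FLUSH_ASID;
	else
		v->asid_generation--;
}

/* Called once the guest has exited: the request has been consumed. */
static void after_run(struct vcpu *v)
{
	v->tlb_ctl = TLB_CONTROL_DO_NOTHING;
}
```

If the field were still cleared right before entering the guest, a flush requested between two runs would be overwritten before the hardware ever saw it; clearing it after the run keeps the request alive until it has taken effect.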
[COMMIT master] KVM: VMX: add module parameter to avoid trapping HLT instructions (v5)
From: Anthony Liguori <aligu...@us.ibm.com>

In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.

Signed-off-by: Anthony Liguori <aligu...@us.ibm.com>
Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 42d9590..9642c22 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -297,6 +297,12 @@ enum vmcs_field {
 #define GUEST_INTR_STATE_SMI		0x00000004
 #define GUEST_INTR_STATE_NMI		0x00000008
 
+/* GUEST_ACTIVITY_STATE flags */
+#define GUEST_ACTIVITY_ACTIVE		0
+#define GUEST_ACTIVITY_HLT		1
+#define GUEST_ACTIVITY_SHUTDOWN		2
+#define GUEST_ACTIVITY_WAIT_SIPI	3
+
 /*
  * Exit Qualifications for MOV for Control Register Access
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72cfdb7..5c62ef2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+static int __read_mostly yield_on_hlt = 1;
+module_param(yield_on_hlt, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST				\
 	(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK						\
@@ -1009,6 +1012,17 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
 	vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
+{
+	/* Ensure that we clear the HLT state in the VMCS.  We don't need to
+	 * explicitly skip the instruction because if the HLT state is set, then
+	 * the instruction is already executing and RIP has already been
+	 * advanced. */
+	if (!yield_on_hlt &&
+	    vmcs_read32(GUEST_ACTIVITY_STATE) == GUEST_ACTIVITY_HLT)
+		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 				bool has_error_code, u32 error_code,
 				bool reinject)
@@ -1035,6 +1049,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
+	vmx_clear_hlt(vcpu);
 }
 
 static bool vmx_rdtscp_supported(void)
@@ -1419,7 +1434,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 				&_pin_based_exec_control) < 0)
 		return -EIO;
 
-	min = CPU_BASED_HLT_EXITING |
+	min =
 #ifdef CONFIG_X86_64
 	      CPU_BASED_CR8_LOAD_EXITING |
 	      CPU_BASED_CR8_STORE_EXITING |
@@ -1432,6 +1447,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	      CPU_BASED_MWAIT_EXITING |
 	      CPU_BASED_MONITOR_EXITING |
 	      CPU_BASED_INVLPG_EXITING;
+
+	if (yield_on_hlt)
+		min |= CPU_BASED_HLT_EXITING;
+
 	opt = CPU_BASED_TPR_SHADOW |
 	      CPU_BASED_USE_MSR_BITMAPS |
 	      CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
@@ -2728,7 +2747,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
-	vmcs_write32(GUEST_ACTIVITY_STATE, 0);
+	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
 
@@ -2821,6 +2840,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
 	} else
 		intr |= INTR_TYPE_EXT_INTR;
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
+	vmx_clear_hlt(vcpu);
 }
 
 static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
@@ -2848,6 +2868,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 	}
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
 			INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
+	vmx_clear_hlt(vcpu);
 }
 
 static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
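The two halves of this change (the control-bit selection and the activity-state fixup on event injection) can be modelled without a VMCS. This is only a sketch under assumptions: min_exec_controls() and activity_after_injection() are hypothetical helpers, and the yield_on_hlt value is passed in as a plain argument rather than read from a module parameter:

```c
#include <stdbool.h>
#include <stdint.h>

#define CPU_BASED_HLT_EXITING	0x00000080u	/* bit 7, primary exec controls */
#define GUEST_ACTIVITY_ACTIVE	0
#define GUEST_ACTIVITY_HLT	1

/* HLT exiting is only part of the required controls when yielding on HLT. */
static uint32_t min_exec_controls(uint32_t base, bool yield_on_hlt)
{
	uint32_t min = base;

	if (yield_on_hlt)
		min |= CPU_BASED_HLT_EXITING;
	return min;
}

/*
 * When HLT is not trapped, the guest can sit in the HLT activity state.
 * Injecting an event has to wake it up, so the state goes back to active.
 */
static uint32_t activity_after_injection(uint32_t activity, bool yield_on_hlt)
{
	if (!yield_on_hlt && activity == GUEST_ACTIVITY_HLT)
		return GUEST_ACTIVITY_ACTIVE;
	return activity;
}
```

With the default of 1 the behaviour is unchanged from before the patch; only when the parameter is cleared does the guest execute HLT natively and rely on the fixup during injection.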
[COMMIT master] KVM: Fix OSXSAVE after migration
From: Sheng Yang <sh...@linux.intel.com>

CPUID's OSXSAVE is a mirror of CR4.OSXSAVE bit. We need to update the CPUID
after migration.

KVM-Stable-Tag.
Signed-off-by: Sheng Yang <sh...@linux.intel.com>
Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 018bb70..bb04957 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5585,6 +5585,8 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 
 	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
 	kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
+	if (sregs->cr4 & X86_CR4_OSXSAVE)
+		update_cpuid(vcpu);
 	if (!is_long_mode(vcpu) && is_pae(vcpu)) {
 		load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3);
 		mmu_reset_needed = 1;
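The "mirror" relationship the commit message refers to is between CR4 bit 18 (OSXSAVE) and CPUID.01H:ECX bit 27. A small sketch of that mirroring, assuming a hypothetical helper mirror_osxsave() that operates on a plain ECX value rather than on KVM's cpuid entries:

```c
#include <stdint.h>

#define X86_CR4_OSXSAVE		(1ul << 18)	/* CR4.OSXSAVE */
#define CPUID_1_ECX_OSXSAVE	(1u << 27)	/* CPUID.01H:ECX.OSXSAVE */

/* Recompute the guest-visible ECX of CPUID leaf 1 from the current CR4. */
static uint32_t mirror_osxsave(uint32_t cpuid1_ecx, unsigned long cr4)
{
	cpuid1_ecx &= ~CPUID_1_ECX_OSXSAVE;
	if (cr4 & X86_CR4_OSXSAVE)
		cpuid1_ecx |= CPUID_1_ECX_OSXSAVE;
	return cpuid1_ecx;
}
```

After migration CR4 is restored through the set_sregs ioctl rather than by a guest write, so without the extra call the CPUID view on the destination would still show the bit as clear.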
[COMMIT master] KVM: MMU: Fix incorrect direct page write protection due to ro host page
From: Avi Kivity <a...@redhat.com>

If KVM sees a read-only host page, it will map it as read-only to prevent
breaking a COW.  However, if the page was part of a large guest page, KVM
incorrectly extends the write protection to the entire large page frame
instead of limiting it to the normal host page.

This results in the instantiation of a new shadow page with read-only access.

If this happens for a MOVS instruction that moves memory between two normal
pages, within a single large page frame, and mapped within the guest as a
large page, and if, in addition, the source operand is not writeable in the
host (perhaps due to KSM), then KVM will instantiate a read-only direct
shadow page, instantiate an spte for the source operand, then instantiate
a new read/write direct shadow page and instantiate an spte for the
destination operand.  Since these two sptes are in different shadow pages,
MOVS will never see them at the same time and the guest will not make
progress.

Fix by mapping the direct shadow page read/write, and only marking the
host page read-only.

Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 146b681..5ca9426 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -511,6 +511,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 		link_shadow_page(it.sptep, sp);
 	}
 
+	if (!map_writable)
+		access &= ~ACC_WRITE_MASK;
+
 	mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
 		     user_fault, write_fault, dirty, ptwrite, it.level,
 		     gw->gfn, pfn, prefault, map_writable);
@@ -593,9 +596,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, walker.gfn, pfn);
 
-	if (!map_writable)
-		walker.pte_access &= ~ACC_WRITE_MASK;
-
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
[COMMIT master] KVM: Fix build error on s390 due to missing tlbs_dirty
From: Avi Kivity <a...@redhat.com>

Make it available for all archs.

Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd0da8f..b5021db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,8 +256,8 @@ struct kvm {
 	struct mmu_notifier mmu_notifier;
 	unsigned long mmu_notifier_seq;
 	long mmu_notifier_count;
-	long tlbs_dirty;
 #endif
+	long tlbs_dirty;
 };
 
 /* The guest did something we don't support. */
[COMMIT master] KVM: SVM: Do not report xsave in supported cpuid
From: Joerg Roedel <joerg.roe...@amd.com>

To support xsave properly for the guest the SVM module need
software support for it. As long as this is not present do
not report the xsave as supported feature in cpuid.
As a side-effect this patch moves the bit() helper function
into the x86.h file so that it can be used in svm.c too.

KVM-Stable-Tag.
Signed-off-by: Joerg Roedel <joerg.roe...@amd.com>
Signed-off-by: Avi Kivity <a...@redhat.com>

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 740884b..9b3d166 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3622,6 +3622,10 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
 	switch (func) {
+	case 0x00000001:
+		/* Mask out xsave bit as long as it is not supported by SVM */
+		entry->ecx &= ~(bit(X86_FEATURE_XSAVE));
+		break;
 	case 0x80000001:
 		if (nested)
 			entry->ecx |= (1 << 2); /* Set SVM bit */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5c62ef2..c195260 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4268,11 +4268,6 @@ static int vmx_get_lpage_level(void)
 		return PT_PDPE_LEVEL;
 }
 
-static inline u32 bit(int bitno)
-{
-	return 1 << (bitno & 31);
-}
-
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bb04957..8d76150 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -163,11 +163,6 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
 		vcpu->arch.apf.gfns[i] = ~0;
 }
 
-static inline u32 bit(int bitno)
-{
-	return 1 << (bitno & 31);
-}
-
 static void kvm_on_user_return(struct user_return_notifier *urn)
 {
 	unsigned slot;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2cea414..c600da8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -70,6 +70,11 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
 	return kvm_read_cr0_bits(vcpu, X86_CR0_PG);
 }
 
+static inline u32 bit(int bitno)
+{
+	return 1 << (bitno & 31);
+}
+
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
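The bit() helper turns a kernel feature number (word * 32 + bit) into a mask within its 32-bit CPUID register. A stand-alone sketch of the masking done for leaf 1, using uint32_t in place of the kernel's u32 and a hypothetical mask_xsave() wrapper; the feature number below follows the kernel's cpufeature numbering, where XSAVE is bit 26 of word 4 (CPUID.01H:ECX):

```c
#include <stdint.h>

#define X86_FEATURE_XSAVE	(4 * 32 + 26)	/* word 4 = CPUID.01H:ECX */

/* Reduce a feature number to a mask inside its 32-bit register. */
static inline uint32_t bit(int bitno)
{
	return 1u << (bitno & 31);
}

/* Hide XSAVE from the guest-visible ECX of CPUID leaf 1. */
static uint32_t mask_xsave(uint32_t ecx)
{
	return ecx & ~bit(X86_FEATURE_XSAVE);
}
```

The `& 31` drops the word index, so the same helper works for any feature word as long as the caller applies it to the matching register.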
Re: [PATCH v2 1/4] genirq: Introduce driver-readable IRQ status word
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> Am 12.12.2010 18:29, Thomas Gleixner wrote:
> > Also we should name it different than status, drv_status perhaps, to
> > avoid confusion with the irq_desc status.
>
> OK, will address both in a succeeding round (just waiting for potential
> further comments).

No further comments from my side ATM.

Thanks,

	tglx
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: USB Passthrough 1.1 performance problem...
On 12.12.2010, at 23:31, Erik Brakkee wrote: Jan Kiszka wrote: Are there some tuning parameters I can use or perhaps even kernel configuration paramters on the host to solve this? Cheers Erik Host:Motherboard Supermicro X8DTi-F, Intel Xeon L5630, 12MB OS: Opensuse 11.3 64 bit Guest: OS: Opensuse 11.3 64 bit I can say now that I am giving up on getting this to work. One alternative was to use PCI passthrough the USB hardware, but that didn't work for the USB that was on the motherboard. So I bought a USB PCI card and tried to use PCI passthrough for that. Unfortunately other problems occured there. For one, the problem with 4K alignment. But I could fix that by using the pci=resource_alignment=... kernel parameter. In my grub/menu.lst it says: kernel /vmlinuz-2.6.34.7-0.5-default root=/dev/hsystem/root quiet showopts intel_iommu=on pci=resource_alignment=01:04.0;01:04.1;01:04.2 noirqdebug vga=0x31a The noirqdebug flas was needed to avoid the host from disabling the IRQ (it was a shared IRQ). Using this, I could configure PCI passthrough and start the VM. Also the USB device showed up there. Only it did not work at all. Here is a summary of my journey up until know: The original approach I wanted to use was to pass my old PCI card (WinTV PVR-500) to a VM. This card is a well supported card and has been doing fine for me. Because of the PCI passthrough problems with the wintv card, I decided to try a USB card instead. This gave me a 'ctrl buffer too small' issue that I could solve by taking the source RPM for kvm and applying a known patch from red hat (increasing buffer size from 2048 to 8192). But then I got jerky video, probably due to USB 1.1 issues. To bypass these I could use PCI passthrough for USB. But with the PCI passthrough of this card I am again running into issues probably related to Shared IRQs. So, after all this I am back to square one. 
I have now modified my approach so instead of running a separate minimal host with my old server as a guest, I am now running the old server (same install) on the new hardware, using it as a host. I would definitely be interested in trying this out further in the future. I even tried Xen for a brief moment, only to realize that my host and guest felt slower (slower startup and execution) and much more difficult to handle. From the experience of the last two days fulltime trying to get things working I can only conclude that the following two features would be really important to have: * Extended PCI passthrough support o shared IRQ support Addressed by the series I sent out today. Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500 might work now? What version is this and where can I get this for opensuse? I still have the setup I used for testing with the host OS still installed but not running so it would be really easy to try out new releases of KVM (it is not a serious production server after all but mainly used to run some websites and mailing lists). o supporting cases where memory is not aligned on a 4K boundary Hmm, I'm seeing warnings here when passing through one of my EHCIs, but no fatal errors. In my case, the domain just didn't start. Btw. I was using 0.12.5 on opensuse 11.3 but could only find the sources for 0.12.3 on download.opensuse.org (perhaps I looked wrong) and I patched those for th 4K issue. PCI passthrough also did not work with my wintv pci card with KVM 0.12.5. The source rpm for the 11.3 update channel is here: http://download.opensuse.org/update/11.3/rpm/src/kvm-0.12.5-1.2.1.src.rpm * USB passthrough o support USB 2.0 o support USB 3.0 (but taking one step at a time, 2.0 would also be great). Note that this will not solve any real-time issue (if that is part of your problem). E.g.: While my EHCIs work nicely in PCI-passthrough scenarios, I'm unable to use certain webcams that sooner or later run out of sync. 
Jan Is your point in this case that USB in a VM based on PCI passthrough will always have problems when it comes to more real-time issues or does this only apply to USB passthrough? I can imagine that PCI passthrough is better since it uses hardware support. By the way, I have seen issues in the past whereby the tv card stopped working because of high load on the server running natively so real-time issues also exist apart from virtualization. IIRC the reason that PCI passthrough with EHCI performs as badly as it does is that BARs 4k get passed through using the slow path (trap to qemu, issue MMIO in user space). Unfortunately, EHCI seems to have a 256 byte BAR region usually that is used for some handshaking: 00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller (prog-if 20 [EHCI]) Subsystem: ATI
Re: [PATCH V2] qemu,kvm: Enable user space NMI injection for kvm guest
On 12/10/2010 04:41 PM, Jan Kiszka wrote:
> Am 10.12.2010 08:42, Lai Jiangshan wrote:
>>
>> Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the
>> user space raised them. (example: qemu monitor's "nmi" command)
>>
>> Signed-off-by: Lai Jiangshan <la...@cn.fujitsu.com>
>> ---
>> diff --git a/configure b/configure
>> index 2917874..f6f9362 100755
>> --- a/configure
>> +++ b/configure
>> @@ -1646,6 +1646,9 @@ if test "$kvm" != "no" ; then
>>  #if !defined(KVM_CAP_DESTROY_MEMORY_REGION_WORKS)
>>  #error Missing KVM capability KVM_CAP_DESTROY_MEMORY_REGION_WORKS
>>  #endif
>> +#if !defined(KVM_CAP_USER_NMI)
>> +#error Missing KVM capability KVM_CAP_USER_NMI
>> +#endif
>>  int main(void) { return 0; }
>>  EOF
>>    if test "$kerneldir" != "" ; then
>
> That's what I meant.
>
> We also have a runtime check for KVM_CAP_DESTROY_MEMORY_REGION_WORKS on
> kvm init, but IMHO adding the same for KVM_CAP_USER_NMI would be
> overkill. So...
>
>> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
>> index 7dfc357..755f8c9 100644
>> --- a/target-i386/kvm.c
>> +++ b/target-i386/kvm.c
>> @@ -1417,6 +1417,13 @@ int kvm_arch_get_registers(CPUState *env)
>>
>>  int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
>>  {
>> +    /* Inject NMI */
>> +    if (env->interrupt_request & CPU_INTERRUPT_NMI) {
>> +        env->interrupt_request &= ~CPU_INTERRUPT_NMI;
>> +        DPRINTF("injected NMI\n");
>> +        kvm_vcpu_ioctl(env, KVM_NMI);
>> +    }
>> +
>>      /* Try to inject an interrupt if the guest can accept it */
>>      if (run->ready_for_interrupt_injection &&
>>          (env->interrupt_request & CPU_INTERRUPT_HARD) &&
>
> Acked-by: Jan Kiszka <jan.kis...@siemens.com>

Hi, Avi

Could you apply this patch or give me any comments/suggest?

Thanks,
Lai
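The pre-run hook in the quoted patch consumes a pending NMI request exactly once and hands it to the kernel. A toy model of that consume-once behaviour follows; struct cpu, nmis_sent and the numeric value of CPU_INTERRUPT_NMI are assumptions made for illustration only, and the counter stands in for the real kvm_vcpu_ioctl(env, KVM_NMI) call:

```c
/* Assumed bit value for illustration; not taken from the QEMU headers. */
#define CPU_INTERRUPT_NMI 0x200u

/* Hypothetical, reduced CPU state. */
struct cpu {
	unsigned int interrupt_request;	/* pending request bits */
	int nmis_sent;			/* stands in for the KVM_NMI ioctl */
};

/* Called before every guest entry. */
static void pre_run(struct cpu *env)
{
	if (env->interrupt_request & CPU_INTERRUPT_NMI) {
		/* Clear first so the same request is not injected twice. */
		env->interrupt_request &= ~CPU_INTERRUPT_NMI;
		env->nmis_sent++;
	}
}
```

Other request bits are left untouched, which matches the way the patch places the NMI block ahead of the existing hard-interrupt handling.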
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Avi Kivity <a...@redhat.com> [2010-12-11 09:31:24]:

> On 12/10/2010 07:03 AM, Balbir Singh wrote:
> >>
> >> Scheduler people, please flame me with anything I may have done
> >> wrong, so I can do it right for a next version :)
> >>
> >
> > This is a good problem statement, there are other things to consider
> > as well
> >
> > 1. If a hard limit feature is enabled underneath, donating the
> > timeslice would probably not make too much sense in that case
>
> What's the alternative?
>
> Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
> involves ping-ponging within the guest.  If the scheduler decides to
> schedule the vcpus without any overlap, then the throughput will be
> dictated by the time slice.  If we allow donation, throughput is
> limited by context switch latency.
>

If the vcpu holding the lock runs more and capped, the timeslice
transfer is a heuristic that will not help.

-- 
	Three Cheers,
	Balbir
Re: [PATCH v2 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
On Sun, Dec 12, 2010 at 12:22:40PM +0100, Jan Kiszka wrote:
> The result may look simpler on first glance than v1, but it comes with
> more subtle race scenarios IMO. I thought them through, hopefully
> catching all, but I would appreciate any skeptical review.

Thought about the races till my head hurt, and yes, they
all seem to be handled correctly. FWIW

Reviewed-by: Michael S. Tsirkin <m...@redhat.com>
Re: [PATCH v2 4/4] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
On 12/12/2010 01:22 PM, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
> PCI 2.3 allows to generically disable IRQ sources at device level. This
> enables us to share IRQs of such devices on the host side when passing
> them to a guest.
>
> However, IRQ disabling via the PCI config space is more costly than
> masking the line via disable_irq. Therefore we register the IRQ in
> adaptive mode and switch between line and device level disabling on
> demand.
>
> This feature is optional, user space has to request it explicitly as it
> also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
> way, we can avoid unmasking the interrupt and signaling it if the guest
> masked it via the PCI config space.
>

Looks fine.

> +	ret =IRQ_NONE;
> +

Danger, whitespace error detected.  Initiating self-destruct sequence.

-- 
error compiling committee.c: too many arguments to function
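For readers unfamiliar with the PCI 2.3 mechanism under discussion: the Command register carries an Interrupt Disable bit (bit 10) and the Status register an Interrupt Status bit (bit 3). The sketch below models device-level masking on a shared line with a fake config space; struct pci_cfg and both helper names are hypothetical and this is not the code from the patch under review:

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_INTX_DISABLE	0x400	/* Command register, bit 10 */
#define PCI_STATUS_INTERRUPT		0x08	/* Status register, bit 3 */

/* Hypothetical in-memory copy of the two config-space registers. */
struct pci_cfg {
	uint16_t command;
	uint16_t status;
};

/*
 * Shared-line handler step: if this device is asserting INTx, mask it at
 * device level and report it as ours; otherwise leave it alone so the other
 * devices on the line keep working.
 */
static bool intx_check_and_mask(struct pci_cfg *cfg)
{
	if (!(cfg->status & PCI_STATUS_INTERRUPT))
		return false;
	cfg->command |= PCI_COMMAND_INTX_DISABLE;
	return true;
}

/* Re-enable INTx unless the guest itself asked for it to stay masked. */
static void intx_unmask(struct pci_cfg *cfg, bool guest_masked)
{
	if (!guest_masked)
		cfg->command &= ~PCI_COMMAND_INTX_DISABLE;
}
```

The guest_masked argument reflects the point made in the patch description: user space has to pass along its view of PCI_COMMAND_INTX_DISABLE so that the host does not unmask an interrupt the guest deliberately disabled.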
[PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable
Currently, if the page is not allowed to write, then it can drop
ACC_WRITE_MASK in pte_access, and the direct sp's access is:
gw->pt_access & gw->pte_access
so, it also removes the write access in the direct sp.

There is a problem: if the access of those pages which map through the same
mapping in guest is different in host, it causes host switch direct sp very
frequently.

Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
---
 arch/x86/kvm/mmu.c         |    4 ++--
 arch/x86/kvm/paging_tmpl.h |   11 ++---------
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1a953ac..0c5cad0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1987,6 +1987,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	if (host_writable)
 		spte |= SPTE_HOST_WRITEABLE;
+	else
+		pte_access &= ~ACC_WRITE_MASK;
 
 	spte |= (u64)pfn << PAGE_SHIFT;
 
@@ -2226,8 +2228,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 		if (iterator.level == level) {
 			unsigned pte_access = ACC_ALL;
 
-			if (!map_writable)
-				pte_access &= ~ACC_WRITE_MASK;
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
 				     0, write, 1, &pt_write,
 				     level, gfn, pfn, prefault, map_writable);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 146b681..6ed2c5e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -593,9 +593,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, walker.gfn, pfn);
 
-	if (!map_writable)
-		walker.pte_access &= ~ACC_WRITE_MASK;
-
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
@@ -809,12 +806,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 
 		nr_present++;
 		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
-		if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
-			pte_access &= ~ACC_WRITE_MASK;
-			host_writable = 0;
-		} else {
-			host_writable = 1;
-		}
+		host_writable = !!(sp->spt[i] & SPTE_HOST_WRITEABLE);
+
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
 			 is_dirty_gpte(gpte), PT_PAGE_TABLE_LEVEL, gfn,
 			 spte_to_pfn(sp->spt[i]), true, false,
-- 
1.7.0.4
[PATCH 2/2] KVM: MMU: audit: allow audit more guests at the same time
It only allows to audit one guest in the system since:
- 'audit_point' is a glob variable
- mmu_audit_disable() is called in kvm_mmu_destroy(), so audit is disabled
  after a guest exited

this patch fix those issues then allow to audit more guests at the same time

Signed-off-by: Xiao Guangrong <xiaoguangr...@cn.fujitsu.com>
---
 arch/x86/include/asm/kvm_host.h |    4 ++++
 arch/x86/kvm/mmu.c              |   27 ++++++++++++++-------------
 arch/x86/kvm/mmu_audit.c        |   39 ++++++++++++++++++++++-----------------
 3 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b55d789..6244958 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,6 +460,10 @@ struct kvm_arch {
 	/* fields used by HYPER-V emulation */
 	u64 hv_guest_os_id;
 	u64 hv_hypercall;
+
+	#ifdef CONFIG_KVM_MMU_AUDIT
+	int audit_point;
+	#endif
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0c5cad0..daa36ba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3532,13 +3532,6 @@ static void mmu_destroy_caches(void)
 		kmem_cache_destroy(mmu_page_header_cache);
 }
 
-void kvm_mmu_module_exit(void)
-{
-	mmu_destroy_caches();
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
-	unregister_shrinker(&mmu_shrinker);
-}
-
 int kvm_mmu_module_init(void)
 {
 	pte_chain_cache = kmem_cache_create("kvm_pte_chain",
@@ -3731,12 +3724,6 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_get_spte_hierarchy);
 
-#ifdef CONFIG_KVM_MMU_AUDIT
-#include "mmu_audit.c"
-#else
-static void mmu_audit_disable(void) { }
-#endif
-
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	ASSERT(vcpu);
@@ -3744,5 +3731,19 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 	destroy_kvm_mmu(vcpu);
 	free_mmu_pages(vcpu);
 	mmu_free_memory_caches(vcpu);
+}
+
+#ifdef CONFIG_KVM_MMU_AUDIT
+#include "mmu_audit.c"
+#else
+static void mmu_audit_disable(void) { }
+#endif
+
+void kvm_mmu_module_exit(void)
+{
+	mmu_destroy_caches();
+	percpu_counter_destroy(&kvm_total_used_mmu_pages);
+	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index ba2bcdd..5f6223b 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -19,11 +19,9 @@
 
 #include <linux/ratelimit.h>
 
-static int audit_point;
-
-#define audit_printk(fmt, args...)		\
+#define audit_printk(kvm, fmt, args...)		\
 	printk(KERN_ERR "audit: (%s) error: "	\
-		fmt, audit_point_name[audit_point], ##args)
+		fmt, audit_point_name[kvm->arch.audit_point], ##args)
 
 typedef void (*inspect_spte_fn) (struct kvm_vcpu *vcpu, u64 *sptep, int level);
 
@@ -97,18 +95,21 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 
 	if (sp->unsync) {
 		if (level != PT_PAGE_TABLE_LEVEL) {
-			audit_printk("unsync sp: %p level = %d\n", sp, level);
+			audit_printk(vcpu->kvm, "unsync sp: %p "
+				     "level = %d\n", sp, level);
 			return;
 		}
 
 		if (*sptep == shadow_notrap_nonpresent_pte) {
-			audit_printk("notrap spte in unsync sp: %p\n", sp);
+			audit_printk(vcpu->kvm, "notrap spte in unsync "
+				     "sp: %p\n", sp);
 			return;
 		}
 	}
 
 	if (sp->role.direct && *sptep == shadow_notrap_nonpresent_pte) {
-		audit_printk("notrap spte in direct sp: %p\n", sp);
+		audit_printk(vcpu->kvm, "notrap spte in direct sp: %p\n",
+			     sp);
 		return;
 	}
 
@@ -125,8 +126,9 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 
 	hpa =  pfn << PAGE_SHIFT;
 	if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
-		audit_printk("levels %d pfn %llx hpa %llx ent %llxn",
-				   vcpu->arch.mmu.root_level, pfn, hpa, *sptep);
+		audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
+			     "ent %llxn", vcpu->arch.mmu.root_level, pfn,
+			     hpa, *sptep);
 }
 
 static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
@@ -142,8 +144,8 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
 	if (!gfn_to_memslot(kvm, gfn)) {
 		if (!printk_ratelimit())
 			return;
-		audit_printk("no memslot for gfn %llx\n", gfn);
-		audit_printk("index %ld of sp (gfn=%llx)\n",
+		audit_printk(kvm, "no memslot for gfn %llx\n", gfn);
+		audit_printk(kvm, "index %ld of sp (gfn=%llx)\n",
Re: [PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable
On 12/13/2010 12:31 PM, Xiao Guangrong wrote:

Currently, if the page is not allowed to write, then it can drop ACC_WRITE_MASK in pte_access, and the direct sp's access is: gw->pt_access & gw->pte_access, so it also removes the write access in the direct sp. There is a problem: if the access of those pages which are mapped through the same mapping in the guest is different in the host, it causes the host to switch the direct sp very frequently.

I just sent a patch to fix this in a different way, please review it.

-- error compiling committee.c: too many arguments to function
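To make the quoted problem concrete, here is a minimal userspace sketch of how the direct sp's access is derived. The function name direct_sp_access() is hypothetical, and the numeric ACC_* values are assumed to match arch/x86/kvm/mmu.h; only the expression itself comes from this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access bits; the numeric values are an assumption for illustration. */
#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u

/*
 * Hypothetical helper: the access used for a direct shadow page is the
 * intersection of the guest page-table access and the guest pte access,
 * with write access removed when the guest pte is not dirty.
 */
static uint32_t direct_sp_access(uint32_t pt_access, uint32_t pte_access,
                                 bool dirty)
{
    uint32_t access = pt_access & pte_access;

    if (!dirty)
        access &= ~ACC_WRITE_MASK;
    return access;
}
```

If pte_access loses ACC_WRITE_MASK only because the host page is not writable, two pages behind the same guest mapping produce different access values here, and so would need different direct shadow pages. That is the frequent switching the quoted text describes.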
Re: USB Passthrough 1.1 performance problem...
Hi,

I am using a TV card in a VM and get jerky video. As I understand it, the VM is using USB 1.1. However, when I set the USB controller in the BIOS of my server to Fullspeed (12 Mbit/s), which is the USB 1.1 speed, I am able to get perfect results on the host, but the video is still jerky on the guest.

There is a patch series from Hans de Goede on qemu-devel which adds buffering for isochronous USB transfers to the USB passthrough code. Certainly worth a try.

cheers,
Gerd
[GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations
Please merge the following tree for 2.6.38. Thanks!

The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:

  net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

Jason Wang (1):
      vhost: fix typos in comment

Julia Lawall (1):
      drivers/vhost/vhost.c: delete double assignment

Michael S. Tsirkin (9):
      vhost: put mm after thread stop
      vhost-net: batch use/unuse mm
      vhost: copy_to_user -> __copy_to_user
      vhost: get/put_user -> __get/__put_user
      vhost: remove unused include
      vhost: correctly set bits of dirty pages
      vhost: better variable name in logging
      vhost test module
      tools/virtio: virtio_test tool

 drivers/vhost/net.c                  |   9 +-
 drivers/vhost/test.c                 | 320 ++
 drivers/vhost/test.h                 |   7 +
 drivers/vhost/vhost.c                |  44 +++---
 drivers/vhost/vhost.h                |   2 +-
 tools/virtio/Makefile                |  12 ++
 tools/virtio/linux/device.h          |   2 +
 tools/virtio/linux/slab.h            |   2 +
 tools/virtio/linux/virtio.h          | 223 +++
 tools/virtio/vhost_test/Makefile     |   2 +
 tools/virtio/vhost_test/vhost_test.c |   1 +
 tools/virtio/virtio_test.c           | 248 ++
 12 files changed, 842 insertions(+), 30 deletions(-)
 create mode 100644 drivers/vhost/test.c
 create mode 100644 drivers/vhost/test.h
 create mode 100644 tools/virtio/Makefile
 create mode 100644 tools/virtio/linux/device.h
 create mode 100644 tools/virtio/linux/slab.h
 create mode 100644 tools/virtio/linux/virtio.h
 create mode 100644 tools/virtio/vhost_test/Makefile
 create mode 100644 tools/virtio/vhost_test/vhost_test.c
 create mode 100644 tools/virtio/virtio_test.c
Re: [Qemu-devel] SCSI Command support over VirtIO Block device
Hi

2010/12/13 Stefan Hajnoczi stefa...@gmail.com:
On Dec 13, 2010 5:14 AM, अनुज anu...@gmail.com wrote:

Hi, I am trying to implement VirtIO support for a proprietary OS, and it would be great if I were able to process SCSI commands over a VirtIO block device. I tried to execute an INQUIRY command, but the status returned is UNSUPPORTED. If anyone could provide an example VirtIO SCSI command request structure for the INQUIRY command as per VirtIO spec Appendix D, it would be a great help. Also, this paragraph from VirtIO spec 0.8.9 is confusing for me:

"Historically, devices assumed that the fields type, ioprio and sector reside in a single, separate read-only buffer; the fields errors, data_len, sense_len and residual reside in a single, separate write-only buffer; the sense field in a separate write-only buffer of size 96 bytes, by itself; the fields errors, data_len, sense_len and residual in a single write-only buffer; and the status field is a separate readonly buffer of size 1 byte, by itself."

Here, is the 'status field of buffer size 1 byte' read-only or write-only?

Writeonly

I want to know from which version qemu-kvm supports processing SCSI commands over a VirtIO block device as a backend. I did check the host feature fields, and the VIRTIO_BLK_F_SCSI bit is set. I am using qemu-kvm version 0.12.3.

Make sure you have a scsi-generic block device in qemu-kvm, not just a regular file or physical block device. Open /dev/sg.

Yes, I had given a file name instead of /dev/sg0. Now it's working like a charm. That means I can use a physical disk as a VirtIO disk in the guest OS, right? So it's a kind of passthrough for a physical disk. But how can I distinguish among different physical disks attached to the host? Is /dev/sg different for each physical disk? However, I thought VirtIO SCSI device operations were for a virtual disk (a regular file) as well.

Look at hw/virtio-blk.c in qemu-kvm for host implementation details.

Thanks for your help.

Regards
-- Anuj Aggarwal
.''`. : :Ⓐ : # apt-get install hakuna-matata `. `'` `-
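Since the thread asks for an example request layout, here is a rough sketch of how an INQUIRY request might be laid out for a virtio-blk SCSI command, following the field grouping in the spec paragraph quoted above. The struct names, the 36-byte data buffer and build_inquiry() are illustrative assumptions, not code taken from qemu-kvm or from the spec; each member would be placed in its own descriptor, with the direction noted in the comments.

```c
#include <stdint.h>
#include <string.h>

#define VIRTIO_BLK_T_SCSI_CMD 2     /* request type for a SCSI packet command */
#define SCSI_OP_INQUIRY       0x12  /* INQUIRY opcode */
#define SENSE_BUFFER_SIZE     96    /* sense buffer size quoted from the spec */
#define INQUIRY_DATA_LEN      36    /* assumed: standard INQUIRY data length */

/* Read by the device: type, ioprio, sector. */
struct blk_outhdr {
    uint32_t type;
    uint32_t ioprio;
    uint64_t sector;
};

/* Written by the device: errors, data_len, sense_len, residual. */
struct scsi_inhdr {
    uint32_t errors;
    uint32_t data_len;
    uint32_t sense_len;
    uint32_t residual;
};

/* Hypothetical container; every member maps to one descriptor. */
struct inquiry_request {
    struct blk_outhdr out;             /* device reads  */
    uint8_t cdb[6];                    /* device reads  */
    uint8_t data[INQUIRY_DATA_LEN];    /* device writes */
    uint8_t sense[SENSE_BUFFER_SIZE];  /* device writes */
    struct scsi_inhdr in;              /* device writes */
    uint8_t status;                    /* device writes, 1 byte by itself */
};

/* Fill in the driver-owned parts of the request. */
static void build_inquiry(struct inquiry_request *req)
{
    memset(req, 0, sizeof(*req));
    req->out.type = VIRTIO_BLK_T_SCSI_CMD;
    req->cdb[0] = SCSI_OP_INQUIRY;
    req->cdb[4] = INQUIRY_DATA_LEN;    /* allocation length, low byte */
}
```

As noted in the reply above, the host side only passes such a request through when the backend is a scsi-generic device such as /dev/sg0; with a regular file the command comes back as unsupported.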
Re: trace_printk() support in trace-cmd
(2010/12/13 2:47), Avi Kivity wrote:
On 12/12/2010 07:43 PM, Arnaldo Carvalho de Melo wrote:
On Sun, Dec 12, 2010 at 07:42:06PM +0200, Avi Kivity wrote:
On 12/12/2010 07:36 PM, Arnaldo Carvalho de Melo wrote:
On Sun, Dec 12, 2010 at 06:35:24PM +0200, Avi Kivity wrote:
On 11/23/2010 05:45 PM, Steven Rostedt wrote:

Again, the work around is to replace your trace_printks() with __trace_printk(_THIS_IP_, ...) or just modify the trace_printk() macro in include/linux/kernel.h to always use the __trace_printk() version.

This works; I'm using it for now (I tried to use 'perf probe', but I get unpredictable results, like null pointer derefs).

Can you tell us which functions, environment, etc?

Something around 2.6.27-rc4; example functions are FNAME(fetch) in arch/x86/kvm/paging_tmpl.h; compiled modular (which was Steven's guess as to why it fails). (note, the failure is with trace-cmd, not /sys/kernel/debug/tracing).

I mean the "I tried to use 'perf probe'" part.

Well, same, more or less.

perf probe -m kvm --add 'fetch_access=paging64_fetch pt_access=gw->pt_access pte_access=gw->pte_access dirty'

would return garbage for gw->*, and the log would show the exception handler called. gw is most certainly valid.

Thank you for reporting. Hmm, actually, pagefaults could happen on fetching variables. But the argument fetching routines should handle it... I'd like to check it, could you tell me details? For example, that exception log, the kprobe-tracer's event definition (you can see it via debugfs/tracing/kprobe-events) and the result of `perf probe -L paging64_fetch:0-10`.

Best regards,
-- Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu...@hitachi.com
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
On 12/11/2010 03:57 PM, Balbir Singh wrote:
* Avi Kivity a...@redhat.com [2010-12-11 09:31:24]:
On 12/10/2010 07:03 AM, Balbir Singh wrote:

Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :)

This is a good problem statement, there are other things to consider as well:

1. If a hard limit feature is enabled underneath, donating the timeslice would probably not make too much sense in that case

What's the alternative? Consider a two vcpu guest with a 50% hard cap. Suppose the workload involves ping-ponging within the guest. If the scheduler decides to schedule the vcpus without any overlap, then the throughput will be dictated by the time slice. If we allow donation, throughput is limited by context switch latency.

If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help.

Why not? As long as we shift the cap as well.

-- error compiling committee.c: too many arguments to function
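A back-of-the-envelope version of the two-vcpu argument above, as a sketch only. The model is deliberately crude: each ping-pong handoff costs either one full time slice (no overlap, no donation) or one context switch (with donation). The 10 ms and 5 us figures mentioned below are made-up assumptions, not measurements from this thread.

```c
/*
 * Handoffs per second when every handoff between the two vcpus costs
 * `handoff_cost_s` seconds. Crude model: nothing else limits throughput.
 */
static double handoffs_per_second(double handoff_cost_s)
{
    return 1.0 / handoff_cost_s;
}

/* How many times more handoffs donation allows under this model. */
static double donation_speedup(double time_slice_s, double ctx_switch_s)
{
    return handoffs_per_second(ctx_switch_s) /
           handoffs_per_second(time_slice_s);
}
```

With an assumed 10 ms time slice and a 5 us context switch, the model gives roughly 100 handoffs per second without donation against about 200000 with it, which is the gap the paragraph above is pointing at.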
Re: [PATCH] pci: Fix PCI capabilities collision error value
On 12/09/2010 06:16 PM, Alex Williamson wrote:

Signed-off-by: Alex Williamson alex.william...@redhat.com

Applied, thanks.

-- error compiling committee.c: too many arguments to function
Re: [PATCH] [qemu-kvm-next-tree] fix compile error of hw/device-assignment.c
On 12/09/2010 08:43 AM, Wei Yongjun wrote:

Fix the following compile error in next tree:

  CC    x86_64-softmmu/device-assignment.o
hw/device-assignment.c: In function ‘assigned_device_pci_cap_init’:
hw/device-assignment.c:1463: error: ‘PCI_PM_CTRL_NO_SOFT_RST’ undeclared (first use in this function)
hw/device-assignment.c:1463: error: (Each undeclared identifier is reported only once
hw/device-assignment.c:1463: error: for each function it appears in.)

Applied, thanks.

-- error compiling committee.c: too many arguments to function
Re: [PATCH] kvm: cleanup CR8 handling
On 12/08/2010 01:27 PM, Andre Przywara wrote:

The handling of CR8 writes in KVM is currently somewhat cumbersome. This patch makes it look like the other CR register handlers and fixes a possible issue in VMX, where the RIP would be incremented despite an injected #GP.

 unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu)
@@ -4104,7 +4098,7 @@ static int emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu)
 		res = kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val));
 		break;
 	case 8:
-		res = __kvm_set_cr8(vcpu, val & 0xfUL);
+		res = kvm_set_cr8(vcpu, val);
 		break;
 	default:
 		vcpu_printf(vcpu, "%s: unexpected cr %u\n", __func__, cr);

Why drop the mask?

-- error compiling committee.c: too many arguments to function
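The question matters because masking and not masking behave differently for values with bits set above bit 3. A small userspace sketch of the difference follows; set_cr8_checked() and set_cr8_masked() are hypothetical stand-ins, and it is an assumption here that the unmasked setter rejects reserved bits (returning non-zero so the caller can inject #GP).

```c
#define CR8_TPR_MASK 0xfUL  /* low four bits: task priority */

/*
 * Hypothetical setter that rejects reserved bits.
 * Returns 0 on success, 1 if the caller should inject #GP.
 */
static int set_cr8_checked(unsigned long *cr8, unsigned long val)
{
    if (val & ~CR8_TPR_MASK)
        return 1;
    *cr8 = val;
    return 0;
}

/* The old emulator path: mask first, so the write can never fail. */
static int set_cr8_masked(unsigned long *cr8, unsigned long val)
{
    return set_cr8_checked(cr8, val & CR8_TPR_MASK);
}
```

Under that assumption, a write of 0x1f is silently truncated to 0xf by the masked path but faults on the unmasked path, so dropping the mask changes guest-visible behaviour for such values.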
Re: [PATCH 2/5] kvm/svm: enhance MOV CR intercept handler
On 12/10/2010 03:51 PM, Andre Przywara wrote:

Newer SVM implementations provide the GPR number in the VMCB, so that the emulation path is no longer necessary to handle CR register access intercepts. Implement the handling in svm.c and use it when the info is provided.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
 arch/x86/include/asm/svm.h |  2 +
 arch/x86/kvm/svm.c         | 91 ++-
 2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..589fc25 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -256,6 +256,8 @@ struct __attribute__ ((__packed__)) vmcb {
 #define SVM_EXITINFOSHIFT_TS_REASON_JMP 38
 #define SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE 44

+#define SVM_EXITINFO_REG_MASK 0x0F
+
 #define SVM_EXIT_READ_CR0 0x000
 #define SVM_EXIT_READ_CR3 0x003
 #define SVM_EXIT_READ_CR4 0x004
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 298ff79..ee5f100 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2594,12 +2594,81 @@ static int emulate_on_interception(struct vcpu_svm *svm)
 	return emulate_instruction(&svm->vcpu, 0, 0, 0) == EMULATE_DONE;
 }

+static int cr_interception(struct vcpu_svm *svm)
+{
+	int reg, cr;
+	unsigned long val;
+	int err;
+
+	if (!static_cpu_has(X86_FEATURE_DECODEASSISTS))
+		return emulate_on_interception(svm);
+
+	/* bit 63 is the valid bit, as not all instructions (like lmsw)
+	   provide the information */

Please use kernel style comments:

/*
 * text
 * text
 */

Even better, use a name for the bit, which will obviate the need for a comment.

-- error compiling committee.c: too many arguments to function
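One way the "name the bit" suggestion could look, as a sketch. CR_VALID and cr_exit_decode() are hypothetical names chosen for illustration; SVM_EXITINFO_REG_MASK and the meaning of bit 63 come from the quoted patch, while which VMCB field carries the word is not shown in the excerpt and is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define SVM_EXITINFO_REG_MASK 0x0F          /* from the quoted patch */
#define CR_VALID              (1ULL << 63)  /* hypothetical name for bit 63 */

/*
 * Decode the GPR number from an exit information word.
 * Returns false when the hardware did not supply decode info
 * (e.g. for lmsw), in which case the caller falls back to emulation.
 */
static bool cr_exit_decode(uint64_t exit_info, int *reg)
{
    if (!(exit_info & CR_VALID))
        return false;
    *reg = (int)(exit_info & SVM_EXITINFO_REG_MASK);
    return true;
}
```

With the bit named, the test reads as "is the decode info valid?" and the comment in the patch becomes unnecessary.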
Re: [PATCH 3/5] kvm/svm: enhance mov DR intercept handler
On 12/10/2010 03:51 PM, Andre Przywara wrote:

Newer SVM implementations provide the GPR number in the VMCB, so that the emulation path is no longer necessary to handle debug register access intercepts. Implement the handling in svm.c and use it when the info is provided.

+
+	if (!err)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		kvm_inject_gp(&svm->vcpu, 0);
+

This repeats, how about using complete_insn_gp()?

-- error compiling committee.c: too many arguments to function
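For readers who have not seen the helper, the pattern being suggested is roughly the following. This is a userspace sketch with stub types: struct vcpu and the two stub functions are stand-ins that merely count calls, and the body of complete_insn_gp() is an assumption about what such a helper does, not a copy of the kernel's.

```c
/* Stand-in for the vcpu; it only counts what happened to it. */
struct vcpu {
    int skipped;
    int gp_injected;
};

/* Stub: advance past the intercepted instruction. */
static void skip_emulated_instruction(struct vcpu *v)
{
    v->skipped++;
}

/* Stub: queue a #GP with the given error code. */
static void kvm_inject_gp(struct vcpu *v, unsigned int error_code)
{
    (void)error_code;
    v->gp_injected++;
}

/* Finish an intercepted instruction: #GP on error, otherwise skip it. */
static void complete_insn_gp(struct vcpu *v, int err)
{
    if (err)
        kvm_inject_gp(v, 0);
    else
        skip_emulated_instruction(v);
}
```

Each handler then ends with a single complete_insn_gp(vcpu, err) call instead of repeating the if/else, and the instruction pointer is never advanced when a fault is injected.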
Re: [PATCH 5/5] kvm/svm: copy instruction bytes from VMCB
On 12/10/2010 03:51 PM, Andre Przywara wrote:

In case of a nested page fault or an intercepted #PF newer SVM implementations provide a copy of the faulting instruction bytes in the VMCB. Use these bytes to feed the instruction emulator and avoid the costly guest instruction fetch in this case.

+static int svm_prefetch_instruction(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	uint8_t len;
+	struct fetch_cache *fetch;
+
+	len = svm->vmcb->control.insn_len & 0x0F;
+	if (len == 0)
+		return 1;
+
+	fetch = &svm->vcpu.arch.emulate_ctxt.decode.fetch;
+	fetch->start = kvm_rip_read(&svm->vcpu);
+	fetch->end = fetch->start + len;
+	memcpy(fetch->data, svm->vmcb->control.insn_bytes, len);
+
+	return 0;
+}

This reaching into the emulator internals from svm code is not very good. It also assumes ->prefetch_instruction() is called immediately after an exit; this isn't true in vmx and at least was considered for svm (emulating multiple instructions during the nsvm vmexit sequence).

Alternatives are:

- add the insn data to emulate_instruction() and friends (my first suggestion)
- adding x86_decode_insn_init(), which initializes the decode cache, and x86_decode_insn_prefill_cache(), called only if we have the insn data

Another one: teach kvm_fetch_guest_virt() to check if addr/bytes intersects with csbase+rip/len; if so, use that instead of doing the page table dance.

-- error compiling committee.c: too many arguments to function
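The last alternative (check whether the requested range is covered by the bytes the hardware already supplied) could be sketched like this. struct insn_cache and fetch_from_cache() are hypothetical names; this sketch only serves requests that lie entirely inside the cached bytes and leaves partial overlaps to the slow path, which is a simplification of "intersects".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of the instruction bytes supplied at exit time. */
struct insn_cache {
    uint64_t start;     /* linear address of the first byte (csbase + rip) */
    uint8_t  len;       /* number of valid bytes, 0 if none */
    uint8_t  bytes[15]; /* an x86 instruction is at most 15 bytes long */
};

/*
 * Copy `bytes` bytes at `addr` from the cache if the whole range is
 * covered. Returns false when the caller must do a real guest fetch.
 */
static bool fetch_from_cache(const struct insn_cache *c, uint64_t addr,
                             size_t bytes, void *dst)
{
    uint64_t off;

    if (c->len == 0 || bytes == 0 || addr < c->start)
        return false;
    off = addr - c->start;
    if (off >= c->len || bytes > (size_t)(c->len - off))
        return false;
    memcpy(dst, c->bytes + off, bytes);
    return true;
}
```

A fetch path built this way needs no knowledge of when the exit happened beyond keeping the cache valid, which addresses the objection about calling the prefetch hook immediately after an exit.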
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
* Avi Kivity a...@redhat.com [2010-12-13 13:57:37]:
On 12/11/2010 03:57 PM, Balbir Singh wrote:
* Avi Kivity a...@redhat.com [2010-12-11 09:31:24]:
On 12/10/2010 07:03 AM, Balbir Singh wrote:

Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :)

This is a good problem statement, there are other things to consider as well:

1. If a hard limit feature is enabled underneath, donating the timeslice would probably not make too much sense in that case

What's the alternative? Consider a two vcpu guest with a 50% hard cap. Suppose the workload involves ping-ponging within the guest. If the scheduler decides to schedule the vcpus without any overlap, then the throughput will be dictated by the time slice. If we allow donation, throughput is limited by context switch latency.

If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help.

Why not? As long as we shift the cap as well.

Shifting the cap would break it, no? Anyway, that is something for us to keep track of as we add additional heuristics, not a show stopper.

-- Three Cheers, Balbir
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
On 12/13/2010 02:39 PM, Balbir Singh wrote:
* Avi Kivity a...@redhat.com [2010-12-13 13:57:37]:
On 12/11/2010 03:57 PM, Balbir Singh wrote:
* Avi Kivity a...@redhat.com [2010-12-11 09:31:24]:
On 12/10/2010 07:03 AM, Balbir Singh wrote:

Scheduler people, please flame me with anything I may have done wrong, so I can do it right for a next version :)

This is a good problem statement, there are other things to consider as well:

1. If a hard limit feature is enabled underneath, donating the timeslice would probably not make too much sense in that case

What's the alternative? Consider a two vcpu guest with a 50% hard cap. Suppose the workload involves ping-ponging within the guest. If the scheduler decides to schedule the vcpus without any overlap, then the throughput will be dictated by the time slice. If we allow donation, throughput is limited by context switch latency.

If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help.

Why not? As long as we shift the cap as well.

Shifting the cap would break it, no?

The total cap for the guest would remain.

Anyway, that is something for us to keep track of as we add additional heuristics, not a show stopper.

Sure, as long as we see a way to fix it eventually.

-- error compiling committee.c: too many arguments to function
In what order are the CPUs discovered?
Hi

Where can I find out the order in which the CPUs are discovered when having:

- Multiple sockets.
- Multiple cores.
- Hyper-threading (HTT).

E.g. a single socket with two cores and HTT enabled on both cores. This would be 4 CPUs. Would cpu0 and cpu1 be the first core, and cpu2 and cpu3 the second core?

What happens if HTT is disabled on core0 and enabled on core1? Would I see cpu0, cpu2, cpu3? Or would it be cpu0, cpu1, cpu2?

Any suggestions on where I could find information on this would be appreciated.

Thanks
Henry
Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.
On Wed, 2010-12-08 at 17:31 -0200, Marcelo Tosatti wrote:
On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote:
On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:

Usually nobody thinks about that scenario (me included, and especially), but kvmclock can actually be disabled in the host. It happens in two scenarios:

1. host too old.
2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register the kvmclock savevm section. This patch achieves that by registering this section only if kvmclock is actually currently enabled in cpuid. The only caveat is that we have to register the savevm section a little bit later, since we won't know the final kvmclock state before cpuid gets parsed.

What is the problem of registering the section? Restoring the value if the host does not support it returns an error? Can't you ignore the error if kvmclock is not reported in cpuid, in the restore handler?

We can change the restore handler, but not the restore handler of binaries that are already out there. The motivation here is precisely to address migration to hosts without kvmclock, so it's better to have a way to disable it than to count on the fact that the other side will be able to ignore it.

OK. Can't you register conditionally on the kvmclock cpuid bit at the end of kvm_arch_init_vcpu, in target-i386/kvm.c?

Haven't looked at it, but will today. Actually, tsc has (obviously) the same problem and I plan to respin the patch today including a fix for it as well.

Thanks!
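The "register conditionally on the cpuid bit" idea boils down to something like the sketch below. register_kvmclock_section() and maybe_register_kvmclock() are hypothetical stand-ins for the real savevm registration, and the feature word is assumed to be the KVM paravirt feature bits after cpuid has been fully parsed, with KVM_FEATURE_CLOCKSOURCE as bit 0.

```c
#include <stdint.h>

#define KVM_FEATURE_CLOCKSOURCE 0  /* bit number in the KVM feature word */

/* Set to 1 once the (stub) savevm section has been registered. */
static int kvmclock_section_registered;

/* Hypothetical stand-in for the real savevm registration call. */
static void register_kvmclock_section(void)
{
    kvmclock_section_registered = 1;
}

/*
 * Meant to run once the final cpuid state is known, e.g. at the end
 * of vcpu initialisation: register the section only if the guest
 * actually sees kvmclock.
 */
static void maybe_register_kvmclock(uint32_t kvm_features)
{
    if (kvm_features & (1u << KVM_FEATURE_CLOCKSOURCE))
        register_kvmclock_section();
}
```

A guest started with kvmclock disabled then never emits the section, so an older destination binary that cannot ignore it is not handed one.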
[PATCH] KVM: Correct kvm_pio tracepoint count field
Currently, we record '1' for count regardless of the real count. Fix.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d76150..cf5fab1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3948,7 +3948,7 @@ static int emulator_pio_in_emulated(int size, unsigned short port, void *val,
 	if (vcpu->arch.pio.count)
 		goto data_avail;

-	trace_kvm_pio(0, port, size, 1);
+	trace_kvm_pio(0, port, size, count);

 	vcpu->arch.pio.port = port;
 	vcpu->arch.pio.in = 1;
@@ -3976,7 +3976,7 @@ static int emulator_pio_out_emulated(int size, unsigned short port,
 			      const void *val, unsigned int count,
 			      struct kvm_vcpu *vcpu)
 {
-	trace_kvm_pio(1, port, size, 1);
+	trace_kvm_pio(1, port, size, count);

 	vcpu->arch.pio.port = port;
 	vcpu->arch.pio.in = 0;
--
1.7.1
Re: trace_printk() support in trace-cmd
On Sun, 2010-12-12 at 18:10 +0200, Avi Kivity wrote:
On 11/23/2010 12:52 PM, Avi Kivity wrote:

I see a trace_printk() commit in trace-cmd.git. Is that related? If not, I'll work on getting a small sample of the problem.

Sample: http://people.redhat.com/akivity/trace.dat.bz2

You said previously that /debug/tracing/printk_formats was empty? This is the problem. It uses this file to map what the format of the printk is to what is being printed. But if we don't have this mapping, neither trace-cmd nor perf can figure this out.

You are using the latest kernel for this?

What's your work flow? Do you load kvm modules after you start the trace, or are they always loaded?

Are the trace_printk's in the core kernel too, and not being printed?

Thanks,
-- Steve
Re: trace_printk() support in trace-cmd
On 12/13/2010 05:26 PM, Steven Rostedt wrote:
On Sun, 2010-12-12 at 18:10 +0200, Avi Kivity wrote:
On 11/23/2010 12:52 PM, Avi Kivity wrote:

I see a trace_printk() commit in trace-cmd.git. Is that related? If not, I'll work on getting a small sample of the problem.

Sample: http://people.redhat.com/akivity/trace.dat.bz2

You said previously that /debug/tracing/printk_formats was empty?

Still the case.

This is the problem. It uses this file to map what the format of the printk is to what is being printed. But if we don't have this mapping, neither trace-cmd nor perf can figure this out. You are using the latest kernel for this?

2.6.37-rc5 plus a bunch of kvm patches.

What's your work flow? Do you load kvm modules after you start the trace, or are they always loaded?

Loaded on boot.

Are the trace_printk's in the core kernel too, and not being printed?

I don't have any trace_printk()s in the core kernel, only in modules. Perhaps module initialization does not communicate trace_printk formats?

-- error compiling committee.c: too many arguments to function
Re: trace_printk() support in trace-cmd
On Mon, 2010-12-13 at 17:43 +0200, Avi Kivity wrote:

What's your work flow? Do you load kvm modules after you start the trace, or are they always loaded?

Loaded on boot.

Via initramfs?

Are the trace_printk's in the core kernel too, and not being printed?

I don't have any trace_printk()s in the core kernel, only in modules. Perhaps module initialization does not communicate trace_printk formats?

They should. Could you send me a patch that has the trace_printk()s you are using?

Thanks,
-- Steve
Re: [PATCH 2/2] tools/virtio: virtio_test tool
On Mon, Dec 06, 2010 at 02:37:05PM -0200, Thiago Farina wrote:
On Mon, Nov 29, 2010 at 3:16 PM, Michael S. Tsirkin m...@redhat.com wrote:

+#define container_of(ptr, type, member) ({ \
+	const typeof( ((type *)0)->member ) *__mptr = (ptr); \
+	(type *)( (char *)__mptr - offsetof(type,member) );})
+
+#define uninitialized_var(x) x = x
+
+# ifndef likely
+# define likely(x) (__builtin_expect(!!(x), 1))
+# endif
+# ifndef unlikely
+# define unlikely(x) (__builtin_expect(!!(x), 0))
+# endif

It seems you are not using these macros. Do you really need them here?

They are used by virtio that I'm compiling in userspace here.

Can't you include the right linux header files for these macros instead?

Far from trivial, as linux headers aren't intended to be built in userspace; if you try, you get all kinds of conflicts with libc headers etc. If you see a way to do this, pls send me a patch.

-- MST
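For anyone wondering what these stubs buy you in userspace, here is a small self-contained sketch of container_of() and likely() in use. The struct names and request_id_from_node() are invented for the example; the macros mirror the quoted ones, except that __typeof__ is used so the sketch also builds in strict ISO modes. It still relies on GNU C extensions (statement expressions, __builtin_expect), so gcc or clang is assumed.

```c
#include <stddef.h>

/* Map a pointer to a member back to the structure that contains it. */
#define container_of(ptr, type, member) ({ \
	const __typeof__(((type *)0)->member) *__mptr = (ptr); \
	(type *)((char *)__mptr - offsetof(type, member)); })

#ifndef likely
#define likely(x) (__builtin_expect(!!(x), 1))
#endif

/* Invented types: a request embedded in an intrusive list. */
struct list_node {
	struct list_node *next;
};

struct request {
	int id;
	struct list_node node;
};

/* Recover the request from its embedded list node and return its id. */
static int request_id_from_node(struct list_node *n)
{
	struct request *req = container_of(n, struct request, node);

	return likely(req != NULL) ? req->id : -1;
}
```

Kernel code such as the virtio ring uses exactly this kind of embedded-member lookup, which is why the stubs are needed when that code is compiled outside the kernel tree.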
Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting
On 12/11/2010 08:57 AM, Balbir Singh wrote:

If the vcpu holding the lock runs more and is capped, the timeslice transfer is a heuristic that will not help.

That indicates you really need the cap to be per guest, and not per VCPU.

Having one VCPU spin on a lock (and achieve nothing), because the other one cannot give up the lock due to hitting its CPU cap, could lead to showstoppingly bad performance.

-- All rights reversed
Re: trace_printk() support in trace-cmd
On 12/13/2010 06:28 PM, Steven Rostedt wrote: On Mon, 2010-12-13 at 17:43 +0200, Avi Kivity wrote: What's your work flow? Do you load kvm modules after you start the trace, or are they always loaded? Loaded on boot. Via initramfs? No, regular printks. Are the trace_printk's in the core kernel too, and not being printed? I don't have any trace_printk()s in the core kernel, only in modules. Perhaps module initialization does not communicate trace_printk formats? They should. Could you send me a patch that has the trace_printk()s you are using. Attached (with __trace_printk()s, which is what I used). -- error compiling committee.c: too many arguments to function diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index d75ba1e..df86917 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1449,6 +1449,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, if (role.direct) role.cr4_pae = 0; role.access = access; + __trace_printk(_THIS_IP_, + base_role %x access %x role.access %x role %x\n, + vcpu-arch.mmu.base_role, access, role.access, + role.word); if (!vcpu-arch.mmu.direct_map vcpu-arch.mmu.root_level = PT32_ROOT_LEVEL) { quadrant = gaddr (PAGE_SHIFT + (PT64_PT_BITS * level)); @@ -1576,6 +1580,11 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, if (child-role.access == direct_access) return; + __trace_printk(_THIS_IP_, + child-role %x child-role.access %x direct_access %x\n, + child-role.word, child-role.access, + direct_access); + mmu_page_remove_parent_pte(child, sptep); __set_spte(sptep, shadow_trap_nonpresent_pte); kvm_flush_remote_tlbs(vcpu-kvm); diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h index 4f61fbb..1049729 100644 --- a/arch/x86/kvm/paging_tmpl.h +++ b/arch/x86/kvm/paging_tmpl.h @@ -450,6 +450,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr, if (!is_present_gpte(gw-ptes[gw-level - 1])) return NULL; + __trace_printk(_THIS_IP_, pt_access %x pte_access %x dirty %d\n, + 
gw-pt_access, gw-pte_access, dirty); direct_access = gw-pt_access gw-pte_access; if (!dirty) direct_access = ~ACC_WRITE_MASK; @@ -592,6 +594,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code, if (is_error_pfn(pfn)) return kvm_handle_bad_page(vcpu-kvm, walker.gfn, pfn); + __trace_printk(_THIS_IP_, page_fault: map_writeable %x\n, + map_writable); + spin_lock(vcpu-kvm-mmu_lock); if (mmu_notifier_retry(vcpu, mmu_seq)) goto out_unlock; diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 83f5bf6..05481a3 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1015,6 +1015,8 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic, if (unlikely(npages != 1) !atomic) { might_sleep(); + __trace_printk(_THIS_IP_, %s: addr %lx not writeable\n, + __func__, addr); if (writable) *writable = write_fault;
Re: trace_printk() support in trace-cmd
On 12/13/2010 07:05 PM, Avi Kivity wrote:
On 12/13/2010 06:28 PM, Steven Rostedt wrote:
On Mon, 2010-12-13 at 17:43 +0200, Avi Kivity wrote:

What's your work flow? Do you load kvm modules after you start the trace, or are they always loaded?

Loaded on boot.

Via initramfs?

No, regular printks.

Regular modprobe.

-- error compiling committee.c: too many arguments to function
Re: [PATCH 2/2] tools/virtio: virtio_test tool
On Mon, Dec 06, 2010 at 03:23:02PM +1030, Rusty Russell wrote: On Tue, 30 Nov 2010 03:46:37 am Michael S. Tsirkin wrote: This is the userspace part of the tool: it includes a bunch of stubs for linux APIs, somewhat simular to linuxsched. This makes it possible to recompile the ring code in userspace. A small test example is implemented combining this with vhost_test module. Signed-off-by: Michael S. Tsirkin m...@redhat.com Hi Michael, I'm not sure what the point is of this work? You'll still need to benchmark on real systems, but it's not low-level enough to measure things like cache misses. The point is to be able to create easy to test workloads: (just running the single test included here produces a result that seems repeatable to a high degree) while still staying as close as possible to what we might expect in real life. I also want to be able to measure just the overhead of the ring, without involving block or network core in guest and host. In other words, it's a synthetic benchmark. I'm assuming you're thinking of playing with layout to measure cache behaviour. In one example, using this test I saw that different publish used index layouts don't seem to behave at all differently. But I also saw that the extra pointer hasing added by my publish used index patches did add measureable overhead. Plan to look into that. I was thinking of a complete userspace implementation The disadvantage is that any work done there needs to be redone in real life, though. And implementation details often matter. What I did let me actually use the virtio/vhost code that we have and see how it performs. where either it was run under cachegrind, or each access was wrapped to allow tracking of cachelines to give an exact measure of cache movement perf stat not good enough? under various scenarios (esp. ring mostly empty, ring in steady state, ring mostly full). Yes, I do want to add tests to stress various scenarios. Cheers, Rusty. 
-- MST
Re: [GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations
On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
Please merge the following tree for 2.6.38. Thanks!

Um, I sent this out before I noticed the mail from Rusty with some questions on the test code. I missed that and assumed no comments - no issues, perhaps wrongly. Rusty - I tried answering the questions there - any issues with merging this? It's just a test so won't be hard to remove later if it's not helpful ...

The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:

  net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

Jason Wang (1):
      vhost: fix typos in comment

Julia Lawall (1):
      drivers/vhost/vhost.c: delete double assignment

Michael S. Tsirkin (9):
      vhost: put mm after thread stop
      vhost-net: batch use/unuse mm
      vhost: copy_to_user -> __copy_to_user
      vhost: get/put_user -> __get/__put_user
      vhost: remove unused include
      vhost: correctly set bits of dirty pages
      vhost: better variable name in logging
      vhost test module
      tools/virtio: virtio_test tool

 drivers/vhost/net.c                  |    9 +-
 drivers/vhost/test.c                 |  320 ++
 drivers/vhost/test.h                 |    7 +
 drivers/vhost/vhost.c                |   44 +++---
 drivers/vhost/vhost.h                |    2 +-
 tools/virtio/Makefile                |   12 ++
 tools/virtio/linux/device.h          |    2 +
 tools/virtio/linux/slab.h            |    2 +
 tools/virtio/linux/virtio.h          |  223 +++
 tools/virtio/vhost_test/Makefile     |    2 +
 tools/virtio/vhost_test/vhost_test.c |    1 +
 tools/virtio/virtio_test.c           |  248 ++
 12 files changed, 842 insertions(+), 30 deletions(-)
 create mode 100644 drivers/vhost/test.c
 create mode 100644 drivers/vhost/test.h
 create mode 100644 tools/virtio/Makefile
 create mode 100644 tools/virtio/linux/device.h
 create mode 100644 tools/virtio/linux/slab.h
 create mode 100644 tools/virtio/linux/virtio.h
 create mode 100644 tools/virtio/vhost_test/Makefile
 create mode 100644 tools/virtio/vhost_test/vhost_test.c
 create mode 100644 tools/virtio/virtio_test.c
Re: Freezing Windows 2008 x64bit guest
Gleb Natapov gleb at redhat.com writes: On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote: Gleb Natapov wrote: On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote: Gleb Natapov wrote: On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote: Gleb Natapov wrote: On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote:

But one Windows 2008 64 Bit Server Standard is freezing regularly. This happens sometimes 3 times a day, sometimes it takes 2 days until freeze. The Windows Machine is a clean fresh install.

I think I have seen the same problem occur on my Windows 2008 SBS SP2 64bit system, but a bit less often, only like once a week. Now I haven't seen crashes but only freezes with qemu on 100% and virtual system unresponsive.

Does sendkey from monitor work? qemu-kvm-0.11.1 is very old and this is not a total freeze, which is even harder to debug. I don't see anything extraordinary in your logs. 4643 interrupts per second for 4 cpus is normal if windows runs multimedia or other apps that need hi-res timers. Is your host swapping? Is there any chance that you can try upstream qemu-kvm?

I tried running qemu-kvm from git but it exhibited the same problem as 12.x that I tried before, BSODing once in a while, running kernel 2.6.34.1.

That should be a pretty stable config, although it would be nice if you could try running qemu-kvm.git head.

sample BSOD failure details:
These two with Realtek nic and qemu cpu
0x0019 (0x0020, 0xf88007e65970, 0xf88007e65990, 0x0502040f)
0x0019 (0x0020, 0xf88007a414c0, 0xf88007a414e0, 0x0502044c)
These are with e1000 and -cpu host
0x003b (0xc005, 0xf80001c5d842, 0xfa60093ddb70, 0x)
0x003b (0xc005, 0xf80001cb8842, 0xfa600c94ab70, 0x)
0x000a (0x0080, 0x000c, 0x0001, 0xf80001cadefd)

Can you attach screenshots of BSODs? Have you reinstalled your guests or are you running the same images you ran in 11.x?

I'll see if I can analyze minidumps later.
In addition to these there have been as many reboots that have been only logged as 'disruptive shutdown'. Right now I'm running the problematic guest under Xen 3.2.1-something from Debian to see if it works better. -- Harri.

Hello, is there a solution for that problem? I'm experiencing the same problems ever since I installed SBS 2008 on KVM. I was running the host with Ubuntu 10.04 but upgraded to 10.10, mainly because of performance problems which were solved by the upgrade. After the upgrade the system became extremely unstable. It was crashing as soon as disk I/O and network I/O load was growing. 100% reproducible with Windows Server Backup to an iSCSI volume. I had virtio drivers for storage and network installed (redhat/fedora 1.1.11).

At each BSOD I had the following line in the log of the guest:

virtio_ioport_write: unexpected address 0x13 value 0x1

I changed the network interface back to e1000. What I experience now (and I had that at the very beginning, before I switched to virtio network) are freezes. The guest doesn't respond anymore (doesn't answer to pings and doesn't interact via mouse/keyboard anymore). Host CPU usage of the kvm process is 100% on as many cores as there are virtual cpus (in this case 4).

I'm a bit frustrated about this. I have 2 Windows 2003 32bit, 1 Windows XP and 3 Linux guests (2x 32bit, 1x 64bit). They are all running without any problems (except that the Windows XP guest cannot boot without an ntldr cd image). Only the SBS2008 guest regularly freezes. The host system has 2 Intel Xeon 5504, Intel Chipset 5500, Adaptec Raid 5805, 24 GB DDR3 RAM.

I know there is a lack of detailed information right now. I first need to know if anybody is working on this or has similar problems. I can deliver minidumps, and any debugging information you need. I don't want to give up now. We will switch to Hyper-V if we cannot solve this, because we need a stable virtualization platform for Windows guests.
I would like to use KVM; it is so much more flexible.

Best regards, Manfred
Re: Freezing Windows 2008 x64bit guest
On 12/13/2010 09:42 PM, Manfred Heubach wrote: Gleb Natapovglebat redhat.com writes: On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote: Gleb Natapov wrote: On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote: Gleb Natapov kirjoitti: On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote: Gleb Natapov kirjoitti: On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote: But one Windows 2008 64 Bit Server Standard is freezing regularly. This happens sometimes 3 times a day, sometimes it takes 2 days until freeze. The Windows Machine is a clean fresh install. I think I have seen same problem occur on my Windows 2008 SBS SP2 64bit system, but a bit less often, only like once a week. Now I haven't seen crashes but only freezes with qemu on 100% and virtual system unresponsive. Does sendkey from monitor works? qemu-kvm-0.11.1 is very old and this is not total freeze which even harder to debug. I don't see anything extraordinary in your logs. 4643 interrupt per second for 4 cpus is normal if windows runs multimedia or other app that need hi-res timers. Does your host swapping? Is there any chance that you can try upstream qemu-kvm? I tried running qemu-kvm from git but it exhibited the same problem as 12.x that I tried before, BSODing once in a while, running kernel 2.6.34.1. That should be pretty stable config, although it would be nice if you could try running in qemy-kvm.git head. sample BSOD failure details: These two with Realtec nic and qemu cpu 0x0019 (0x0020, 0xf88007e65970, 0xf88007e65990, 0x0502040f) 0x0019 (0x0020, 0xf88007a414c0, 0xf88007a414e0, 0x0502044c) These are with e1000 and -cpu host 0x003b (0xc005, 0xf80001c5d842, 0xfa60093ddb70, 0x) 0x003b (0xc005, 0xf80001cb8842, 0xfa600c94ab70, 0x) 0x000a (0x0080, 0x000c, 0x0001, 0xf80001cadefd) Can you attach screenshots of BSODs? Have you reinstalled your guests or are you running the same images you ran in 11.x? I'll see if I can analyze minidumps later. 
In addition to these there have been as many reboots that have been only logged as 'disruptive shutdown'. Right now I'm running the problematic guest under Xen 3.2.1-something from Debian to see if it works better. -- Harri.

Hello, is there a solution for that problem? I'm experiencing the same problems ever since I installed SBS 2008 on KVM. I was running the host with Ubuntu 10.04 but upgraded to 10.10, mainly because of performance problems which were solved by the upgrade. After the upgrade the system became extremely unstable. It was crashing as soon as disk I/O and network I/O load was growing. 100% reproducible with Windows Server Backup to an iSCSI volume. I had virtio drivers for storage and network installed (redhat/fedora 1.1.11).

Which fedora/rhel release is that? What's the Windows virtio driver version? Have you tried using virt-manager/virsh instead of a raw cmdline? About e1000: some Windows versions come with a buggy driver, and an updated e1000 driver from Intel fixes some issues.

At each BSOD I had the following line in the log of the guest:

virtio_ioport_write: unexpected address 0x13 value 0x1

I changed the network interface back to e1000. What I experience now (and I had that at the very beginning, before I switched to virtio network) are freezes. The guest doesn't respond anymore (doesn't answer to pings and doesn't interact via mouse/keyboard anymore). Host CPU usage of the kvm process is 100% on as many cores as there are virtual cpus (in this case 4).

I'm a bit frustrated about this. I have 2 Windows 2003 32bit, 1 Windows XP and 3 Linux guests (2x 32bit, 1x 64bit). They are all running without any problems (except that the Windows XP guest cannot boot without an ntldr cd image). Only the SBS2008 guest regularly freezes. The host system has 2 Intel Xeon 5504, Intel Chipset 5500, Adaptec Raid 5805, 24 GB DDR3 RAM.

I know there is a lack of detailed information right now. I first need to know if anybody is working on this or has similar problems.
I can deliver minidumps, and any debugging information you need. I don't want to give up now. We will switch to Hyper-V if we cannot solve this, because we need a stable virtualization platform for Windows guests. I would like to use KVM; it is so much more flexible.

Best regards, Manfred
Re: Freezing Windows 2008 x64bit guest
On Mon, 2010-12-13 at 22:12 +0200, Dor Laor wrote: On 12/13/2010 09:42 PM, Manfred Heubach wrote: Gleb Natapovglebat redhat.com writes: On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote: Gleb Natapov wrote: On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote: Gleb Natapov kirjoitti: On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote: Gleb Natapov kirjoitti: On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote: But one Windows 2008 64 Bit Server Standard is freezing regularly. This happens sometimes 3 times a day, sometimes it takes 2 days until freeze. The Windows Machine is a clean fresh install. I think I have seen same problem occur on my Windows 2008 SBS SP2 64bit system, but a bit less often, only like once a week. Now I haven't seen crashes but only freezes with qemu on 100% and virtual system unresponsive. Does sendkey from monitor works? qemu-kvm-0.11.1 is very old and this is not total freeze which even harder to debug. I don't see anything extraordinary in your logs. 4643 interrupt per second for 4 cpus is normal if windows runs multimedia or other app that need hi-res timers. Does your host swapping? Is there any chance that you can try upstream qemu-kvm? I tried running qemu-kvm from git but it exhibited the same problem as 12.x that I tried before, BSODing once in a while, running kernel 2.6.34.1. That should be pretty stable config, although it would be nice if you could try running in qemy-kvm.git head. sample BSOD failure details: These two with Realtec nic and qemu cpu 0x0019 (0x0020, 0xf88007e65970, 0xf88007e65990, 0x0502040f) 0x0019 (0x0020, 0xf88007a414c0, 0xf88007a414e0, 0x0502044c) These are with e1000 and -cpu host 0x003b (0xc005, 0xf80001c5d842, 0xfa60093ddb70, 0x) 0x003b (0xc005, 0xf80001cb8842, 0xfa600c94ab70, 0x) 0x000a (0x0080, 0x000c, 0x0001, 0xf80001cadefd) Can you attach screenshots of BSODs? Have you reinstalled your guests or are you running the same images you ran in 11.x? 
I'll see if I can analyze minidumps later. In addition to these there have been as many reboots that have been only logged as 'disruptive shutdown'. Right now I'm running the problematic guest under Xen 3.2.1-something from Debian to see if it works better. -- Harri. Hello, is there a solution for that problem? I'm experiencing the same problems ever since I installed SBS 2008 on KVM. I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly because of performance problems which were solved by the upgrade. After the upgrade the system became extremly unstable. It was crashing as soon as disk io and network io load was growing. 100% reproduceable with windows server backup to an iscsi volume. i had virtio drivers for storage and network installed (redhat/fedora 1.1.11). Which fedora/rhel release is that? What's the windows virtio driver version? Have you tried using virt-manager/virhs instead of raw cmdline? About e1000, some windows comes with buggy driver and an update e1000 from Intel fixes some issues. At each BSOD I had the following line in the log of the guest: virtio_ioport_write: unexpected address 0x13 value 0x1 I changed the network interface back to e1000. What I experience now (and I had that a the very beginning before i switched to virtio network) are freezes. The guest doesn't respond anymore (doesn't answer to pings and doesn't interact via mouse/keyboard anymore). Host CPU usage of the kvm process is 100% on as many cores as there are virtual cpus (in this case 4). Sounds like an interrupt storm to me. Can you try to ping your VM? Anyway the best way to start debugging a stalled system is just to crash it with BSOD. 
To do that you will need to:
- enable NMICrashDump (please see http://support.microsoft.com/kb/927069 for more information)
- enable Kernel Memory Dump (actually Complete is much better, but it can be too big), see http://support.microsoft.com/kb/969028
- type nmi 0 in the qemu monitor to crash the system the next time it hangs.

Best regards, Vadim.

I'm a bit frustrated about this. I have 2 Windows 2003 32bit, 1 Windows XP and 3 Linux guests (2x 32bit, 1x 64bit). They are all running without any problems (except that the Windows XP guest cannot boot without an ntldr cd image). Only the SBS2008 guest regularly freezes. The host system has 2 Intel Xeon 5504, Intel Chipset 5500, Adaptec Raid 5805, 24 GB DDR3 RAM.

I know there is a lack of detailed information right now. I first need to know if anybody is working on this or has similar problems. I can deliver minidumps,
[RESEND PATCH v3 0/2] Minimal RAM API support
No comments since v3, please apply.

Thanks, Alex

v3:
- Address review comments
- pc registers all memory below 4G in one chunk

Let me know if there are any further issues.

v2:
- Move to Makefile.objs
- Move structures to memory.c and create a callback function
- Fix memory leak

I haven't moved to the state parameter because there should only be a single instance of this per VM. The state parameter seems like it would add complications in setup and function calling, but maybe point me to an example if I'm off base.

v1:

For VFIO based device assignment, we need to know what guest memory areas are actual RAM. RAMBlocks have long since become a grab bag of misc allocations, so aren't effective for this. Anthony has had a RAM API in mind for a while now that addresses this problem. This implements just enough of it so that we have an interface to get actual guest memory physical addresses to set up the host IOMMU. We can continue building a full RAM API on top of this stub.

Anthony, feel free to add copyright to memory.c as it's based on your initial implementation. I had to add something since the file in your branch just copies a header with Fabrice's copyright.

---

Alex Williamson (2):
      RAM API: Make use of it for x86 PC
      Minimal RAM API support

 Makefile.objs |    1 +
 cpu-common.h  |    2 +
 hw/pc.c       |    9 ++---
 memory.c      |   97 +
 memory.h      |   44 ++
 5 files changed, 147 insertions(+), 6 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h
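As an illustration of the interface this cover letter describes (register RAM regions, reject overlaps, walk the registered regions with a callback), here is a minimal standalone sketch. It is not the code from the patch: the names Slot, slot_register, slot_for_each and add_size are hypothetical, it uses a plain singly linked list instead of QLIST, and it returns an error on overlap where the patch calls hw_error().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a registered RAM slot; not QEMU's type. */
typedef struct Slot {
    uint64_t start;       /* guest physical start address */
    uint64_t size;        /* length in bytes, never zero */
    uint64_t offset;      /* offset into the backing allocation */
    struct Slot *next;
} Slot;

typedef int (*slot_fn)(void *opaque, uint64_t start, uint64_t size,
                       uint64_t offset);

static Slot *slots;       /* single global list, one instance per VM */

/* Non-empty ranges overlap when each starts at or before the other's
 * last byte. Comparing last bytes avoids overflow in start + size. */
int overlaps(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a <= b + (blen - 1) && b <= a + (alen - 1);
}

/* Returns 0 on success, -1 for an empty or overlapping range. */
int slot_register(uint64_t start, uint64_t size, uint64_t offset)
{
    Slot *s;

    if (size == 0) {
        return -1;
    }
    for (s = slots; s; s = s->next) {
        if (overlaps(start, size, s->start, s->size)) {
            return -1;
        }
    }
    s = malloc(sizeof(*s));
    if (!s) {
        return -1;
    }
    s->start = start;
    s->size = size;
    s->offset = offset;
    s->next = slots;
    slots = s;
    return 0;
}

/* Calls fn on every slot; stops at the first non-zero return value. */
int slot_for_each(void *opaque, slot_fn fn)
{
    Slot *s;

    assert(fn != NULL);
    for (s = slots; s; s = s->next) {
        int ret = fn(opaque, s->start, s->size, s->offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Example callback: accumulate the total registered size in *opaque. */
int add_size(void *opaque, uint64_t start, uint64_t size, uint64_t offset)
{
    (void)start;
    (void)offset;
    *(uint64_t *)opaque += size;
    return 0;
}
```

A consumer such as an IOMMU setup path would pass its own callback to the iterator in place of add_size; the opaque pointer carries whatever state that callback needs.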
[RESEND PATCH v3 1/2] Minimal RAM API support
This adds a minimum chunk of Anthony's RAM API support so that we can identify actual VM RAM versus all the other things that make use of qemu_ram_alloc.

Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---

 Makefile.objs |    1 +
 cpu-common.h  |    2 +
 memory.c      |   97 +
 memory.h      |   44 ++
 4 files changed, 144 insertions(+), 0 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h

diff --git a/Makefile.objs b/Makefile.objs
index cebb945..47f3c3a 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o
 hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
 hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
 hw-obj-y += watchdog.o
+hw-obj-y += memory.o
 hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
 hw-obj-$(CONFIG_ECC) += ecc.o
 hw-obj-$(CONFIG_NAND) += nand.o
diff --git a/cpu-common.h b/cpu-common.h
index 6d4a898..f08f93b 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -29,6 +29,8 @@ enum device_endian {
 /* address in the RAM (different from a physical address) */
 typedef unsigned long ram_addr_t;
 
+#include "memory.h"
+
 /* memory API */
 
 typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, uint32_t value);
diff --git a/memory.c b/memory.c
new file mode 100644
index 0000000..742776f
--- /dev/null
+++ b/memory.c
@@ -0,0 +1,97 @@
+/*
+ * RAM API
+ *
+ * Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson <alex.william...@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+#include "memory.h"
+#include "range.h"
+
+typedef struct ram_slot {
+    target_phys_addr_t start_addr;
+    ram_addr_t size;
+    ram_addr_t offset;
+    QLIST_ENTRY(ram_slot) next;
+} ram_slot;
+
+static QLIST_HEAD(ram_slots, ram_slot) ram_slots =
+    QLIST_HEAD_INITIALIZER(ram_slots);
+
+static ram_slot *qemu_ram_find_slot(target_phys_addr_t start_addr,
+                                    ram_addr_t size)
+{
+    ram_slot *slot;
+
+    QLIST_FOREACH(slot, &ram_slots, next) {
+        if (slot->start_addr == start_addr && slot->size == size) {
+            return slot;
+        }
+
+        if (ranges_overlap(start_addr, size, slot->start_addr, slot->size)) {
+            hw_error("Ram range overlaps existing slot\n");
+        }
+    }
+
+    return NULL;
+}
+
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+                      ram_addr_t phys_offset)
+{
+    ram_slot *slot;
+
+    if (!size) {
+        return -EINVAL;
+    }
+
+    assert(!qemu_ram_find_slot(start_addr, size));
+
+    slot = qemu_mallocz(sizeof(ram_slot));
+
+    slot->start_addr = start_addr;
+    slot->size = size;
+    slot->offset = phys_offset;
+
+    QLIST_INSERT_HEAD(&ram_slots, slot, next);
+
+    cpu_register_physical_memory(slot->start_addr, slot->size, slot->offset);
+
+    return 0;
+}
+
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size)
+{
+    ram_slot *slot;
+
+    if (!size) {
+        return;
+    }
+
+    slot = qemu_ram_find_slot(start_addr, size);
+    assert(slot != NULL);
+
+    QLIST_REMOVE(slot, next);
+    qemu_free(slot);
+    cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED);
+
+    return;
+}
+
+int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn)
+{
+    ram_slot *slot;
+
+    QLIST_FOREACH(slot, &ram_slots, next) {
+        int ret = fn(opaque, slot->start_addr, slot->size, slot->offset);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
diff --git a/memory.h b/memory.h
new file mode 100644
index 0000000..e7aa5cb
--- /dev/null
+++ b/memory.h
@@ -0,0 +1,44 @@
+#ifndef QEMU_MEMORY_H
+#define QEMU_MEMORY_H
+/*
+ * RAM API
+ *
+ * Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson <alex.william...@redhat.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "cpu-common.h"
+
+typedef int (*qemu_ram_for_each_slot_fn)(void *opaque,
+                                         target_phys_addr_t start_addr,
+                                         ram_addr_t size,
+                                         ram_addr_t phys_offset);
+
+/**
+ * qemu_ram_register() : Register a region of guest physical memory
+ *
+ * The new region must not overlap an existing region.
+ */
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+                      ram_addr_t phys_offset);
+
+/**
+ * qemu_ram_unregister() : Unregister a region of guest physical memory
+ */
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size);
+
+/**
+ * qemu_ram_for_each_slot() : Call fn() on each registered region
+ *
+ * Stop on non-zero return from fn().
+ */
+int qemu_ram_for_each_slot(void *opaque,
[RESEND PATCH v3 2/2] RAM API: Make use of it for x86 PC
Register the actual VM RAM using the new API

Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---

 hw/pc.c |    9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index e1b2667..1554164 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
     /* allocate RAM */
     ram_addr = qemu_ram_alloc(NULL, "pc.ram",
                               below_4g_mem_size + above_4g_mem_size);
-    cpu_register_physical_memory(0, 0xa0000, ram_addr);
-    cpu_register_physical_memory(0x100000,
-                 below_4g_mem_size - 0x100000,
-                 ram_addr + 0x100000);
+    qemu_ram_register(0, below_4g_mem_size, ram_addr);
 #if TARGET_PHYS_ADDR_BITS > 32
     if (above_4g_mem_size > 0) {
-        cpu_register_physical_memory(0x100000000ULL, above_4g_mem_size,
-                                     ram_addr + below_4g_mem_size);
+        qemu_ram_register(0x100000000ULL, above_4g_mem_size,
+                          ram_addr + below_4g_mem_size);
     }
 #endif
Re: [RESEND PATCH v3 1/2] Minimal RAM API support
On 12/13/2010 02:47 PM, Alex Williamson wrote: This adds a minimum chunk of Anthony's RAM API support so that we can identify actual VM RAM versus all the other things that make use of qemu_ram_alloc. Signed-off-by: Alex Williamsonalex.william...@redhat.com --- Makefile.objs |1 + cpu-common.h |2 + memory.c | 97 + memory.h | 44 ++ 4 files changed, 144 insertions(+), 0 deletions(-) create mode 100644 memory.c create mode 100644 memory.h diff --git a/Makefile.objs b/Makefile.objs index cebb945..47f3c3a 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o hw-obj-y += watchdog.o +hw-obj-y += memory.o hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o hw-obj-$(CONFIG_ECC) += ecc.o hw-obj-$(CONFIG_NAND) += nand.o diff --git a/cpu-common.h b/cpu-common.h index 6d4a898..f08f93b 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -29,6 +29,8 @@ enum device_endian { /* address in the RAM (different from a physical address) */ typedef unsigned long ram_addr_t; +#include memory.h + /* memory API */ typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, uint32_t value); diff --git a/memory.c b/memory.c new file mode 100644 index 000..742776f --- /dev/null +++ b/memory.c @@ -0,0 +1,97 @@ +/* + * RAM API + * + * Copyright Red Hat, Inc. 2010 + * + * Authors: + * Alex Williamsonalex.william...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ * + */ +#include memory.h +#include range.h + +typedef struct ram_slot { +target_phys_addr_t start_addr; +ram_addr_t size; +ram_addr_t offset; +QLIST_ENTRY(ram_slot) next; +} ram_slot; + +static QLIST_HEAD(ram_slots, ram_slot) ram_slots = +QLIST_HEAD_INITIALIZER(ram_slots); + +static ram_slot *qemu_ram_find_slot(target_phys_addr_t start_addr, + ram_addr_t size) +{ +ram_slot *slot; + +QLIST_FOREACH(slot,ram_slots, next) { +if (slot-start_addr == start_addr slot-size == size) { +return slot; +} + +if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) { +hw_error(Ram range overlaps existing slot\n); +} +} + +return NULL; +} CODING_STYLE. RamSlot and drop the qemu_ prefix. +int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset) +{ +ram_slot *slot; + +if (!size) { +return -EINVAL; +} + +assert(!qemu_ram_find_slot(start_addr, size)); + +slot = qemu_mallocz(sizeof(ram_slot)); + +slot-start_addr = start_addr; +slot-size = size; +slot-offset = phys_offset; + +QLIST_INSERT_HEAD(ram_slots, slot, next); + +cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset); + +return 0; +} + +void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size) +{ +ram_slot *slot; + +if (!size) { +return; +} + +slot = qemu_ram_find_slot(start_addr, size); +assert(slot != NULL); + +QLIST_REMOVE(slot, next); +qemu_free(slot); +cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED); + +return; +} + +int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn) +{ +ram_slot *slot; + +QLIST_FOREACH(slot,ram_slots, next) { +int ret = fn(opaque, slot-start_addr, slot-size, slot-offset); +if (ret) { +return ret; +} +} +return 0; +} diff --git a/memory.h b/memory.h new file mode 100644 index 000..e7aa5cb --- /dev/null +++ b/memory.h @@ -0,0 +1,44 @@ +#ifndef QEMU_MEMORY_H +#define QEMU_MEMORY_H +/* + * RAM API + * + * Copyright Red Hat, Inc. 
2010 + * + * Authors: + * Alex Williamsonalex.william...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include cpu-common.h + +typedef int (*qemu_ram_for_each_slot_fn)(void *opaque, + target_phys_addr_t start_addr, + ram_addr_t size, + ram_addr_t phys_offset); + +/** + * qemu_ram_register() : Register a region of guest physical memory + * + * The new region must not overlap an existing region. + */ +int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset); + +/** + * qemu_ram_unregister() : Unregister a region of guest physical memory + */ +void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size); + +/** + * qemu_ram_for_each_slot() : Call fn() on each
Re: [Qemu-devel] [RESEND PATCH v3 1/2] Minimal RAM API support
On Mon, Dec 13, 2010 at 8:47 PM, Alex Williamson alex.william...@redhat.com wrote: This adds a minimum chunk of Anthony's RAM API support so that we can identify actual VM RAM versus all the other things that make use of qemu_ram_alloc. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Makefile.objs | 1 + cpu-common.h | 2 + memory.c | 97 + memory.h | 44 ++ 4 files changed, 144 insertions(+), 0 deletions(-) create mode 100644 memory.c create mode 100644 memory.h diff --git a/Makefile.objs b/Makefile.objs index cebb945..47f3c3a 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o hw-obj-y += watchdog.o +hw-obj-y += memory.o hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o hw-obj-$(CONFIG_ECC) += ecc.o hw-obj-$(CONFIG_NAND) += nand.o diff --git a/cpu-common.h b/cpu-common.h index 6d4a898..f08f93b 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -29,6 +29,8 @@ enum device_endian { /* address in the RAM (different from a physical address) */ typedef unsigned long ram_addr_t; +#include memory.h + /* memory API */ typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, uint32_t value); diff --git a/memory.c b/memory.c new file mode 100644 index 000..742776f --- /dev/null +++ b/memory.c @@ -0,0 +1,97 @@ +/* + * RAM API + * + * Copyright Red Hat, Inc. 2010 + * + * Authors: + * Alex Williamson alex.william...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. + * + */ +#include memory.h +#include range.h + +typedef struct ram_slot { + target_phys_addr_t start_addr; + ram_addr_t size; + ram_addr_t offset; + QLIST_ENTRY(ram_slot) next; +} ram_slot; Please see CODING_STYLE for structure naming. 
+ +static QLIST_HEAD(ram_slots, ram_slot) ram_slots = + QLIST_HEAD_INITIALIZER(ram_slots); + +static ram_slot *qemu_ram_find_slot(target_phys_addr_t start_addr, + ram_addr_t size) +{ + ram_slot *slot; + + QLIST_FOREACH(slot, ram_slots, next) { + if (slot-start_addr == start_addr slot-size == size) { + return slot; + } + + if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) { + hw_error(Ram range overlaps existing slot\n); + } + } + + return NULL; +} + +int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset) +{ + ram_slot *slot; + + if (!size) { + return -EINVAL; + } + + assert(!qemu_ram_find_slot(start_addr, size)); + + slot = qemu_mallocz(sizeof(ram_slot)); Since you initialize every field by hand later, this could be qemu_malloc(). + + slot-start_addr = start_addr; + slot-size = size; + slot-offset = phys_offset; + + QLIST_INSERT_HEAD(ram_slots, slot, next); + + cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset); + + return 0; +} + +void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size) +{ + ram_slot *slot; + + if (!size) { + return; + } + + slot = qemu_ram_find_slot(start_addr, size); + assert(slot != NULL); + + QLIST_REMOVE(slot, next); + qemu_free(slot); + cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED); + + return; Useless. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND PATCH] exec: Implement qemu_ram_free_from_ptr()
Required for regions mapped via qemu_ram_alloc_from_ptr(). VFIO and ivshmem will make use of this to remove mappings when devices are hot unplugged.

Signed-off-by: Alex Williamson <alex.william...@redhat.com>
---

No comments on original patch. Obvious missing function. Cam has since requested the same function for ivshmem.

 cpu-common.h |    1 +
 exec.c       |   13 +
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/cpu-common.h b/cpu-common.h
index 6d4a898..9b763d0 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -49,6 +49,7 @@ ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t addr);
 ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
                         ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
+void qemu_ram_free_from_ptr(ram_addr_t addr);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
diff --git a/exec.c b/exec.c
index a338495..eea7ea7 100644
--- a/exec.c
+++ b/exec.c
@@ -2875,6 +2875,19 @@ ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size)
     return qemu_ram_alloc_from_ptr(dev, name, size, NULL);
 }
 
+void qemu_ram_free_from_ptr(ram_addr_t addr)
+{
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (addr == block->offset) {
+            QLIST_REMOVE(block, next);
+            qemu_free(block);
+            return;
+        }
+    }
+}
+
 void qemu_ram_free(ram_addr_t addr)
 {
     RAMBlock *block;
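The point of the function in this patch is that a block created from a caller-supplied pointer only owns its bookkeeping node, not the memory behind it, so freeing it must unlink and free the node and leave the host memory alone. A minimal standalone sketch of that idea follows. It is not QEMU code: Block, block_add and block_free_from_ptr are hypothetical names, and a plain singly linked list stands in for the QLIST macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bookkeeping node; "host" is memory owned by the caller. */
typedef struct Block {
    uint64_t offset;      /* key used to look the block up */
    void *host;           /* caller-provided mapping, never freed here */
    struct Block *next;
} Block;

/* Adds a node at the head of the list; returns NULL if malloc fails. */
Block *block_add(Block **head, uint64_t offset, void *host)
{
    Block *b;

    assert(head != NULL);
    b = malloc(sizeof(*b));
    if (!b) {
        return NULL;
    }
    b->offset = offset;
    b->host = host;
    b->next = *head;
    *head = b;
    return b;
}

/* Unlinks and frees the node whose offset matches, leaving the host
 * memory untouched. Returns 1 if a node was removed, 0 if none matched. */
int block_free_from_ptr(Block **head, uint64_t offset)
{
    Block **link;

    assert(head != NULL);
    for (link = head; *link; link = &(*link)->next) {
        if ((*link)->offset == offset) {
            Block *victim = *link;
            *link = victim->next;
            free(victim);
            return 1;
        }
    }
    return 0;
}
```

Walking a pointer-to-pointer lets the removal handle the head node and interior nodes with the same code, which is roughly what QLIST_REMOVE does through its back-pointer.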
[PATCH v4 1/2] Minimal RAM API support
This adds a minimum chunk of Anthony's RAM API support so that we can identify actual VM RAM versus all the other things that make use of qemu_ram_alloc. Signed-off-by: Alex Williamson alex.william...@redhat.com --- Makefile.objs |1 + cpu-common.h |2 + memory.c | 94 + memory.h | 44 +++ 4 files changed, 141 insertions(+), 0 deletions(-) create mode 100644 memory.c create mode 100644 memory.h diff --git a/Makefile.objs b/Makefile.objs index cebb945..47f3c3a 100644 --- a/Makefile.objs +++ b/Makefile.objs @@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o hw-obj-y += watchdog.o +hw-obj-y += memory.o hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o hw-obj-$(CONFIG_ECC) += ecc.o hw-obj-$(CONFIG_NAND) += nand.o diff --git a/cpu-common.h b/cpu-common.h index 6d4a898..f08f93b 100644 --- a/cpu-common.h +++ b/cpu-common.h @@ -29,6 +29,8 @@ enum device_endian { /* address in the RAM (different from a physical address) */ typedef unsigned long ram_addr_t; +#include memory.h + /* memory API */ typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, uint32_t value); diff --git a/memory.c b/memory.c new file mode 100644 index 000..07cb020 --- /dev/null +++ b/memory.c @@ -0,0 +1,94 @@ +/* + * RAM API + * + * Copyright Red Hat, Inc. 2010 + * + * Authors: + * Alex Williamson alex.william...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. See + * the COPYING file in the top-level directory. 
+ * + */ +#include memory.h +#include range.h + +typedef struct RamSlot { +target_phys_addr_t start_addr; +ram_addr_t size; +ram_addr_t offset; +QLIST_ENTRY(RamSlot) next; +} RamSlot; + +static QLIST_HEAD(ram_slot_list, RamSlot) ram_slot_list = +QLIST_HEAD_INITIALIZER(ram_slot_list); + +static RamSlot *ram_find_slot(target_phys_addr_t start_addr, ram_addr_t size) +{ +RamSlot *slot; + +QLIST_FOREACH(slot, ram_slot_list, next) { +if (slot-start_addr == start_addr slot-size == size) { +return slot; +} + +if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) { +hw_error(Ram range overlaps existing slot\n); +} +} + +return NULL; +} + +int ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset) +{ +RamSlot *slot; + +if (!size) { +return -EINVAL; +} + +assert(!ram_find_slot(start_addr, size)); + +slot = qemu_malloc(sizeof(RamSlot)); + +slot-start_addr = start_addr; +slot-size = size; +slot-offset = phys_offset; + +QLIST_INSERT_HEAD(ram_slot_list, slot, next); + +cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset); + +return 0; +} + +void ram_unregister(target_phys_addr_t start_addr, ram_addr_t size) +{ +RamSlot *slot; + +if (!size) { +return; +} + +slot = ram_find_slot(start_addr, size); +assert(slot != NULL); + +QLIST_REMOVE(slot, next); +qemu_free(slot); +cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED); +} + +int ram_for_each_slot(void *opaque, ram_for_each_slot_fn fn) +{ +RamSlot *slot; + +QLIST_FOREACH(slot, ram_slot_list, next) { +int ret = fn(opaque, slot-start_addr, slot-size, slot-offset); +if (ret) { +return ret; +} +} +return 0; +} diff --git a/memory.h b/memory.h new file mode 100644 index 000..98c85ea --- /dev/null +++ b/memory.h @@ -0,0 +1,44 @@ +#ifndef QEMU_MEMORY_H +#define QEMU_MEMORY_H +/* + * RAM API + * + * Copyright Red Hat, Inc. 2010 + * + * Authors: + * Alex Williamson alex.william...@redhat.com + * + * This work is licensed under the terms of the GNU GPL, version 2. 
See + * the COPYING file in the top-level directory. + * + */ + +#include qemu-common.h +#include cpu-common.h + +typedef int (*ram_for_each_slot_fn)(void *opaque, +target_phys_addr_t start_addr, +ram_addr_t size, +ram_addr_t phys_offset); + +/** + * ram_register() : Register a region of guest physical memory + * + * The new region must not overlap an existing region. + */ +int ram_register(target_phys_addr_t start_addr, ram_addr_t size, + ram_addr_t phys_offset); + +/** + * ram_unregister() : Unregister a region of guest physical memory + */ +void ram_unregister(target_phys_addr_t start_addr, ram_addr_t size); + +/** + * ram_for_each_slot() : Call fn() on each registered region + * + * Stop on non-zero return from fn(). + */ +int ram_for_each_slot(void *opaque, ram_for_each_slot_fn fn); + +#endif /* QEMU_MEMORY_H */ -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a
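For readers who want to try the slot bookkeeping outside of QEMU, here is a rough standalone sketch of the same idea: a list of non-overlapping ranges with register, unregister and a for-each callback. All names in it (Slot, slot_register, slot_unregister, slot_for_each, sum_sizes) are made up for illustration, plain uint64_t stands in for target_phys_addr_t and ram_addr_t, and errors are returned as -1 where the patch uses assert() and hw_error().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's RamSlot. */
typedef struct Slot {
    uint64_t start;     /* guest physical start address */
    uint64_t size;      /* length in bytes */
    uint64_t offset;    /* backing RAM offset */
    struct Slot *next;
} Slot;

static Slot *slots;     /* head of the registered-slot list */

typedef int (*slot_fn)(void *opaque, uint64_t start, uint64_t size,
                       uint64_t offset);

/* True if [a, a+alen) and [b, b+blen) share a byte (no wraparound handling). */
static int overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return alen && blen && a < b + blen && b < a + alen;
}

/* Returns 0 on success, -1 for an empty or overlapping range. */
static int slot_register(uint64_t start, uint64_t size, uint64_t offset)
{
    Slot *s;

    if (!size) {
        return -1;
    }
    for (s = slots; s; s = s->next) {
        if (overlap(start, size, s->start, s->size)) {
            return -1;
        }
    }
    s = malloc(sizeof(*s));
    if (!s) {
        return -1;
    }
    s->start = start;
    s->size = size;
    s->offset = offset;
    s->next = slots;
    slots = s;
    return 0;
}

/* Removes the slot that matches exactly; returns -1 if there is none. */
static int slot_unregister(uint64_t start, uint64_t size)
{
    Slot **pp;

    for (pp = &slots; *pp; pp = &(*pp)->next) {
        if ((*pp)->start == start && (*pp)->size == size) {
            Slot *dead = *pp;
            *pp = dead->next;
            free(dead);
            return 0;
        }
    }
    return -1;
}

/* Calls fn() on every slot and stops at the first non-zero return. */
static int slot_for_each(void *opaque, slot_fn fn)
{
    Slot *s;

    for (s = slots; s; s = s->next) {
        int ret = fn(opaque, s->start, s->size, s->offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Example callback: add up the size of all registered slots. */
static int sum_sizes(void *opaque, uint64_t start, uint64_t size,
                     uint64_t offset)
{
    (void)start;
    (void)offset;
    *(uint64_t *)opaque += size;
    return 0;
}
```

A consumer such as an IOMMU setup path would pass its own callback in place of sum_sizes and map each range it is handed.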
[PATCH v4 0/2] Minimal RAM API support
Update per comments,

Thanks,

Alex

v4:
 - ram_slot -> RamSlot (per CODING_STYLE)
 - drop qemu_ prefix from functions (per CODING_STYLE)
 - mallocz -> malloc
 - drop extraneous return from void function

v3:
 - Address review comments
 - pc registers all memory below 4G in one chunk

Let me know if there are any further issues.

v2:
 - Move to Makefile.objs
 - Move structures to memory.c and create a callback function
 - Fix memory leak

I haven't moved to the state parameter because there should only be a
single instance of this per VM. The state parameter seems like it would
add complications in setup and function calling, but maybe point me to
an example if I'm off base.

v1:

For VFIO based device assignment, we need to know what guest memory
areas are actual RAM. RAMBlocks have long since become a grab bag of
misc allocations, so aren't effective for this. Anthony has had a RAM
API in mind for a while now that addresses this problem. This
implements just enough of it so that we have an interface to get actual
guest memory physical addresses to set up the host IOMMU. We can
continue building a full RAM API on top of this stub.

Anthony, feel free to add copyright to memory.c as it's based on your
initial implementation. I had to add something since the file in your
branch just copies a header with Fabrice's copyright.

---

Alex Williamson (2):
      RAM API: Make use of it for x86 PC
      Minimal RAM API support

 Makefile.objs |1 +
 cpu-common.h |2 +
 hw/pc.c |9 ++---
 memory.c | 94 +
 memory.h | 44 +++
 5 files changed, 144 insertions(+), 6 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h
[PATCH v4 2/2] RAM API: Make use of it for x86 PC
Register the actual VM RAM using the new API

Signed-off-by: Alex Williamson alex.william...@redhat.com
---
 hw/pc.c |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index e1b2667..87adca2 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
     /* allocate RAM */
     ram_addr = qemu_ram_alloc(NULL, "pc.ram",
                               below_4g_mem_size + above_4g_mem_size);
-    cpu_register_physical_memory(0, 0xa, ram_addr);
-    cpu_register_physical_memory(0x10,
-                                 below_4g_mem_size - 0x10,
-                                 ram_addr + 0x10);
+    ram_register(0, below_4g_mem_size, ram_addr);
 #if TARGET_PHYS_ADDR_BITS > 32
     if (above_4g_mem_size > 0) {
-        cpu_register_physical_memory(0x1ULL, above_4g_mem_size,
-                                     ram_addr + below_4g_mem_size);
+        ram_register(0x1ULL, above_4g_mem_size,
+                     ram_addr + below_4g_mem_size);
     }
 #endif
[PATCH] RFC: delay pci_update_mappings for 64-bit BARs
Do not call pci_update_mappings on the lower 32 bits of a 64-bit BAR.
Wait for the upper 32 bits, or else QEMU will try to map using just the
lower 32 bits, which is probably going to corrupt memory.

I was encountering crashes when mapping certain PCI region sizes. It
turns out that pci_update_mappings is being called before all 64 bits
of the BAR have been written. For example, when mapping to 0x18000, the
remapping happened as soon as the lower 32 bits were written (mapping
to 0x800), which would overwrite something.

I'm not certain this is completely correct; I'm simply testing that the
lower 4 bits contain only the MEM_TYPE_64 flag. Upper 32-bit address
parts can be values like 0xff, which is tricky to test against.

Cam
---
 hw/pci.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 438c0d1..3b81792 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1000,6 +1000,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
 {
     int i, was_irq_disabled = pci_irq_disabled(d);
     uint32_t config_size = pci_config_size(d);
+    int is_64 = 0;
+
+    is_64 = ((val & 0xf) == PCI_BASE_ADDRESS_MEM_TYPE_64);
 
     for (i = 0; i < l && addr + i < config_size; val >>= 8, ++i) {
         uint8_t wmask = d->wmask[addr + i];
@@ -1008,7 +1011,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
         d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
         d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
     }
-    if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
+    if ((ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) && (!is_64)) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
         range_covers_byte(addr, l, PCI_COMMAND))
-- 
1.7.0.4
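To make the failure mode concrete, here is a small standalone sketch (not QEMU code) of how a 64-bit memory BAR is put together from its two dwords. The helper names bar_is_64bit and bar64_base are hypothetical; the bit positions follow the PCI BAR layout (bit 0 selects I/O space, bits 2:1 give the memory type, the low four bits are not part of the address). Note that the sketch masks with the type field only, since a prefetchable 64-bit BAR has 0xc in its low nibble.

```c
#include <assert.h>
#include <stdint.h>

#define BAR_SPACE_IO       0x01u     /* bit 0: I/O space indicator */
#define BAR_MEM_TYPE_MASK  0x06u     /* bits 2:1: memory type */
#define BAR_MEM_TYPE_64    0x04u     /* type value of a 64-bit BAR */
#define BAR_MEM_ADDR_MASK  (~0x0fu)  /* address bits of the low dword */

/* True if the low dword describes a 64-bit memory BAR. */
static int bar_is_64bit(uint32_t lo)
{
    return !(lo & BAR_SPACE_IO) &&
           (lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}

/* Combine both dwords of a 64-bit BAR into the base address. */
static uint64_t bar64_base(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (lo & BAR_MEM_ADDR_MASK);
}
```

With lo = 0x8000000c and hi = 0x1 the real base is 0x180000000, while a mapping computed before the upper dword arrives (hi still 0) lands at 0x80000000, which is the kind of bogus intermediate mapping described above.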
Re: [PATCH v2 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
On 13.12.2010 11:10, Michael S. Tsirkin wrote:
> On Sun, Dec 12, 2010 at 12:22:40PM +0100, Jan Kiszka wrote:
>> The result may look simpler on first glance than v1, but it comes with
>> more subtle race scenarios IMO. I thought them through, hopefully
>> catching all, but I would appreciate any skeptical review.
>
> Thought about the races till my head hurt, and yes, they all seem to be
> handled correctly. FWIW

Ouch, I'm endlessly sorry for causing this pain.

> Reviewed-by: Michael S. Tsirkin m...@redhat.com

Thanks!

Jan
[PATCH v3 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices
This addresses the review comments of the previous round:
 - renamed irq_data::status to drv_status
 - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
 - fixed signature of get_irq_status (irq is now unsigned int)
 - converted register_lock into a global one
 - fixed critical white space breakage (that I just left in to check if
   anyone is actually reading the code, of course...)

Note: The KVM patch still depends on
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/64515

Thanks for all comments!

Final but critical question: Who will pick up which bits?

Jan Kiszka (4):
  genirq: Introduce driver-readable IRQ status word
  genirq: Inform handler about line sharing state
  genirq: Add support for IRQF_COND_ONESHOT
  KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

 Documentation/kvm/api.txt | 27
 arch/x86/kvm/x86.c |1 +
 include/linux/interrupt.h | 15 ++
 include/linux/irq.h |2 +
 include/linux/kvm.h |6 +
 include/linux/kvm_host.h | 10 ++-
 kernel/irq/manage.c | 77 ++-
 virt/kvm/assigned-dev.c | 336 -
 8 files changed, 436 insertions(+), 38 deletions(-)
[PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word
From: Jan Kiszka jan.kis...@siemens.com This associates a status word with every IRQ descriptor. Drivers can obtain its content via get_irq_status(irq). First use case will be propagating the interrupt sharing state. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/interrupt.h |2 ++ include/linux/irq.h |2 ++ kernel/irq/manage.c | 15 +++ 3 files changed, 19 insertions(+), 0 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 79d0c4f..4c1aa72 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -126,6 +126,8 @@ struct irqaction { extern irqreturn_t no_action(int cpl, void *dev_id); +extern unsigned long get_irq_status(unsigned int irq); + #ifdef CONFIG_GENERIC_HARDIRQS extern int __must_check request_threaded_irq(unsigned int irq, irq_handler_t handler, diff --git a/include/linux/irq.h b/include/linux/irq.h index abde252..8bdb421 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -96,6 +96,7 @@ struct msi_desc; * methods, to allow shared chip implementations * @msi_desc: MSI descriptor * @affinity: IRQ affinity on SMP + * @drv_status:driver-readable status flags (IRQS_*) * * The fields here need to overlay the ones in irq_desc until we * cleaned up the direct references and switched everything over to @@ -111,6 +112,7 @@ struct irq_data { #ifdef CONFIG_SMP cpumask_var_t affinity; #endif + unsigned long drv_status; }; /** diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 5f92acc..2ea0d30 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -1157,3 +1157,18 @@ int request_any_context_irq(unsigned int irq, irq_handler_t handler, return !ret ? IRQC_IS_HARDIRQ : ret; } EXPORT_SYMBOL_GPL(request_any_context_irq); + +/** + * get_irq_status - read interrupt line status word + * @irq: Interrupt line of the status word + * + * This returns the current content of the status word associated with + * the given interrupt line. See IRQS_* flags for details. 
+ */ +unsigned long get_irq_status(unsigned int irq) +{ + struct irq_desc *desc = irq_to_desc(irq); + + return desc ? desc-irq_data.drv_status : 0; +} +EXPORT_SYMBOL_GPL(get_irq_status); -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/4] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices
From: Jan Kiszka jan.kis...@siemens.com PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share IRQs of such devices on the host side when passing them to a guest. However, IRQ disabling via the PCI config space is more costly than masking the line via disable_irq. Therefore we register the IRQ in adaptive mode and switch between line and device level disabling on demand. This feature is optional, user space has to request it explicitly as it also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That way, we can avoid unmasking the interrupt and signaling it if the guest masked it via the PCI config space. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Documentation/kvm/api.txt | 27 arch/x86/kvm/x86.c|1 + include/linux/kvm.h |6 + include/linux/kvm_host.h | 10 ++- virt/kvm/assigned-dev.c | 336 - 5 files changed, 346 insertions(+), 34 deletions(-) diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt index e1a9297..1c34e25 100644 --- a/Documentation/kvm/api.txt +++ b/Documentation/kvm/api.txt @@ -1112,6 +1112,14 @@ following flags are specified: /* Depends on KVM_CAP_IOMMU */ #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 0) +/* The following two depend on KVM_CAP_PCI_2_3 */ +#define KVM_DEV_ASSIGN_PCI_2_3 (1 1) +#define KVM_DEV_ASSIGN_MASK_INTX (1 2) + +If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts +via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other +assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the +guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details. 4.48 KVM_DEASSIGN_PCI_DEVICE @@ -1263,6 +1271,25 @@ struct kvm_assigned_msix_entry { __u16 padding[3]; }; +4.54 KVM_ASSIGN_SET_INTX_MASK + +Capability: KVM_CAP_PCI_2_3 +Architectures: x86 +Type: vm ioctl +Parameters: struct kvm_assigned_pci_dev (in) +Returns: 0 on success, -1 on error + +Informs the kernel about the guest's view on the INTx mask. 
As long as the +guest masks the legacy INTx, the kernel will refrain from unmasking it at +hardware level and will not assert the guest's IRQ line. User space is still +responsible for applying this state to the assigned device's real config space. +To avoid that the kernel overwrites the state user space wants to set, +KVM_ASSIGN_SET_INTX_MASK has to be called prior to updating the config space. + +See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified +by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is +evaluated. + 5. The kvm_run structure Application code obtains a pointer to the kvm_run structure by diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index ed373ba..8775a54 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -1965,6 +1965,7 @@ int kvm_dev_ioctl_check_extension(long ext) case KVM_CAP_X86_ROBUST_SINGLESTEP: case KVM_CAP_XSAVE: case KVM_CAP_ASYNC_PF: + case KVM_CAP_PCI_2_3: r = 1; break; case KVM_CAP_COALESCED_MMIO: diff --git a/include/linux/kvm.h b/include/linux/kvm.h index ea2dc1a..3cadb42 100644 --- a/include/linux/kvm.h +++ b/include/linux/kvm.h @@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo { #define KVM_CAP_PPC_GET_PVINFO 57 #define KVM_CAP_PPC_IRQ_LEVEL 58 #define KVM_CAP_ASYNC_PF 59 +#define KVM_CAP_PCI_2_3 60 #ifdef KVM_CAP_IRQ_ROUTING @@ -677,6 +678,9 @@ struct kvm_clock_data { #define KVM_SET_PIT2 _IOW(KVMIO, 0xa0, struct kvm_pit_state2) /* Available with KVM_CAP_PPC_GET_PVINFO */ #define KVM_PPC_GET_PVINFO _IOW(KVMIO, 0xa1, struct kvm_ppc_pvinfo) +/* Available with KVM_CAP_PCI_2_3 */ +#define KVM_ASSIGN_SET_INTX_MASK _IOW(KVMIO, 0xa2, \ + struct kvm_assigned_pci_dev) /* * ioctls for vcpu fds @@ -742,6 +746,8 @@ struct kvm_clock_data { #define KVM_SET_XCRS _IOW(KVMIO, 0xa7, struct kvm_xcrs) #define KVM_DEV_ASSIGN_ENABLE_IOMMU(1 0) +#define KVM_DEV_ASSIGN_PCI_2_3 (1 1) +#define KVM_DEV_ASSIGN_MASK_INTX (1 2) struct kvm_assigned_pci_dev { __u32 assigned_dev_id; diff --git 
a/include/linux/kvm_host.h b/include/linux/kvm_host.h index ac4e83a..4f95070 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -477,6 +477,12 @@ struct kvm_irq_ack_notifier { void (*irq_acked)(struct kvm_irq_ack_notifier *kian); }; +enum kvm_intx_state { + KVM_INTX_ENABLED, + KVM_INTX_LINE_DISABLED, + KVM_INTX_DEVICE_DISABLED, +}; + struct kvm_assigned_dev_kernel { struct kvm_irq_ack_notifier ack_notifier; struct list_head list; @@ -486,7 +492,7 @@ struct kvm_assigned_dev_kernel { int host_devfn; unsigned int entries_nr; int host_irq; - bool host_irq_disabled; + unsigned long
[PATCH v3 2/4] genirq: Inform handler about line sharing state
From: Jan Kiszka jan.kis...@siemens.com

This enables interrupt handlers to retrieve the current line sharing
state via the new interrupt status word so that they can adapt to it.

The switch from shared to exclusive is generally uncritical and can
thus be performed on demand. However, preparing a line for shared mode
may require preparatory steps by the currently registered handler. It
can therefore request an ahead-of-time notification via IRQF_ADAPTIVE.
The notification consists of an exceptional handler invocation with
IRQS_MAKE_SHAREABLE set in the status word.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 include/linux/interrupt.h | 10 +
 kernel/irq/manage.c | 47 ++--
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 4c1aa72..12e5fc0 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -55,6 +55,7 @@ *Used by threaded interrupts which need to keep the *irq line disabled until the threaded handler has been run. 
* IRQF_NO_SUSPEND - Do not disable this IRQ during suspend + * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing * */ #define IRQF_DISABLED 0x0020 @@ -67,6 +68,7 @@ #define IRQF_IRQPOLL 0x1000 #define IRQF_ONESHOT 0x2000 #define IRQF_NO_SUSPEND0x4000 +#define IRQF_ADAPTIVE 0x8000 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND) @@ -126,6 +128,14 @@ struct irqaction { extern irqreturn_t no_action(int cpl, void *dev_id); +/* + * Driver-readable IRQ line status flags: + * IRQS_SHARED - line is shared between multiple handlers + * IRQS_MAKE_SHAREABLE - in the process of making an exclusive line shareable + */ +#define IRQS_SHARED0x0001 +#define IRQS_MAKE_SHAREABLE0x0002 + extern unsigned long get_irq_status(unsigned int irq); #ifdef CONFIG_GENERIC_HARDIRQS diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 2ea0d30..2dd4eef 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -14,9 +14,12 @@ #include linux/interrupt.h #include linux/slab.h #include linux/sched.h +#include linux/mutex.h #include internals.h +static DEFINE_MUTEX(register_lock); + /** * synchronize_irq - wait for pending IRQ handlers (on other CPUs) * @irq: interrupt number to wait for @@ -754,6 +757,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) old = *old_ptr; } while (old); shared = 1; + + desc-irq_data.drv_status |= IRQS_SHARED; } if (!shared) { @@ -883,6 +888,7 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id) { struct irq_desc *desc = irq_to_desc(irq); struct irqaction *action, **action_ptr; + bool single_handler = false; unsigned long flags; WARN(in_interrupt(), Trying to free IRQ %d from IRQ context!\n, irq); @@ -928,7 +934,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id) desc-irq_data.chip-irq_shutdown(desc-irq_data); else desc-irq_data.chip-irq_disable(desc-irq_data); - } + } else if (!desc-action-next) + single_handler = true; #ifdef CONFIG_SMP /* make sure affinity_hint 
is cleaned up */ @@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id) /* Make sure it's not being used on another CPU: */ synchronize_irq(irq); + if (single_handler) + desc-irq_data.drv_status = ~IRQS_SHARED; + #ifdef CONFIG_DEBUG_SHIRQ /* * It's a shared IRQ -- the driver ought to be prepared for an IRQ @@ -1002,9 +1012,13 @@ void free_irq(unsigned int irq, void *dev_id) if (!desc) return; + mutex_lock(register_lock); + chip_bus_lock(desc); kfree(__free_irq(irq, dev_id)); chip_bus_sync_unlock(desc); + + mutex_unlock(register_lock); } EXPORT_SYMBOL(free_irq); @@ -1055,7 +1069,7 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler, irq_handler_t thread_fn, unsigned long irqflags, const char *devname, void *dev_id) { - struct irqaction *action; + struct irqaction *action, *old_action; struct irq_desc *desc; int retval; @@ -1091,12 +1105,39 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler, action-name = devname; action-dev_id = dev_id; + mutex_lock(register_lock); + + old_action = desc-action; + if (old_action (old_action-flags IRQF_ADAPTIVE) + !(desc-irq_data.drv_status IRQS_SHARED)) { + /* +* Signal the old
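The hunk above is cut off in this archive, so as a reading aid here is a small userspace model of the behaviour the commit message describes. It is only a sketch under assumptions: struct line, line_add, line_remove and demo_handler are invented names, ST_SHARED and ST_MAKE_SHAREABLE merely mirror IRQS_SHARED and IRQS_MAKE_SHAREABLE, and none of the locking or the real irq_desc handling is represented.

```c
#include <assert.h>

#define ST_SHARED          0x1ul  /* mirrors IRQS_SHARED */
#define ST_MAKE_SHAREABLE  0x2ul  /* mirrors IRQS_MAKE_SHAREABLE */

struct line;
typedef void (*handler_fn)(struct line *l);

struct line {
    unsigned long status;  /* driver-readable status word */
    int nhandlers;         /* number of registered handlers */
    handler_fn first;      /* handler of the first registrant */
    int first_adaptive;    /* first registrant asked for notification */
    int prepared;          /* set by demo_handler when it is notified */
};

/* Demo handler: reacts only to the "about to be shared" notification. */
static void demo_handler(struct line *l)
{
    if (l->status & ST_MAKE_SHAREABLE) {
        l->prepared = 1;   /* e.g. switch to device-level masking */
    }
}

/* Register a handler; notify an adaptive exclusive owner beforehand. */
static void line_add(struct line *l, handler_fn fn, int adaptive)
{
    if (l->nhandlers == 0) {
        l->first = fn;
        l->first_adaptive = adaptive;
    } else {
        if (l->nhandlers == 1 && l->first_adaptive) {
            l->status |= ST_MAKE_SHAREABLE;
            l->first(l);   /* exceptional invocation */
            l->status &= ~ST_MAKE_SHAREABLE;
        }
        l->status |= ST_SHARED;
    }
    l->nhandlers++;
}

/* Drop one handler (the model does not track which one). */
static void line_remove(struct line *l)
{
    if (l->nhandlers > 0 && --l->nhandlers <= 1) {
        l->status &= ~ST_SHARED;
    }
}
```

The point of the ordering in line_add is that the first owner sees ST_MAKE_SHAREABLE before the second handler is live, and only afterwards does the line report ST_SHARED.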
[PATCH v3 3/4] genirq: Add support for IRQF_COND_ONESHOT
From: Jan Kiszka jan.kis...@siemens.com Provide an adaptive version of IRQF_ONESHOT: If the line is exclusively used, IRQF_COND_ONESHOT provides the same semantics as IRQF_ONESHOT. If it is shared, the line will be unmasked directly after the hardirq handler, just as if IRQF_COND_ONESHOT was not provided. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- include/linux/interrupt.h |3 +++ kernel/irq/manage.c | 19 --- 2 files changed, 19 insertions(+), 3 deletions(-) diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h index 12e5fc0..bbb16f4 100644 --- a/include/linux/interrupt.h +++ b/include/linux/interrupt.h @@ -56,6 +56,8 @@ *irq line disabled until the threaded handler has been run. * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing + * IRQF_COND_ONESHOT - If line is not shared, keep interrupt disabled after + * hardirq handler finshed. * */ #define IRQF_DISABLED 0x0020 @@ -69,6 +71,7 @@ #define IRQF_ONESHOT 0x2000 #define IRQF_NO_SUSPEND0x4000 #define IRQF_ADAPTIVE 0x8000 +#define IRQF_COND_ONESHOT 0x0001 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND) diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c index 2dd4eef..9a73633 100644 --- a/kernel/irq/manage.c +++ b/kernel/irq/manage.c @@ -583,7 +583,7 @@ static int irq_thread(void *data) struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, }; struct irqaction *action = data; struct irq_desc *desc = irq_to_desc(action-irq); - int wake, oneshot = desc-status IRQ_ONESHOT; + int wake, oneshot; sched_setscheduler(current, SCHED_FIFO, param); current-irqaction = action; @@ -606,6 +606,7 @@ static int irq_thread(void *data) desc-status |= IRQ_PENDING; raw_spin_unlock_irq(desc-lock); } else { + oneshot = desc-status IRQ_ONESHOT; raw_spin_unlock_irq(desc-lock); action-thread_fn(action-irq, action-dev_id); @@ -759,6 +760,15 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) 
shared = 1; desc-irq_data.drv_status |= IRQS_SHARED; + desc-status = ~IRQ_ONESHOT; + + /* Unmask if the interrupt was masked due to oneshot mode. */ + if ((desc-status +(IRQ_INPROGRESS | IRQ_DISABLED | IRQ_MASKED)) == + IRQ_MASKED) { + desc-irq_data.chip-irq_unmask(desc-irq_data); + desc-status = ~IRQ_MASKED; + } } if (!shared) { @@ -783,7 +793,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new) desc-status = ~(IRQ_AUTODETECT | IRQ_WAITING | IRQ_ONESHOT | IRQ_INPROGRESS | IRQ_SPURIOUS_DISABLED); - if (new-flags IRQF_ONESHOT) + if (new-flags (IRQF_ONESHOT | IRQF_COND_ONESHOT)) desc-status |= IRQ_ONESHOT; if (!(desc-status IRQ_NOAUTOEN)) { @@ -934,8 +944,11 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id) desc-irq_data.chip-irq_shutdown(desc-irq_data); else desc-irq_data.chip-irq_disable(desc-irq_data); - } else if (!desc-action-next) + } else if (!desc-action-next) { single_handler = true; + if (desc-action-flags IRQF_COND_ONESHOT) + desc-status |= IRQ_ONESHOT; + } #ifdef CONFIG_SMP /* make sure affinity_hint is cleaned up */ -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
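Reduced to its decision rule, the semantics described in the commit message can be sketched as below. This is a toy model, not kernel code: F_COND_ONESHOT is a hypothetical stand-in for IRQF_COND_ONESHOT with an arbitrary value, and cond_oneshot_active is an invented helper name.

```c
#include <assert.h>

#define F_COND_ONESHOT 0x1u  /* hypothetical stand-in for IRQF_COND_ONESHOT */

/*
 * Should the line stay masked until the threaded handler has run?
 * Only while the handler asked for conditional oneshot AND the line
 * is used exclusively; once shared, it is unmasked after the hardirq.
 */
static int cond_oneshot_active(unsigned int flags, int shared)
{
    return (flags & F_COND_ONESHOT) && !shared;
}
```

Evaluating the rule at the time of each interrupt, rather than once at registration, matches the patch moving the oneshot read in irq_thread() into the loop.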
[PATCH 0/5] pci-assign: Host IRQ sharing support + some fixes and cleanups
This series includes cleanups of the PCI config access of assigned
devices, fixes a corner case in this area, removes that suspicious VGA
hunk from assigned_dev_pci_read_config, and finally enables the latest
host IRQ sharing support via PCI-2.3 interrupt masking. See the patches
for details.

Jan Kiszka (5):
  pci-assign: Clean up assigned_dev_pci_read/write_config
  pci-assign: Fix dword read at PCI_COMMAND
  pci-assign: Remove suspicious hunk from assigned_dev_pci_read_config
  pci-assign: Convert need_emulate_cmd into a bitmask
  pci-assign: Use PCI-2.3-based shared legacy interrupts

 hw/device-assignment.c | 100 ---
 hw/device-assignment.h |2 +-
 qemu-kvm.c |8
 qemu-kvm.h |3 +
 4 files changed, 88 insertions(+), 25 deletions(-)
[PATCH 1/5] pci-assign: Clean up assigned_dev_pci_read/write_config
From: Jan Kiszka jan.kis...@siemens.com

Use ranges_overlap and proper constants to match the access range
against regions that need special handling. This also fixes a so far
uncaught high-byte write access to the command register. Moreover, use
more constants instead of magic numbers.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c | 39 +--
 1 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 50c6408..bc3a57b 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -438,13 +438,20 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
         return assigned_device_pci_cap_write_config(d, address, val, len);
     }
 
-    if (address == 0x4) {
+    if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
         pci_default_write_config(d, address, val, len);
         /* Continue to program the card */
     }
 
-    if ((address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d) {
+    /*
+     * Catch access to
+     *  - base address registers
+     *  - ROM base address & capability pointer
+     *  - interrupt line & pin
+     */
+    if (ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
+        ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
+        ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
         /* used for update-mappings (BAR emulation) */
         pci_default_write_config(d, address, val, len);
         return;
@@ -484,9 +491,20 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
         return val;
     }
 
-    if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
-        (address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d) {
+    /*
+     * Catch access to
+     *  - vendor & device ID
+     *  - command register (if emulation needed)
+     *  - base address registers
+     *  - ROM base address & capability pointer
+     *  - interrupt line & pin
+     */
+    if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
+        (pci_dev->need_emulate_cmd &&
+         ranges_overlap(address, len, PCI_COMMAND, 2)) ||
+        ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
+        ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
+        ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
@@ -517,10 +535,11 @@ do_log:
 
     if (!pci_dev->cap.available) {
         /* kill the special capabilities */
-        if (address == 4 && len == 4)
-            val &= ~0x100000;
-        else if (address == 6)
-            val &= ~0x10;
+        if (address == PCI_COMMAND && len == 4) {
+            val &= ~(PCI_STATUS_CAP_LIST << 16);
+        } else if (address == PCI_STATUS) {
+            val &= ~PCI_STATUS_CAP_LIST;
+        }
     }
 
     return val;
-- 
1.7.1
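The high-byte case mentioned in the commit message is easy to see with a standalone version of the range test. The sketch below is not QEMU's range.h; overlaps, hits_command_old and hits_command_new are illustrative names, and only the offset of the 16-bit command register (0x04) is taken from the PCI config header layout.

```c
#include <assert.h>
#include <stdint.h>

#define CFG_COMMAND 0x04u  /* offset of the 16-bit command register */

/*
 * True if [first1, first1+len1) and [first2, first2+len2) intersect.
 * Lengths are assumed to be non-zero and the ranges must not wrap.
 */
static int overlaps(uint32_t first1, uint32_t len1,
                    uint32_t first2, uint32_t len2)
{
    uint32_t last1 = first1 + len1 - 1;
    uint32_t last2 = first2 + len2 - 1;

    return !(last2 < first1 || last1 < first2);
}

/* Old-style check: matches only accesses that start at the register. */
static int hits_command_old(uint32_t addr)
{
    return addr == CFG_COMMAND;
}

/* Range-based check: matches any access touching either command byte. */
static int hits_command_new(uint32_t addr, uint32_t len)
{
    return overlaps(addr, len, CFG_COMMAND, 2);
}
```

A one-byte write to offset 0x05 is missed by the equality test but caught by the range test, while accesses to the neighbouring ID and status registers still fall through.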
[PATCH 2/5] pci-assign: Fix dword read at PCI_COMMAND
From: Jan Kiszka jan.kis...@siemens.com

If we emulate the command register, we must only read its content from
the shadow config space. For dword read of both PCI_COMMAND and
PCI_STATUS, at least the latter must be read from the device.

For simplicity and because the code path is not considered performance
critical for the affected SR-IOV devices, the fix performs device
access to the command word unconditionally, even if emulation is
enabled and only that word is read.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c | 14 +++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index bc3a57b..6ff1456 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -494,14 +494,11 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     /*
      * Catch access to
      *  - vendor & device ID
-     *  - command register (if emulation needed)
      *  - base address registers
      *  - ROM base address & capability pointer
      *  - interrupt line & pin
      */
     if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
-        (pci_dev->need_emulate_cmd &&
-         ranges_overlap(address, len, PCI_COMMAND, 2)) ||
         ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
         ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
         ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
@@ -533,6 +530,17 @@ do_log:
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
 
+    if (pci_dev->need_emulate_cmd &&
+        ranges_overlap(address, len, PCI_COMMAND, 2)) {
+        if (address == PCI_COMMAND) {
+            val &= 0xffff0000;
+            val |= pci_default_read_config(d, PCI_COMMAND, 2);
+        } else {
+            /* high-byte access */
+            val = pci_default_read_config(d, PCI_COMMAND+1, 1);
+        }
+    }
+
     if (!pci_dev->cap.available) {
         /* kill the special capabilities */
         if (address == PCI_COMMAND && len == 4) {
-- 
1.7.1
[PATCH 4/5] pci-assign: Convert need_emulate_cmd into a bitmask
From: Jan Kiszka jan.kis...@siemens.com

Define a mask of PCI command register bits that need to be emulated,
i.e. read back from their shadow state. We will need this for
selectively emulating the INTx mask bit.

Note: No initialization of emulate_cmd_mask to zero needed, the device
state is already zero-initialized.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c |   18 ++++++++++--------
 hw/device-assignment.h |    2 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index ef045f4..26d3bd7 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -525,14 +525,17 @@ again:
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
 
-    if (pci_dev->need_emulate_cmd &&
+    if (pci_dev->emulate_cmd_mask &&
         ranges_overlap(address, len, PCI_COMMAND, 2)) {
         if (address == PCI_COMMAND) {
-            val &= 0xffff0000;
-            val |= pci_default_read_config(d, PCI_COMMAND, 2);
+            val &= ~pci_dev->emulate_cmd_mask;
+            val |= pci_default_read_config(d, PCI_COMMAND, 2) &
+                   pci_dev->emulate_cmd_mask;
         } else {
             /* high-byte access */
-            val = pci_default_read_config(d, PCI_COMMAND+1, 1);
+            val &= ~(pci_dev->emulate_cmd_mask >> 8);
+            val |= pci_default_read_config(d, PCI_COMMAND+1, 1) &
+                   (pci_dev->emulate_cmd_mask >> 8);
         }
     }
 
@@ -800,10 +803,9 @@ again:
 
     /* dealing with virtual function device */
     snprintf(name, sizeof(name), "%sphysfn/", dir);
-    if (!stat(name, &statbuf))
-        pci_dev->need_emulate_cmd = 1;
-    else
-        pci_dev->need_emulate_cmd = 0;
+    if (!stat(name, &statbuf)) {
+        pci_dev->emulate_cmd_mask = 0xffff;
+    }
 
     dev->region_number = r;
     return 0;
diff --git a/hw/device-assignment.h b/hw/device-assignment.h
index c94a730..9ead022 100644
--- a/hw/device-assignment.h
+++ b/hw/device-assignment.h
@@ -109,7 +109,7 @@ typedef struct AssignedDevice {
     void *msix_table_page;
     target_phys_addr_t msix_table_addr;
     int mmio_index;
-    int need_emulate_cmd;
+    uint32_t emulate_cmd_mask;
     char *configfd_name;
     QLIST_ENTRY(AssignedDevice) next;
 } AssignedDevice;
-- 
1.7.1
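The masked read-back described in this patch can be tried outside of QEMU. What follows is only a minimal sketch under stated assumptions, not the qemu-kvm code: emulate_cmd_dword, emulate_cmd_high_byte and emulate_cmd_examples are hypothetical names, and the device and shadow values are passed in as plain integers instead of being read from config space.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of qemu-kvm): merge a word or dword read
 * that starts at PCI_COMMAND. Bits set in emulate_mask are taken from the
 * shadow command register; all other bits, including PCI_STATUS in the
 * upper half of a dword read, are taken from the device value.
 */
static uint32_t emulate_cmd_dword(uint32_t device_val, uint16_t shadow_cmd,
                                  uint16_t emulate_mask)
{
    uint32_t val = device_val & ~(uint32_t)emulate_mask;

    return val | (uint32_t)(shadow_cmd & emulate_mask);
}

/* Same merge for a one-byte read of the command register's high byte. */
static uint8_t emulate_cmd_high_byte(uint8_t device_val, uint16_t shadow_cmd,
                                     uint16_t emulate_mask)
{
    uint8_t mask_hi = (uint8_t)(emulate_mask >> 8);
    uint8_t shadow_hi = (uint8_t)(shadow_cmd >> 8);

    return (uint8_t)((device_val & (uint8_t)~mask_hi) | (shadow_hi & mask_hi));
}

/* Usage: emulate only bit 10 (INTx disable), or the whole register. */
void emulate_cmd_examples(void)
{
    /* Only bit 10 comes from the shadow copy, the rest from the device. */
    assert(emulate_cmd_dword(0x00100007, 0x0403, 0x0400) == 0x00100407);
    /* A mask of 0xffff behaves like the unconditional emulation of 2/5. */
    assert(emulate_cmd_dword(0x02100407, 0x0003, 0xffff) == 0x02100003);
    assert(emulate_cmd_high_byte(0x00, 0x0403, 0x0400) == 0x04);
}
```

With a mask of zero both helpers return the device value unchanged, which matches the note in the commit message that a zero-initialized mask means no emulation.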
[PATCH 5/5] pci-assign: Use PCI-2.3-based shared legacy interrupts
From: Jan Kiszka jan.kis...@siemens.com Enable the new KVM feature that allows legacy interrupt sharing for PCI-2.3-compliant devices. This requires to synchronize any guest change of the INTx mask bit to the kernel. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- hw/device-assignment.c | 38 +- qemu-kvm.c |8 qemu-kvm.h |3 +++ 3 files changed, 44 insertions(+), 5 deletions(-) diff --git a/hw/device-assignment.c b/hw/device-assignment.c index 26d3bd7..cf75c52 100644 --- a/hw/device-assignment.c +++ b/hw/device-assignment.c @@ -423,12 +423,21 @@ static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start) return 0; } +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn) +{ +return (uint32_t)seg 16 | (uint32_t)bus 8 | (uint32_t)devfn; +} + static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, uint32_t val, int len) { int fd; ssize_t ret; AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev); +struct kvm_assigned_pci_dev assigned_dev_data; +#ifdef KVM_CAP_PCI_2_3 +bool intx_masked, update_intx_mask; +#endif /* KVM_CAP_PCI_2_3 */ DEBUG((%x.%x): address=%04x val=0x%08x len=%d\n, ((d-devfn 3) 0x1F), (d-devfn 0x7), @@ -439,6 +448,26 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address, } if (ranges_overlap(address, len, PCI_COMMAND, 2)) { +#ifdef KVM_CAP_PCI_2_3 +update_intx_mask = false; +if (address == PCI_COMMAND+1) { +intx_masked = val (PCI_COMMAND_INTX_DISABLE 8); +update_intx_mask = true; +} else if (len = 2) { +intx_masked = val PCI_COMMAND_INTX_DISABLE; +update_intx_mask = true; +} +if (update_intx_mask) { +memset(assigned_dev_data, 0, sizeof(assigned_dev_data)); +assigned_dev_data.assigned_dev_id = +calc_assigned_dev_id(pci_dev-h_segnr, pci_dev-h_busnr, + pci_dev-h_devfn); +if (intx_masked) { +assigned_dev_data.flags = KVM_DEV_ASSIGN_MASK_INTX; +} +kvm_assign_set_intx_mask(kvm_context, assigned_dev_data); +} +#endif /* KVM_CAP_PCI_2_3 */ pci_default_write_config(d, 
address, val, len); /* Continue to program the card */ } @@ -876,11 +905,6 @@ static void free_assigned_device(AssignedDevice *dev) } } -static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn) -{ -return (uint32_t)seg 16 | (uint32_t)bus 8 | (uint32_t)devfn; -} - static void assign_failed_examine(AssignedDevice *dev) { char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {}, *ns; @@ -971,6 +995,10 @@ static int assign_device(AssignedDevice *dev) cause host memory corruption if the device issues DMA write requests!\n); } +#ifdef KVM_CAP_PCI_2_3 +assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3; +dev-emulate_cmd_mask |= PCI_COMMAND_INTX_DISABLE; +#endif /* KVM_CAP_PCI_2_3 */ r = kvm_assign_pci_device(kvm_context, assigned_dev_data); if (r 0) { diff --git a/qemu-kvm.c b/qemu-kvm.c index 471306b..8157b4f 100644 --- a/qemu-kvm.c +++ b/qemu-kvm.c @@ -740,6 +740,14 @@ int kvm_deassign_pci_device(kvm_context_t kvm, } #endif +#ifdef KVM_CAP_PCI_2_3 +int kvm_assign_set_intx_mask(kvm_context_t kvm, + struct kvm_assigned_pci_dev *assigned_dev) +{ +return kvm_vm_ioctl(kvm_state, KVM_ASSIGN_SET_INTX_MASK, assigned_dev); +} +#endif + int kvm_reinject_control(kvm_context_t kvm, int pit_reinject) { #ifdef KVM_CAP_REINJECT_CONTROL diff --git a/qemu-kvm.h b/qemu-kvm.h index 7e6edfb..522b1b2 100644 --- a/qemu-kvm.h +++ b/qemu-kvm.h @@ -602,6 +602,9 @@ int kvm_assign_set_msix_entry(kvm_context_t kvm, struct kvm_assigned_msix_entry *entry); #endif +int kvm_assign_set_intx_mask(kvm_context_t kvm, + struct kvm_assigned_pci_dev *assigned_dev); + #else /* !CONFIG_KVM */ typedef struct kvm_context *kvm_context_t; -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
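To make the write-side logic of this patch easier to follow, here is a small stand-alone sketch. It is an illustration under assumptions, not the patch itself: pack_assigned_dev_id mirrors the arithmetic of calc_assigned_dev_id, while intx_state_from_write and intx_examples are hypothetical names, and the sketch assumes naturally aligned config writes of 1, 2 or 4 bytes that already overlap PCI_COMMAND.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_COMMAND              0x04   /* 16-bit command register */
#define PCI_COMMAND_INTX_DISABLE 0x400  /* bit 10 of the command register */

/* Same packing as calc_assigned_dev_id() in the patch: segment, bus, devfn. */
static uint32_t pack_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
{
    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
}

/*
 * Hypothetical helper: returns 1 if the write sets the INTx disable bit,
 * 0 if it clears it, and -1 if the write does not reach the byte that
 * holds the bit (so no kernel update would be needed).
 */
static int intx_state_from_write(uint32_t address, uint32_t val, int len)
{
    assert(len == 1 || len == 2 || len == 4);

    if (address == PCI_COMMAND + 1) {
        /* Write starting at the high byte: bit 10 shows up as bit 2. */
        return (val & (PCI_COMMAND_INTX_DISABLE >> 8)) != 0;
    }
    if (address == PCI_COMMAND && len >= 2) {
        return (val & PCI_COMMAND_INTX_DISABLE) != 0;
    }
    return -1;
}

/* Usage examples. */
void intx_examples(void)
{
    assert(pack_assigned_dev_id(0, 4, 0x40) == 0x0440);
    assert(intx_state_from_write(PCI_COMMAND, 0x0400, 2) == 1);
    assert(intx_state_from_write(PCI_COMMAND + 1, 0x04, 1) == 1);
    assert(intx_state_from_write(PCI_COMMAND, 0xff, 1) == -1);
}
```

A byte write to the low half of the command register returns -1 here, which corresponds to the case where the patch leaves update_intx_mask false.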
Re: USB Passthrough 1.1 performance problem...
Am 12.12.2010 23:31, Erik Brakkee wrote:
Jan Kiszka wrote:

Are there some tuning parameters I can use or perhaps even kernel configuration parameters on the host to solve this?

Cheers
Erik

Host: Motherboard Supermicro X8DTi-F, Intel Xeon L5630, 12MB
      OS: Opensuse 11.3 64 bit
Guest: OS: Opensuse 11.3 64 bit

I can say now that I am giving up on getting this to work. One alternative was to use PCI passthrough of the USB hardware, but that didn't work for the USB that was on the motherboard. So I bought a USB PCI card and tried to use PCI passthrough for that. Unfortunately other problems occurred there. For one, the problem with 4K alignment. But I could fix that by using the pci=resource_alignment=... kernel parameter. In my grub/menu.lst it says:

kernel /vmlinuz-2.6.34.7-0.5-default root=/dev/hsystem/root quiet showopts intel_iommu=on pci=resource_alignment=01:04.0;01:04.1;01:04.2 noirqdebug vga=0x31a

The noirqdebug flag was needed to keep the host from disabling the IRQ (it was a shared IRQ). Using this, I could configure PCI passthrough and start the VM. Also the USB device showed up there. Only it did not work at all.

Here is a summary of my journey up until now: The original approach I wanted to use was to pass my old PCI card (WinTV PVR-500) to a VM. This card is a well supported card and has been doing fine for me. Because of the PCI passthrough problems with the wintv card, I decided to try a USB card instead. This gave me a 'ctrl buffer too small' issue that I could solve by taking the source RPM for kvm and applying a known patch from red hat (increasing buffer size from 2048 to 8192). But then I got jerky video, probably due to USB 1.1 issues. To bypass these I could use PCI passthrough for USB. But with the PCI passthrough of this card I am again running into issues probably related to Shared IRQs.

So, after all this I am back to square one. I have now modified my approach so instead of running a separate minimal host with my old server as a guest, I am now running the old server (same install) on the new hardware, using it as a host.

I would definitely be interested in trying this out further in the future. I even tried Xen for a brief moment, only to realize that my host and guest felt slower (slower startup and execution) and much more difficult to handle.

From the experience of the last two days fulltime trying to get things working I can only conclude that the following two features would be really important to have:

* Extended PCI passthrough support
  o shared IRQ support

Addressed by the series I sent out today.

Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500 might work now? What version is this and where can I get this for opensuse?

Currently you have to clone my git trees [1, 2], then build and install those to have the feature. Will take a while to see it in releases, and after that also Opensuse packages.

Jan

[1] git://git.kiszka.org/linux-kvm.git queues/dev-assign
[2] git://git.kiszka.org/qemu-kvm.git queues/dev-assign
Re: USB Passthrough 1.1 performance problem...
2010/12/12 Erik Brakkee e...@brakkee.org:
> Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500
> might work now?

Passthrough of a PVR-500 has been working for a long time. I've been running with passthrough of a PVR-500 in my HTPC, since November/December 2009...so it should work with any recent kernel and any recent version of qemu-kvm you can find today - No patching needed. The only issue I had with the PVR-500 card, was when *I* didn't free up the shared interrupts...once I fixed that, it just worked.

On the other hand, I've never had success with passthrough of USB. I've spent a bunch of time trying to get various USB cards to work with passthrough, I even purchased 3 USB cards, just to test USB passthrough with different brands, interfaces and versions (PCI, PCIe, USB 2.0, USB 3.0, etc). I gave up on that 5 months ago - http://article.gmane.org/gmane.comp.emulators.kvm.devel/56719

> What version is this and where can I get this for opensuse?

I can't remember if I started out with the PVR-500 card with 0.11 or 0.12 ...I think it was 0.11...but anyway, you'll hopefully not run with such an old version today, so any version should work.

Best regards
Kenni
Re: [PATCH 2/5] pci-assign: Fix dword read at PCI_COMMAND
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
> From: Jan Kiszka jan.kis...@siemens.com
>
> If we emulate the command register, we must only read its content from
> the shadow config space. For dword read of both PCI_COMMAND and
> PCI_STATUS, at least the latter must be read from the device.
>
> For simplicity reasons and as the code path is not considered
> performance critical for the affected SRIOV devices, the fix performs
> device access to the command word unconditionally, even if emulation is
> enabled and only that word is read.
>
> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
> ---
>  hw/device-assignment.c |   14 +++++++++++---
>  1 files changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index bc3a57b..6ff1456 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -494,14 +494,11 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
>      /*
>       * Catch access to
>       *  - vendor & device ID
> -     *  - command register (if emulation needed)
>       *  - base address registers
>       *  - ROM base address & capability pointer
>       *  - interrupt line & pin
>       */
>      if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
> -        (pci_dev->need_emulate_cmd &&
> -         ranges_overlap(address, len, PCI_COMMAND, 2)) ||
>          ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
>          ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
>          ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
> @@ -533,6 +530,17 @@ do_log:
>      DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>            (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
>
> +    if (pci_dev->need_emulate_cmd &&
> +        ranges_overlap(address, len, PCI_COMMAND, 2)) {
> +        if (address == PCI_COMMAND) {
> +            val &= 0xffff0000;
> +            val |= pci_default_read_config(d, PCI_COMMAND, 2);
> +        } else {
> +            /* high-byte access */
> +            val = pci_default_read_config(d, PCI_COMMAND+1, 1);
> +        }
> +    }
> +
>      if (!pci_dev->cap.available) {
>          /* kill the special capabilities */
>          if (address == PCI_COMMAND && len == 4) {

We might be able to use the merge_bits function that I just added for
capability support, perhaps something like:

    if (pci_dev->need_emulate_cmd) {
        val = merge_bits(val, pci_default_read_config(d, address, len),
                         PCI_COMMAND, 0xffff);
    }
Re: [PATCH 1/5] pci-assign: Clean up assigned_dev_pci_read/write_config
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
> From: Jan Kiszka jan.kis...@siemens.com
>
> Use ranges_overlap and proper constants to match the access range
> against regions that need special handling. This also fixes yet
> uncaught high-byte write access to the command register. Moreover, use
> more constants instead of magic numbers.
>
> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
> ---
>  hw/device-assignment.c |   39 +++++++++++++++++++++++++++++----------
>  1 files changed, 29 insertions(+), 10 deletions(-)

A long overdue cleanup, looks good.

Acked-by: Alex Williamson alex.william...@redhat.com

> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 50c6408..bc3a57b 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -438,13 +438,20 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>          return assigned_device_pci_cap_write_config(d, address, val, len);
>      }
>
> -    if (address == 0x4) {
> +    if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
>          pci_default_write_config(d, address, val, len);
>          /* Continue to program the card */
>      }
>
> -    if ((address >= 0x10 && address <= 0x24) || address == 0x30 ||
> -        address == 0x34 || address == 0x3c || address == 0x3d) {
> +    /*
> +     * Catch access to
> +     *  - base address registers
> +     *  - ROM base address & capability pointer
> +     *  - interrupt line & pin
> +     */
> +    if (ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
> +        ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
> +        ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
>          /* used for update-mappings (BAR emulation) */
>          pci_default_write_config(d, address, val, len);
>          return;
> @@ -484,9 +491,20 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
>          return val;
>      }
>
> -    if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
> -        (address >= 0x10 && address <= 0x24) || address == 0x30 ||
> -        address == 0x34 || address == 0x3c || address == 0x3d) {
> +    /*
> +     * Catch access to
> +     *  - vendor & device ID
> +     *  - command register (if emulation needed)
> +     *  - base address registers
> +     *  - ROM base address & capability pointer
> +     *  - interrupt line & pin
> +     */
> +    if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
> +        (pci_dev->need_emulate_cmd &&
> +         ranges_overlap(address, len, PCI_COMMAND, 2)) ||
> +        ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
> +        ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
> +        ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
>          val = pci_default_read_config(d, address, len);
>          DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>                (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
> @@ -517,10 +535,11 @@ do_log:
>
>      if (!pci_dev->cap.available) {
>          /* kill the special capabilities */
> -        if (address == 4 && len == 4)
> -            val &= ~0x100000;
> -        else if (address == 6)
> -            val &= ~0x10;
> +        if (address == PCI_COMMAND && len == 4) {
> +            val &= ~(PCI_STATUS_CAP_LIST << 16);
> +        } else if (address == PCI_STATUS) {
> +            val &= ~PCI_STATUS_CAP_LIST;
> +        }
>      }
>
>      return val;
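The reason a range check catches accesses that the old equality tests missed can be shown with a few lines of stand-alone C. This is a sketch, not QEMU's helper: overlaps and overlap_examples are hypothetical names, both lengths are assumed to be non-zero, and the offsets are the standard PCI config header offsets the patch refers to by name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standard PCI config header offsets used by the patch. */
#define PCI_COMMAND        0x04
#define PCI_BASE_ADDRESS_0 0x10
#define PCI_ROM_ADDRESS    0x30
#define PCI_INTERRUPT_LINE 0x3c

/*
 * Hypothetical stand-in for a range overlap test: true if the byte ranges
 * [first1, first1 + len1) and [first2, first2 + len2) share any byte.
 */
static bool overlaps(uint64_t first1, uint64_t len1,
                     uint64_t first2, uint64_t len2)
{
    assert(len1 > 0 && len2 > 0);

    uint64_t last1 = first1 + len1 - 1;
    uint64_t last2 = first2 + len2 - 1;

    return !(last2 < first1 || last1 < first2);
}

/* Usage: the cases the commit message talks about. */
void overlap_examples(void)
{
    /* Byte write to the command register's high byte (0x05): the old
     * "address == 0x4" test missed it, the range test does not. */
    assert(overlaps(0x05, 1, PCI_COMMAND, 2));
    /* PCI_STATUS at 0x06 is outside the two-byte command register. */
    assert(!overlaps(0x06, 2, PCI_COMMAND, 2));
    /* Six 4-byte BARs: 0x10 through 0x27. */
    assert(overlaps(0x24, 4, PCI_BASE_ADDRESS_0, 24));
    /* ROM address plus capability pointer: 0x30 through 0x37. */
    assert(overlaps(0x34, 1, PCI_ROM_ADDRESS, 8));
    /* Interrupt line and pin: 0x3c and 0x3d. */
    assert(overlaps(0x3d, 1, PCI_INTERRUPT_LINE, 2));
}
```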
Re: [PATCH 3/5] pci-assign: Remove suspicious hunk from assigned_dev_pci_read_config
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
> From: Jan Kiszka jan.kis...@siemens.com
>
> No one can remember where this came from, and it looks very hacky
> anyway (we return 0 for config space address 0xfc of _every_ assigned
> device, not only vga as the comment claims). So better remove it and
> wait for the underlying issue to reappear.
>
> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
> ---
>  hw/device-assignment.c |    5 -----
>  1 files changed, 0 insertions(+), 5 deletions(-)

Yay!

Acked-by: Alex Williamson alex.william...@redhat.com

> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 6ff1456..ef045f4 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -508,10 +508,6 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
>          return val;
>      }
>
> -    /* vga specific, remove later */
> -    if (address == 0xFC)
> -        goto do_log;
> -
>      fd = pci_dev->real_device.config_fd;
>
>  again:
> @@ -526,7 +522,6 @@ again:
>          exit(1);
>      }
>
> -do_log:
>      DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>            (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
>
Re: [PATCH 4/5] pci-assign: Convert need_emulate_cmd into a bitmask
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
> From: Jan Kiszka jan.kis...@siemens.com
>
> Define a mask of PCI command register bits that need to be emulated,
> i.e. read back from their shadow state. We will need this for
> selectively emulating the INTx mask bit.
>
> Note: No initialization of emulate_cmd_mask to zero needed, the device
> state is already zero-initialized.
>
> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
> ---
>  hw/device-assignment.c |   18 ++++++++++--------
>  hw/device-assignment.h |    2 +-
>  2 files changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index ef045f4..26d3bd7 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -525,14 +525,17 @@ again:
>      DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>            (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
>
> -    if (pci_dev->need_emulate_cmd &&
> +    if (pci_dev->emulate_cmd_mask &&
>          ranges_overlap(address, len, PCI_COMMAND, 2)) {
>          if (address == PCI_COMMAND) {
> -            val &= 0xffff0000;
> -            val |= pci_default_read_config(d, PCI_COMMAND, 2);
> +            val &= ~pci_dev->emulate_cmd_mask;
> +            val |= pci_default_read_config(d, PCI_COMMAND, 2) &
> +                   pci_dev->emulate_cmd_mask;
>          } else {
>              /* high-byte access */
> -            val = pci_default_read_config(d, PCI_COMMAND+1, 1);
> +            val &= ~(pci_dev->emulate_cmd_mask >> 8);
> +            val |= pci_default_read_config(d, PCI_COMMAND+1, 1) &
> +                   (pci_dev->emulate_cmd_mask >> 8);
>          }
>      }
>

We should definitely be using merge_bits here, this is the sort of thing
I had in mind for it:

    val = merge_bits(val, pci_default_read_config(d, address, len),
                     PCI_COMMAND, pci_dev->emulate_cmd_mask);

> @@ -800,10 +803,9 @@ again:
>
>      /* dealing with virtual function device */
>      snprintf(name, sizeof(name), "%sphysfn/", dir);
> -    if (!stat(name, &statbuf))
> -        pci_dev->need_emulate_cmd = 1;
> -    else
> -        pci_dev->need_emulate_cmd = 0;
> +    if (!stat(name, &statbuf)) {
> +        pci_dev->emulate_cmd_mask = 0xffff;
> +    }
>
>      dev->region_number = r;
>      return 0;
> diff --git a/hw/device-assignment.h b/hw/device-assignment.h
> index c94a730..9ead022 100644
> --- a/hw/device-assignment.h
> +++ b/hw/device-assignment.h
> @@ -109,7 +109,7 @@ typedef struct AssignedDevice {
>      void *msix_table_page;
>      target_phys_addr_t msix_table_addr;
>      int mmio_index;
> -    int need_emulate_cmd;
> +    uint32_t emulate_cmd_mask;
>      char *configfd_name;
>      QLIST_ENTRY(AssignedDevice) next;
>  } AssignedDevice;
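For readers who have not seen merge_bits, the general idea of such a helper can be sketched as follows. This is an assumption-laden illustration and not the function from qemu-kvm: merge_emulated and merge_examples are hypothetical names, the argument list (which passes the access address and length explicitly) is my own choice and differs from the four-argument call quoted above, and the mask is taken to be relative to the register at offset pos.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of a masked merge. val was read from the device and
 * mval from the emulated config space, both for the same access at
 * [addr, addr + len). mask selects bits of the register that starts at
 * pos; those bits are taken from mval, everything else from val.
 */
static uint32_t merge_emulated(uint32_t val, uint32_t mval, uint8_t addr,
                               int len, uint8_t pos, uint32_t mask)
{
    assert(len == 1 || len == 2 || len == 4);

    /* The access does not touch the 4-byte window starting at pos. */
    if (addr + len <= pos || pos + 4 <= addr) {
        return val;
    }

    /* Re-align the mask from register-relative to access-relative bits. */
    if (addr >= pos) {
        mask >>= (addr - pos) * 8;
    } else {
        mask <<= (pos - addr) * 8;
    }
    /* Drop mask bits beyond the width of the access. */
    mask &= 0xffffffffu >> ((4 - len) * 8);

    return (val & ~mask) | (mval & mask);
}

/* Usage examples with PCI_COMMAND at 0x04 and PCI_STATUS at 0x06. */
void merge_examples(void)
{
    /* Dword read at 0x04, whole command word emulated. */
    assert(merge_emulated(0x12345678, 0x0000abcd, 4, 4, 4, 0xffff)
           == 0x1234abcd);
    /* Byte read of the command high byte, only bit 10 emulated. */
    assert(merge_emulated(0x00, 0x04, 5, 1, 4, 0x0400) == 0x04);
    /* Dword read at 0x04, bit 4 of the status register at 0x06 emulated. */
    assert(merge_emulated(0x00100007, 0x00000000, 4, 4, 6, 0x0010)
           == 0x00000007);
}
```

The shift step is what lets one helper serve both the dword case and the high-byte case that the patch handles in separate branches.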
KVM call agenda for Dec 14
Please send in any agenda items you are interested in covering.

thanks,
-chris
Re: [PATCH 5/5] pci-assign: Use PCI-2.3-based shared legacy interrupts
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
> From: Jan Kiszka jan.kis...@siemens.com
>
> Enable the new KVM feature that allows legacy interrupt sharing for
> PCI-2.3-compliant devices. This requires to synchronize any guest
> change of the INTx mask bit to the kernel.
>
> Signed-off-by: Jan Kiszka jan.kis...@siemens.com
> ---
>  hw/device-assignment.c |   38 +++++++++++++++++++++++++++++++++-----
>  qemu-kvm.c             |    8 ++++++++
>  qemu-kvm.h             |    3 +++
>  3 files changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 26d3bd7..cf75c52 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -423,12 +423,21 @@ static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
>      return 0;
>  }
>
> +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
> +{
> +    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
> +}
> +
>  static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>                                            uint32_t val, int len)
>  {
>      int fd;
>      ssize_t ret;
>      AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
> +    struct kvm_assigned_pci_dev assigned_dev_data;
> +#ifdef KVM_CAP_PCI_2_3
> +    bool intx_masked, update_intx_mask;
> +#endif /* KVM_CAP_PCI_2_3 */
>
>      DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>            ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
> @@ -439,6 +448,26 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>      }
>
>      if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
> +#ifdef KVM_CAP_PCI_2_3
> +        update_intx_mask = false;
> +        if (address == PCI_COMMAND+1) {
> +            intx_masked = val & (PCI_COMMAND_INTX_DISABLE >> 8);
> +            update_intx_mask = true;
> +        } else if (len >= 2) {
> +            intx_masked = val & PCI_COMMAND_INTX_DISABLE;
> +            update_intx_mask = true;
> +        }

I wonder if this might be a little cleaner as something like this.

    if (ranges_overlap(address, len, PCI_COMMAND + 1, 1)) {
        update_intx_mask = true;
        intx_masked = (len == 1 ? val << 8 : val) & PCI_COMMAND_INTX_DISABLE;
    }

> +        if (update_intx_mask) {
> +            memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
> +            assigned_dev_data.assigned_dev_id =
> +                calc_assigned_dev_id(pci_dev->h_segnr, pci_dev->h_busnr,
> +                                     pci_dev->h_devfn);
> +            if (intx_masked) {
> +                assigned_dev_data.flags = KVM_DEV_ASSIGN_MASK_INTX;
> +            }
> +            kvm_assign_set_intx_mask(kvm_context, &assigned_dev_data);
> +        }
> +#endif /* KVM_CAP_PCI_2_3 */
>          pci_default_write_config(d, address, val, len);
>          /* Continue to program the card */
>      }
> @@ -876,11 +905,6 @@ static void free_assigned_device(AssignedDevice *dev)
>      }
>  }
>
> -static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
> -{
> -    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
> -}
> -
>  static void assign_failed_examine(AssignedDevice *dev)
>  {
>      char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {}, *ns;
> @@ -971,6 +995,10 @@ static int assign_device(AssignedDevice *dev)
>                  "cause host memory corruption if the device issues DMA write "
>                  "requests!\n");
>      }
> +#ifdef KVM_CAP_PCI_2_3
> +    assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
> +    dev->emulate_cmd_mask |= PCI_COMMAND_INTX_DISABLE;
> +#endif /* KVM_CAP_PCI_2_3 */
>
>      r = kvm_assign_pci_device(kvm_context, &assigned_dev_data);
>      if (r < 0) {
> diff --git a/qemu-kvm.c b/qemu-kvm.c
> index 471306b..8157b4f 100644
> --- a/qemu-kvm.c
> +++ b/qemu-kvm.c
> @@ -740,6 +740,14 @@ int kvm_deassign_pci_device(kvm_context_t kvm,
>  }
>  #endif
>
> +#ifdef KVM_CAP_PCI_2_3
> +int kvm_assign_set_intx_mask(kvm_context_t kvm,
> +                             struct kvm_assigned_pci_dev *assigned_dev)
> +{
> +    return kvm_vm_ioctl(kvm_state, KVM_ASSIGN_SET_INTX_MASK, assigned_dev);
> +}
> +#endif
> +
>  int kvm_reinject_control(kvm_context_t kvm, int pit_reinject)
>  {
>  #ifdef KVM_CAP_REINJECT_CONTROL
> diff --git a/qemu-kvm.h b/qemu-kvm.h
> index 7e6edfb..522b1b2 100644
> --- a/qemu-kvm.h
> +++ b/qemu-kvm.h
> @@ -602,6 +602,9 @@ int kvm_assign_set_msix_entry(kvm_context_t kvm,
>                                struct kvm_assigned_msix_entry *entry);
>  #endif
>
> +int kvm_assign_set_intx_mask(kvm_context_t kvm,
> +                             struct kvm_assigned_pci_dev *assigned_dev);
> +
>  #else /* !CONFIG_KVM */
>
>  typedef struct kvm_context *kvm_context_t;
Re: USB Passthrough 1.1 performance problem...
2010/12/14 Erik Brakkee e...@brakkee.org: From: Kenni Lund ke...@kelu.dk Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500 might work now? Passthrough of a PVR-500 has been working for a long time. I've been running with passthrough of a PVR-500 in my HTPC, since November/December 2009...so it should work with any recent kernel and any recent version of qemu-kvm you can find today - No patching needed. The only issue I had with the PVR-500 card, was when *I* didn't free up the shared interrupts...once I fixed that, it just worked. How did you free up those shared interrupts then? I tried different slots but always get conflicts with the USB irqs. I did an unbind of the conflicting device (eg. disabled it). I moved the PVR-500 card around in the different slots and once I got a conflict with the integrated sound card, I left the PVR-500 card in that slot (it's a headless machine, so no need for sound) and configured unbind of the sound card at boot time. On my old system I think it was conflicting with one of the USB controllers as well, but it didn't really matter, as I only lost a few of the ports on the back of the computer for that particular USB controller - I still had plenty of USB ports left and if I really needed more ports, I could just plug in an extra USB PCI card. 
My /etc/rc.local boot script looks like the following today:

--
#Remove HDA conflicting with ivtv1
echo 0000:00:1b.0 > /sys/bus/pci/drivers/HDA\ Intel/unbind

# ivtv0
echo "4444 0016" > /sys/bus/pci/drivers/pci-stub/new_id
echo 0000:04:08.0 > /sys/bus/pci/drivers/ivtv/unbind
echo 0000:04:08.0 > /sys/bus/pci/drivers/pci-stub/bind
echo "4444 0016" > /sys/bus/pci/drivers/pci-stub/remove_id

# ivtv1
echo "4444 0016" > /sys/bus/pci/drivers/pci-stub/new_id
echo 0000:04:09.0 > /sys/bus/pci/drivers/ivtv/unbind
echo 0000:04:09.0 > /sys/bus/pci/drivers/pci-stub/bind
echo "4444 0016" > /sys/bus/pci/drivers/pci-stub/remove_id
--

Best regards
Kenni
Re: [PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable
On 12/13/2010 06:32 PM, Avi Kivity wrote:
> On 12/13/2010 12:31 PM, Xiao Guangrong wrote:
>> Currently, if the page is not allowed to write, then it can drop
>> ACC_WRITE_MASK in pte_access, and the direct sp's access is:
>>     gw->pt_access & gw->pte_access
>> so, it also removes the write access in the direct sp.
>>
>> There is a problem: if the access of those pages which are mapped
>> through the same mapping in the guest is different in the host, it
>> causes the host to switch the direct sp very frequently.
>
> I just sent a patch to fix this in a different way, please review it.

Your patch is good for me, please ignore this one :-)

Umm, do we need to move "access &= ~ACC_WRITE_MASK" into set_spte() so that
we can remove the same code in the caller?
Re: [Qemu-devel] [PATCH] RFC: delay pci_update_mappings for 64-bit BARs
On Mon, Dec 13, 2010 at 03:43:44PM -0700, Cam Macdonell wrote: Do not call pci_update_mappings on the lower 32-bits of a 64-bit bar. Wait for the upper 32 or else Qemu will try to map on just the lower 32 which is probably going to corrupt memory. I was encountering crashes when mapping certain PCI region sizes. The problem turns out that pci_update_mappings is being called without all 64-bits in the BAR. For example when mapping to 0x18000, once the lower 32-bits were written the remapping happened (mapping to 0x800) which would overwrite something. I'm not certain if this is completely correct, I'm simply testing the lower 4-bits to only be MEM_TYPE_64 flag. Upper 32-bit address parts can be values like 0xff which is tricky to test against. You're assuming that guest OS always write lower 32bit and them upper 32bit. Is the assumption correct? I found Linux does, but I don't know about other OSes. And I couldn't find any sentence about how to update (64bit) BAR in the specs. (Please correct me if I missed it) Some work around would be necessary regardless of 32bit-or-64bit. because qemu doesn't emulate bus accurately at the moment. How about the followings? If BAR overlaps with RAM, don't map BAR. If BAR overlaps with other BARs, record the overlapping and when updating one of the BARs, update all the overlapping BARs. Which BAR wins depends on the order of updating, it doesn't matter because it's anomaly case. This way, 32bit BAR case is also covered. 
thanks, Cam --- hw/pci.c |5 - 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/hw/pci.c b/hw/pci.c index 438c0d1..3b81792 100644 --- a/hw/pci.c +++ b/hw/pci.c @@ -1000,6 +1000,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) { int i, was_irq_disabled = pci_irq_disabled(d); uint32_t config_size = pci_config_size(d); +int is_64 = 0; + +is_64 = ((val 0xf) == PCI_BASE_ADDRESS_MEM_TYPE_64); for (i = 0; i l addr + i config_size; val = 8, ++i) { uint8_t wmask = d-wmask[addr + i]; @@ -1008,7 +1011,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l) d-config[addr + i] = (d-config[addr + i] ~wmask) | (val wmask); d-config[addr + i] = ~(val w1cmask); /* W1C: Write 1 to Clear */ } -if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) || +if ((ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) (!is_64)) || ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) || ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) || range_covers_byte(addr, l, PCI_COMMAND)) -- 1.7.0.4 -- yamahata -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
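As a side note to the discussion above, the type bits in the low dword of a memory BAR can be decoded as in the following sketch. It is an illustration only, not a proposed fix: bar_is_64bit_mem, bar64_address and bar_examples are hypothetical names, the constants carry the values used in the Linux pci_regs.h header, and the sketch decodes the current contents of the BAR rather than the value being written by the guest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low bits of a BAR, with the values used in Linux's pci_regs.h. */
#define PCI_BASE_ADDRESS_SPACE_IO      0x01  /* bit 0: I/O space BAR */
#define PCI_BASE_ADDRESS_MEM_TYPE_MASK 0x06  /* bits 2:1: memory type */
#define PCI_BASE_ADDRESS_MEM_TYPE_64   0x04  /* 64-bit memory BAR */

/* Hypothetical helper: is this the low half of a 64-bit memory BAR? */
static bool bar_is_64bit_mem(uint32_t bar_low)
{
    if (bar_low & PCI_BASE_ADDRESS_SPACE_IO) {
        return false;
    }
    return (bar_low & PCI_BASE_ADDRESS_MEM_TYPE_MASK)
           == PCI_BASE_ADDRESS_MEM_TYPE_64;
}

/* Hypothetical helper: full address once both halves are known. */
static uint64_t bar64_address(uint32_t bar_low, uint32_t bar_high)
{
    assert(bar_is_64bit_mem(bar_low));

    /* The low four bits are flag bits, not address bits. */
    return ((uint64_t)bar_high << 32) | (bar_low & ~(uint32_t)0xf);
}

/* Usage examples. */
void bar_examples(void)
{
    assert(bar_is_64bit_mem(0x00000004));   /* 64-bit, not prefetchable */
    assert(bar_is_64bit_mem(0x0000000c));   /* 64-bit, prefetchable */
    assert(!bar_is_64bit_mem(0x00000001));  /* I/O BAR */
    assert(bar64_address(0xf000000c, 0x2) == 0x2f0000000ULL);
}
```

Masking with 0x06 instead of comparing the whole low nibble also matches prefetchable 64-bit BARs, whose low nibble reads 0xc.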
[RFC -v2 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu
Keep track of which task is running a KVM vcpu. This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single run of the vcpu.

Signed-off-by: Rik van Riel r...@redhat.com
---
- move vcpu->task manipulation as suggested by Chris Wright

 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..180085b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
 	int vcpu_id;
 	struct mutex mutex;
+	struct task_struct *task;
 	int cpu;
 	atomic_t guest_mode;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..c95bad1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,6 +2248,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	vcpu->task = NULL;
 	kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -2256,6 +2257,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+	vcpu->task = current;
 	kvm_arch_vcpu_put(vcpu);
 }
 
[RFC -v2 PATCH 2/3] sched: add yield_to function
Add a yield_to function to the scheduler code, allowing us to
give the remainder of our timeslice to another thread.

We may want to use this to provide a sys_yield_to system call
one day.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

---
- move to a per sched class yield_to
- fix the locking

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..408326f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1086,6 +1086,8 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
+
+	void (*yield_to) (struct rq *rq, struct task_struct *p);
 };

 struct load_weight {
@@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);
+extern void requeue_task(struct rq *rq, struct task_struct *p);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
@@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
 				unsigned long clone_flags);
+
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+
 #ifdef CONFIG_SMP
  extern void kick_process(struct task_struct *tsk);
 #else
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..6399641 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5166,6 +5166,46 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
 	return ret;
 }

+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+	unsigned long flags;
+	struct rq *rq, *p_rq;
+
+	local_irq_save(flags);
+	rq = this_rq();
+again:
+	p_rq = task_rq(p);
+	double_rq_lock(rq, p_rq);
+	if (p_rq != task_rq(p)) {
+		double_rq_unlock(rq, p_rq);
+		goto again;
+	}
+
+	/* We can't yield to a process that doesn't want to run. */
+	if (!p->se.on_rq)
+		goto out;
+
+	/*
+	 * We can only yield to a runnable task, in the same schedule class
+	 * as the current task, if the schedule class implements yield_to_task.
+	 */
+	if (!task_running(rq, p) && current->sched_class == p->sched_class &&
+			current->sched_class->yield_to)
+		current->sched_class->yield_to(rq, p);
+
+out:
+	double_rq_unlock(rq, p_rq);
+	local_irq_restore(flags);
+	yield();
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 00ebd76..d8c4116 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -980,6 +980,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
  * CFS operations on tasks:
  */

+u64 slice_remain(struct task_struct *p)
+{
+	unsigned long flags;
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq;
+	struct rq *rq;
+	u64 slice, ran;
+	s64 delta;
+
+	rq = task_rq_lock(p, &flags);
+	cfs_rq = cfs_rq_of(se);
+	slice = sched_slice(cfs_rq, se);
+	ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+	delta = slice - ran;
+	task_rq_unlock(rq, &flags);
+
+	return max(delta, 0LL);
+}
+
 #ifdef CONFIG_SCHED_HRTICK
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
@@ -1126,6 +1145,20 @@ static void yield_task_fair(struct rq *rq)
 	se->vruntime = rightmost->vruntime + 1;
 }

+static void yield_to_fair(struct rq *rq, struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 remain = slice_remain(current);
+
+	dequeue_task(rq, p, 0);
+	se->vruntime -= remain;
+	if (se->vruntime < cfs_rq->min_vruntime)
+		se->vruntime = cfs_rq->min_vruntime;
+	enqueue_task(rq, p, 0);
+	check_preempt_curr(rq, p, 0);
+}
+
 #ifdef CONFIG_SMP

 static void task_waking_fair(struct rq *rq, struct task_struct *p)
@@ -3962,6 +3995,8 @@ static const struct sched_class fair_sched_class = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	.task_move_group	= task_move_group_fair,
 #endif
+
+	.yield_to		= yield_to_fair,
 };

 #ifdef CONFIG_SCHED_DEBUG
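The arithmetic in slice_remain() and yield_to_fair() can be tried outside the kernel. What follows is only a user-space sketch, not kernel code: struct model_entity, model_slice_remain(), model_yield_to() and model_example() are hypothetical names, the slice length is passed in by hand because sched_slice() depends on runqueue load, and the dequeue/enqueue and preemption check are left out. The numbers in the example are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the sched_entity fields the patch touches. */
struct model_entity {
	uint64_t vruntime;              /* virtual runtime, in ns */
	uint64_t sum_exec_runtime;      /* total CPU time consumed, in ns */
	uint64_t prev_sum_exec_runtime; /* the above, sampled when last scheduled in */
};

/*
 * Same arithmetic as slice_remain(): the slice minus what has already
 * been used, clamped at zero. The slice is an assumption supplied by
 * the caller; the real code asks sched_slice() for it.
 */
static uint64_t model_slice_remain(const struct model_entity *se, uint64_t slice)
{
	uint64_t ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
	int64_t delta = (int64_t)(slice - ran);

	return delta > 0 ? (uint64_t)delta : 0;
}

/*
 * Same arithmetic as yield_to_fair(): credit the target with the
 * yielder's remaining slice, then clamp at the queue's min_vruntime.
 */
static void model_yield_to(struct model_entity *target, uint64_t remain,
			   uint64_t min_vruntime)
{
	target->vruntime -= remain;
	if (target->vruntime < min_vruntime)
		target->vruntime = min_vruntime;
}

/* Worked example with made-up numbers, all in ns. */
static void model_example(void)
{
	struct model_entity yielder = {
		.sum_exec_runtime = 5000000,
		.prev_sum_exec_runtime = 3000000,  /* so it has run for 2 ms */
	};
	struct model_entity target = { .vruntime = 10000000 };
	uint64_t remain = model_slice_remain(&yielder, 6000000);

	assert(remain == 4000000);          /* 6 ms slice, 2 ms used */
	model_yield_to(&target, remain, 7000000);
	assert(target.vruntime == 7000000); /* 10 ms - 4 ms = 6 ms, clamped to 7 ms */
}
```

The values are chosen so that the unsigned subtraction does not wrap; the sketch does not try to model what happens when remain is larger than the target's vruntime.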
[RFC -v2 PATCH 0/3] directed yield for Pause Loop Exiting
When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got
preempted, hopefully the lock holder.

v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)

-- 
All rights reversed.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC -v2 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to hand
the rest of our timeslice to another vcpu in the same KVM guest.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 180085b..af11701 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -92,6 +92,7 @@ struct kvm_vcpu {
 	int fpu_active;
 	int guest_fpu_loaded, guest_xcr0_loaded;
 	wait_queue_head_t wq;
+	int spinning;
 	int sigset_active;
 	sigset_t sigset;
 	struct kvm_vcpu_stat stat;
@@ -187,6 +188,7 @@ struct kvm {
 #endif
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
 	atomic_t online_vcpus;
+	int last_boosted_vcpu;
 	struct list_head vm_list;
 	struct mutex lock;
 	struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c95bad1..17c6c25 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1289,18 +1289,50 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);

-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-	ktime_t expires;
-	DEFINE_WAIT(wait);
+	struct kvm *kvm = me->kvm;
+	struct kvm_vcpu *vcpu;
+	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int yielded = 0;
+	int pass;
+	int i;

-	prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+	me->spinning = 1;

-	/* Sleep for 100 us, and hope lock-holder got scheduled */
-	expires = ktime_add_ns(ktime_get(), 100000UL);
-	schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+	/*
+	 * We boost the priority of a VCPU that is runnable but not
+	 * currently running, because it got preempted by something
+	 * else and called schedule in __vcpu_run.  Hopefully that
+	 * VCPU is holding the lock that we need and will release it.
+	 * We approximate round-robin by starting at the last boosted VCPU.
+	 */
+	for (pass = 0; pass < 2 && !yielded; pass++) {
+		kvm_for_each_vcpu(i, vcpu, kvm) {
+			struct task_struct *task = vcpu->task;
+			if (!pass && i < last_boosted_vcpu) {
+				i = last_boosted_vcpu;
+				continue;
+			} else if (pass && i > last_boosted_vcpu)
+				break;
+			if (vcpu == me)
+				continue;
+			if (vcpu->spinning)
+				continue;
+			if (!task)
+				continue;
+			if (waitqueue_active(&vcpu->wq))
+				continue;
+			if (task->flags & PF_VCPU)
+				continue;
+			kvm->last_boosted_vcpu = i;
+			yielded = 1;
+			yield_to(task);
+			break;
+		}
+	}

-	finish_wait(&vcpu->wq, &wait);
+	me->spinning = 0;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);

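The candidate selection in kvm_vcpu_on_spin can be illustrated in user space. This is a simplified sketch of the intent, not the kernel loop: struct model_vcpu, eligible(), pick_boost_target() and pick_example() are hypothetical names, the four booleans stand in for the vcpu->spinning, vcpu->task, waitqueue_active() and PF_VCPU checks, and the scan always begins at the index after the last boosted VCPU and wraps around once, which is an assumption about the round-robin the comment describes rather than an exact copy of the two-pass iteration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-VCPU state; field names are illustrative only. */
struct model_vcpu {
	bool spinning;  /* itself inside the on-spin handler */
	bool has_task;  /* a host thread is attached to this VCPU */
	bool blocked;   /* sleeping on its wait queue */
	bool in_guest;  /* currently executing guest code */
};

/* A VCPU is worth boosting only if it is runnable but not running. */
static bool eligible(const struct model_vcpu *v)
{
	return v->has_task && !v->spinning && !v->blocked && !v->in_guest;
}

/*
 * Return the index of the VCPU to boost, or -1 if none qualifies.
 * Indices last_boosted + 1 .. n - 1 are tried first, then 0 .. last_boosted.
 * The caller is expected to pass 0 <= last_boosted < n.
 */
static int pick_boost_target(const struct model_vcpu *vcpus, int n,
			     int me, int last_boosted)
{
	int step;

	for (step = 1; step <= n; step++) {
		int i = (last_boosted + step) % n;

		if (i != me && eligible(&vcpus[i]))
			return i;
	}
	return -1;
}

/* VCPU 0 spins, 1 is running guest code, 2 was preempted, 3 is halted. */
static void pick_example(void)
{
	struct model_vcpu vcpus[4] = {
		{ .spinning = true, .has_task = true },
		{ .has_task = true, .in_guest = true },
		{ .has_task = true },
		{ .has_task = true, .blocked = true },
	};

	/* Only VCPU 2 is runnable but not running, so it is the one chosen. */
	assert(pick_boost_target(vcpus, 4, 0, 3) == 2);
	assert(pick_boost_target(vcpus, 4, 0, 2) == 2);
}
```

In the patch the chosen index is also written back to kvm->last_boosted_vcpu before yield_to() is called, so the next spinner starts its scan from a different place.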
Re: [GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations
On Tue, 14 Dec 2010 03:54:47 am Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
> > Please merge the following tree for 2.6.38.
> > Thanks!
>
> Um, I sent this out before I noticed the mail from Rusty with some
> questions on the test code. I missed that and assumed no comments -
> no issues, perhaps wrongly.
>
> Rusty - I tried answering the questions there - any issues with
> merging this? It's just a test so won't be hard to remove later if
> it's not helpful ...

Traditionally this stuff has not gone in tree. However, I think it
should be...

Rusty.
SMBIOS support in Qemu?
Hi,

Which version of Qemu contains the SMBIOS code? If I have to get the
code in my repo, is there any place I can get the complete set of
patches?

Thanks
Anjali