Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote:

Any example of how software relies on such two-interrupts-queued-in-IRR/ISR behaviour?

Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if an interrupt is set in IRR but not in PIR, the interrupt will not be reported as coalesced, but it will be coalesced during the PIR->IRR merge.

Yes, so:

1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no.
2. IRR=0, ISR=1, PIR=0. Event: IRR->ISR transfer.
3. vcpu outside of guest mode.
4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no.
5. vcpu enters guest mode.
6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no.
7. HW transfers PIR into IRR.

The set_irq return value at 7 is incorrect: the interrupt event was _not_ queued.

Not sure I understand the flow of events in your description correctly. As I understand it, at 4 set_irq() will return an incorrect result. Basically, when PIR is set to 1 while IRR already has 1 for the vector, the value of set_irq() will be incorrect.

At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: the PIR bit is merged into the IRR bit.

Frankly, I do not see how it can be fixed without a race, given the present HW PIR design.

At kvm_accept_apic_interrupt, check IRR before setting the PIR bit; if IRR is already set, don't set PIR. Or:

apic_accept_interrupt() {
    1. Read ORIG_PIR=PIR, ORIG_IRR=IRR.
       Never set IRR when HWAPIC enabled, even if outside of guest mode.
    2. Set PIR and let HW or SW VM-entry transfer it to IRR.
    3. set_irq return value: (ORIG_PIR or ORIG_IRR set).
}

Two or more concurrent set_irq calls can race with each other, though. Can either document the race or add a lock.

According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked way to access the PIR.
So your solution may be wrong.

Then your apicv patches are broken, because they do exactly that.

The only problem is step 6, but at that point there is already an interrupt pending in IRR. This means the interrupt will be handled, not lost. And even on real hardware this case does exist. So I think it should not be a problem.

This is not the problem we are trying to fix. Sometimes we need to make sure that each interrupt a device generates results in an interrupt handler invocation in the guest. If an interrupt is coalesced (meaning it will not correspond to a separate invocation of a guest interrupt handler), it needs to be re-injected. With PIR, detection of such a condition is broken.

--
			Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote:

- Create an eventfd per vfio device assigned to a guest and register an event handler.
- This fd is passed to the vfio_pci driver through the SET_IRQ ioctl.
- When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked.
- In the handler, decide what action to take. The current action taken is to terminate the guest.

Usually this is not OK, but I guess this is not guest triggerable.

Still not OK. Why not stop the guest with an appropriate stop reason?

--
			Gleb.
I may have found a bug(unlikely) with Qemu-kvm 1.1
We are upgrading, carefully, from 1.0 to 1.1. We welcome the improved AC97 support and the loss of the ehci warning message on startup.

I am finding an issue getting through to the monitor, however. Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor. I have even tried using the -mon option code examples to force a monitor to stdio, but the monitor stubbornly refuses to appear.

Any advice would be greatly appreciated.

Simon
Re: win8 installation iso can not boot on rhel6.2 kernel?
On Tue, Feb 05, 2013 at 02:55:07PM +0800, ya su wrote:

I use the following cmd on the rhel6.2 kernel 2.6.32-220.17.1:

x86_64-softmmu/qemu-system-x86_64 -hda win8.img -cdrom window_8_pro.iso -m 2048 -L pc-bios -cpu host

It will display the following error:

Your PC needs to restart. Please hold down the power button. Error Code: 0x005D Parameters: 0x03100A00 0x68747541 0x69746E65 0x444D4163

I also tried the newest rhel6 kernel version, 2.6.32-279.19.1; it brings out the same result. If I try the standard kernel 2.6.32, it can boot normally. Any suggestions? thanks.

You need the latest RHEL6.4 kernel.

--
			Gleb.
Re: [PATCH 6/6] Revert KVM: MMU: split kvm_mmu_free_page
On Mon, Feb 04, 2013 at 11:24:01PM -0200, Marcelo Tosatti wrote: On Wed, Jan 30, 2013 at 04:45:05PM +0200, Gleb Natapov wrote:

This reverts commit bd4c86eaa6ff10abc4e00d0f45d2a28b10b09df4. There is no user of kvm_mmu_isolate_page() any more.

Signed-off-by: Gleb Natapov g...@redhat.com
---
 arch/x86/kvm/mmu.c | 21 +++--

Applied all except this - Takuya's using it.

He actually reverts this commit, but in a non-obvious way. Reverting it explicitly makes for a clearer commit history.

--
			Gleb.
Re: I may have found a bug(unlikely) with Qemu-kvm 1.1
Veruca Salt verucasal...@hotmail.co.uk writes:

We are upgrading, carefully, from 1.0 to 1.1.

Keep going; we're about to release 1.4 ;)

We welcome the improved AC97 support and the loss of the ehci warning message on startup. I am finding an issue getting through to the monitor however. Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor.

I never use that one, but it should work.

I have even tried using the -mon option code examples to force a monitor to stdio, but the monitor stubbornly refuses to appear. Any advice would be greatly appreciated.

These all work for me:

* Shorthand syntax:

    -monitor stdio

* Equivalent longhand:

    -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline,default=on

* Same as config file snippet, for use with -readconfig:

    [chardev "mon0"]
      backend = "stdio"

    [mon]
      chardev = "mon0"
      mode = "readline"
      default = "on"
[PATCH v3 0/5] KVM: MMU: unify and cleanup the code of walking pte list
Current code has two ways to walk a pte_list: one is pte_list_walk, and the other is rmap_get_first and rmap_get_next; they have the same logic. This patchset tries to unify the code and also make it tidier.

Patch 1: KVM: MMU: introduce mmu_spte_establish, which eliminates the difference between walking the parent pte list and the rmap, preparing for the later patches.
Patch 2: KVM: MMU: clarify the logic in kvm_set_pte_rmapp, which prepares for the next patch; no logic changed.
Patch 3: KVM: MMU: unify the code of walking pte list, which unifies the walking code.
Patch 4: KVM: MMU: fix spte assertion, which fixes a minor bug and removes duplicate code.
Patch 5: KVM: MMU: fast drop all spte on the pte_list, which optimizes dropping all sptes on a rmap and removes all the "goto restart" patterns introduced by Patch 3.

Marcelo, Gleb, please apply them after applying the patchset [PATCH v3 0/3] KVM: MMU: simple cleanups.

Changelog:
v3:
- address Gleb's comments, remove the remaining "goto restart" in kvm_set_pte_rmapp
- improve the changelog

Thanks!
[PATCH v3 1/5] KVM: MMU: introduce mmu_spte_establish
There is little different between walking parent pte and walking ramp: all spte in rmap must be present but this is not true on parent pte list, in kvm_mmu_alloc_page, we always link the parent list before set the spte to present This patch introduce mmu_spte_establish which set the spte before linking it to parent list to eliminates the different then it is possible to unify the code of walking pte list Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c | 81 ++- arch/x86/kvm/paging_tmpl.h | 16 - 2 files changed, 48 insertions(+), 49 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index 8041454..68d4d5f 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -1482,9 +1482,6 @@ static unsigned kvm_page_table_hashfn(gfn_t gfn) static void mmu_page_add_parent_pte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, u64 *parent_pte) { - if (!parent_pte) - return; - pte_list_add(vcpu, parent_pte, sp-parent_ptes); } @@ -1502,7 +1499,7 @@ static void drop_parent_pte(struct kvm_mmu_page *sp, } static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, - u64 *parent_pte, int direct) + int direct) { struct kvm_mmu_page *sp; sp = mmu_memory_cache_alloc(vcpu-arch.mmu_page_header_cache); @@ -1512,7 +1509,6 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, set_page_private(virt_to_page(sp-spt), (unsigned long)sp); list_add(sp-link, vcpu-kvm-arch.active_mmu_pages); sp-parent_ptes = 0; - mmu_page_add_parent_pte(vcpu, sp, parent_pte); kvm_mod_used_mmu_pages(vcpu-kvm, +1); return sp; } @@ -1845,8 +1841,7 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, gva_t gaddr, unsigned level, int direct, -unsigned access, -u64 *parent_pte) +unsigned access) { union kvm_mmu_page_role role; unsigned quadrant; @@ -1876,19 +1871,15 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, if (sp-unsync kvm_sync_page_transient(vcpu, sp)) break; - mmu_page_add_parent_pte(vcpu, sp, 
parent_pte); - if (sp-unsync_children) { + if (sp-unsync_children) kvm_make_request(KVM_REQ_MMU_SYNC, vcpu); - kvm_mmu_mark_parents_unsync(sp); - } else if (sp-unsync) - kvm_mmu_mark_parents_unsync(sp); __clear_sp_write_flooding_count(sp); trace_kvm_mmu_get_page(sp, false); return sp; } ++vcpu-kvm-stat.mmu_cache_miss; - sp = kvm_mmu_alloc_page(vcpu, parent_pte, direct); + sp = kvm_mmu_alloc_page(vcpu, direct); if (!sp) return sp; sp-gfn = gfn; @@ -1908,6 +1899,35 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu, return sp; } +static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp) +{ + u64 spte; + + spte = __pa(sp-spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK | + shadow_user_mask | shadow_x_mask | shadow_accessed_mask; + + mmu_spte_set(sptep, spte); +} + +static struct kvm_mmu_page * +mmu_spte_establish(struct kvm_vcpu *vcpu, u64 *spte, gfn_t gfn, gva_t gaddr, + unsigned level, int direct, unsigned access) +{ + struct kvm_mmu_page *sp; + + WARN_ON(is_shadow_present_pte(*spte)); + + sp = kvm_mmu_get_page(vcpu, gfn, gaddr, level, direct, access); + + link_shadow_page(spte, sp); + mmu_page_add_parent_pte(vcpu, sp, spte); + + if (sp-unsync_children || sp-unsync) + kvm_mmu_mark_parents_unsync(sp); + + return sp; +} + static void shadow_walk_init(struct kvm_shadow_walk_iterator *iterator, struct kvm_vcpu *vcpu, u64 addr) { @@ -1957,16 +1977,6 @@ static void shadow_walk_next(struct kvm_shadow_walk_iterator *iterator) return __shadow_walk_next(iterator, *iterator-sptep); } -static void link_shadow_page(u64 *sptep, struct kvm_mmu_page *sp) -{ - u64 spte; - - spte = __pa(sp-spt) | PT_PRESENT_MASK | PT_WRITABLE_MASK | - shadow_user_mask | shadow_x_mask | shadow_accessed_mask; - - mmu_spte_set(sptep, spte); -} - static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep, unsigned direct_access) { @@ -2023,11 +2033,6 @@ static void kvm_mmu_page_unlink_children(struct
[PATCH v3 2/5] KVM: MMU: clarify the logic in kvm_set_pte_rmapp
In kvm_set_pte_rmapp, if the new mapping is writable, we need to remove all sptes pointing to that page; otherwise we only need to adjust the sptes to let them point to the new page. This patch clarifies the logic and makes the later patch cleaner.

[ Impact: no logic changed ]

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 68d4d5f..a0dc0d7 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1225,16 +1225,16 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	WARN_ON(pte_huge(*ptep));
 	new_pfn = pte_pfn(*ptep);

-	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
-		BUG_ON(!is_shadow_present_pte(*sptep));
-		rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n", sptep, *sptep);
+	if (pte_write(*ptep))
+		need_flush = kvm_unmap_rmapp(kvm, rmapp, slot, data);
+	else
+		for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
+			BUG_ON(!is_shadow_present_pte(*sptep));
+			rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n",
+				    sptep, *sptep);

-		need_flush = 1;
+			need_flush = 1;

-		if (pte_write(*ptep)) {
-			drop_spte(kvm, sptep);
-			sptep = rmap_get_first(*rmapp, &iter);
-		} else {
 			new_spte = *sptep & ~PT64_BASE_ADDR_MASK;
 			new_spte |= (u64)new_pfn << PAGE_SHIFT;
@@ -1246,7 +1246,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			mmu_spte_set(sptep, new_spte);
 			sptep = rmap_get_next(&iter);
 		}
-	}

 	if (need_flush)
 		kvm_flush_remote_tlbs(kvm);
--
1.7.7.6
[PATCH v3 3/5] KVM: MMU: unify the code of walking pte list
Current code has two ways to walk pte_list, the one is pte_list_walk and the another way is rmap_get_first and rmap_get_next, they have the same logic. This patch introduces for_each_spte_in_pte_list to integrate their code [ Impact: no logic changed, most of the change is function/struct rename ] Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com --- arch/x86/kvm/mmu.c | 178 ++ arch/x86/kvm/mmu_audit.c |5 +- 2 files changed, 86 insertions(+), 97 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index a0dc0d7..2291ea3 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -945,26 +945,75 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list) } } -typedef void (*pte_list_walk_fn) (u64 *spte); -static void pte_list_walk(unsigned long *pte_list, pte_list_walk_fn fn) +/* + * Used by the following functions to iterate through the sptes linked by a + * pte_list. All fields are private and not assumed to be used outside. + */ +struct pte_list_iterator { + /* private fields */ + struct pte_list_desc *desc; /* holds the sptep if not NULL */ + int pos;/* index of the sptep */ +}; + +/* + * Iteration must be started by this function. This should also be used after + * removing/dropping sptes from the pte_list link because in such cases the + * information in the itererator may not be valid. + * + * Returns sptep if found, NULL otherwise. + */ +static u64 *pte_list_get_first(unsigned long pte_list, + struct pte_list_iterator *iter) { - struct pte_list_desc *desc; - int i; + if (!pte_list) + return NULL; - if (!*pte_list) - return; + if (!(pte_list 1)) { + iter-desc = NULL; + return (u64 *)pte_list; + } + + iter-desc = (struct pte_list_desc *)(pte_list ~1ul); + iter-pos = 0; + return iter-desc-sptes[iter-pos]; +} + +/* + * Must be used with a valid iterator: e.g. after pte_list_get_next(). + * + * Returns sptep if found, NULL otherwise. 
+ */ +static u64 *pte_list_get_next(struct pte_list_iterator *iter) +{ + if (iter-desc) { + if (iter-pos PTE_LIST_EXT - 1) { + u64 *sptep; + + ++iter-pos; + sptep = iter-desc-sptes[iter-pos]; + if (sptep) + return sptep; + } - if (!(*pte_list 1)) - return fn((u64 *)*pte_list); + iter-desc = iter-desc-more; - desc = (struct pte_list_desc *)(*pte_list ~1ul); - while (desc) { - for (i = 0; i PTE_LIST_EXT desc-sptes[i]; ++i) - fn(desc-sptes[i]); - desc = desc-more; + if (iter-desc) { + iter-pos = 0; + /* desc-sptes[0] cannot be NULL */ + return iter-desc-sptes[iter-pos]; + } } + + return NULL; } +#define for_each_spte_in_pte_list(pte_list, iter, spte)\ + for (spte = pte_list_get_first(pte_list, (iter)); \ + spte != NULL; spte = pte_list_get_next((iter))) + +#define for_each_spte_in_rmap(rmap, iter, spte)\ + for_each_spte_in_pte_list(rmap, iter, spte) + static unsigned long *__gfn_to_rmap(gfn_t gfn, int level, struct kvm_memory_slot *slot) { @@ -1016,67 +1065,6 @@ static void rmap_remove(struct kvm *kvm, u64 *spte) pte_list_remove(spte, rmapp); } -/* - * Used by the following functions to iterate through the sptes linked by a - * rmap. All fields are private and not assumed to be used outside. - */ -struct rmap_iterator { - /* private fields */ - struct pte_list_desc *desc; /* holds the sptep if not NULL */ - int pos;/* index of the sptep */ -}; - -/* - * Iteration must be started by this function. This should also be used after - * removing/dropping sptes from the rmap link because in such cases the - * information in the itererator may not be valid. - * - * Returns sptep if found, NULL otherwise. - */ -static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator *iter) -{ - if (!rmap) - return NULL; - - if (!(rmap 1)) { - iter-desc = NULL; - return (u64 *)rmap; - } - - iter-desc = (struct pte_list_desc *)(rmap ~1ul); - iter-pos = 0; - return iter-desc-sptes[iter-pos]; -} - -/* - * Must be used with a valid iterator: e.g. after rmap_get_first(). 
- * - * Returns sptep if found, NULL otherwise. - */ -static u64 *rmap_get_next(struct rmap_iterator *iter) -{ - if (iter-desc) { - if (iter-pos PTE_LIST_EXT - 1) { - u64 *sptep; - -
[PATCH v3 4/5] KVM: MMU: fix spte assertion
The PT_PRESENT_MASK bit is not enough to tell whether an spte has already been mapped into the pte-list, since mmio sptes also set this bit. Use is_shadow_present_pte instead to fix it.

Also, this patch moves many assertions to a common place to clean up the code.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 20 ++--
 1 files changed, 6 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2291ea3..58f813a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1009,7 +1009,9 @@ static u64 *pte_list_get_next(struct pte_list_iterator *iter)

 #define for_each_spte_in_pte_list(pte_list, iter, spte)		\
 	for (spte = pte_list_get_first(pte_list, &(iter));	\
-	     spte != NULL; spte = pte_list_get_next(&(iter)))
+	     spte != NULL &&					\
+	       ({ WARN_ON(!is_shadow_present_pte(*(spte))); 1; });	\
+	     spte = pte_list_get_next(&(iter)))

 #define for_each_spte_in_rmap(rmap, iter, spte)		\
 	for_each_spte_in_pte_list(rmap, iter, spte)
@@ -1128,11 +1130,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 	struct pte_list_iterator iter;
 	bool flush = false;

-	for_each_spte_in_rmap(*rmapp, iter, sptep) {
-		BUG_ON(!(*sptep & PT_PRESENT_MASK));
-
+	for_each_spte_in_rmap(*rmapp, iter, sptep)
 		flush |= spte_write_protect(kvm, sptep, pt_protect);
-	}

 	return flush;
 }
@@ -1215,7 +1214,6 @@ static int kvm_set_pte_rmapp(struct kvm *kvm, unsigned long *rmapp,
 		need_flush = kvm_unmap_rmapp(kvm, rmapp, slot, data);
 	else
 		for_each_spte_in_rmap(*rmapp, iter, sptep) {
-			BUG_ON(!is_shadow_present_pte(*sptep));
 			rmap_printk("kvm_set_pte_rmapp: spte %p %llx\n",
 				    sptep, *sptep);
@@ -1336,15 +1334,12 @@ static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 		goto out;
 	}

-	for_each_spte_in_rmap(*rmapp, iter, sptep) {
-		BUG_ON(!is_shadow_present_pte(*sptep));
-
+	for_each_spte_in_rmap(*rmapp, iter, sptep)
 		if (*sptep & shadow_accessed_mask) {
 			young = 1;
 			clear_bit((ffs(shadow_accessed_mask) - 1),
 				  (unsigned long *)sptep);
 		}
-	}
 out:
 	/* @data has hva passed to kvm_age_hva(). */
 	trace_kvm_age_page(data, slot, young);
@@ -1366,14 +1361,11 @@ static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
 	if (!shadow_accessed_mask)
 		goto out;

-	for_each_spte_in_rmap(*rmapp, iter, sptep) {
-		BUG_ON(!is_shadow_present_pte(*sptep));
-
+	for_each_spte_in_rmap(*rmapp, iter, sptep)
 		if (*sptep & shadow_accessed_mask) {
 			young = 1;
 			break;
 		}
-	}
 out:
 	return young;
 }
--
1.7.7.6
[PATCH v3 5/5] KVM: MMU: fast drop all spte on the pte_list
If a shadow page is being zapped or a host page is going to be freed, kvm will drop all the reverse-mappings on the shadow page or the gfn. Currently, it drops the reverse-mappings one by one: it deletes the first reverse-mapping, then moves the other reverse-mappings between the descriptor tables. When the last descriptor table becomes empty, it is freed.

This works well if we only have a few reverse-mappings, but some pte_lists are very long; during my tracing, I saw some gfns with more than 200 sptes listed on their pte-list (1G memory in the guest on softmmu). Optimizing the drop of such long pte-lists is worthwhile; at least it is good for deleting memslots and for ksm/thp merged pages.

This patch introduces a better way to optimize for this case: it walks all the reverse-mappings and clears them, then frees all the descriptor tables together.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com
---
 arch/x86/kvm/mmu.c | 36 +++-
 1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 58f813a..aa7a887 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -945,6 +945,25 @@ static void pte_list_remove(u64 *spte, unsigned long *pte_list)
 	}
 }

+static void pte_list_destroy(unsigned long *pte_list)
+{
+	struct pte_list_desc *desc;
+	unsigned long list_value = *pte_list;
+
+	*pte_list = 0;
+
+	if (!(list_value & 1))
+		return;
+
+	desc = (struct pte_list_desc *)(list_value & ~1ul);
+	while (desc) {
+		struct pte_list_desc *next_desc = desc->more;
+
+		mmu_free_pte_list_desc(desc);
+		desc = next_desc;
+	}
+}
+
 /*
  * Used by the following functions to iterate through the sptes linked by a
  * pte_list. All fields are private and not assumed to be used outside.
@@ -1183,17 +1202,17 @@ static bool rmap_write_protect(struct kvm *kvm, u64 gfn)
 static int kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp,
 			   struct kvm_memory_slot *slot, unsigned long data)
 {
-	u64 *sptep;
 	struct pte_list_iterator iter;
+	u64 *sptep;
 	int need_tlb_flush = 0;

-restart:
 	for_each_spte_in_rmap(*rmapp, iter, sptep) {
-		drop_spte(kvm, sptep);
+		mmu_spte_clear_track_bits(sptep);
 		need_tlb_flush = 1;
-		goto restart;
 	}

+	pte_list_destroy(rmapp);
+
 	return need_tlb_flush;
 }
@@ -2016,11 +2035,10 @@ static void kvm_mmu_unlink_parents(struct kvm *kvm, struct kvm_mmu_page *sp)
 	u64 *sptep;
 	struct pte_list_iterator iter;

-restart:
-	for_each_spte_in_pte_list(sp->parent_ptes, iter, sptep) {
-		drop_parent_pte(sp, sptep);
-		goto restart;
-	}
+	for_each_spte_in_pte_list(sp->parent_ptes, iter, sptep)
+		mmu_spte_clear_no_track(sptep);
+
+	pte_list_destroy(&sp->parent_ptes);
 }

 static int mmu_zap_unsync_children(struct kvm *kvm,
--
1.7.7.6
Re: win8 installation iso can not boot on rhel6.2 kernel?
Gleb:

Would you please tell where to find the RHEL 6.4 kernel, as the current latest official release is 6.3? And could you figure out the root cause that produces the problem? thanks.

Regards.

Suya

2013/2/5 Gleb Natapov g...@redhat.com:

On Tue, Feb 05, 2013 at 02:55:07PM +0800, ya su wrote:

I use the following cmd on the rhel6.2 kernel 2.6.32-220.17.1: x86_64-softmmu/qemu-system-x86_64 -hda win8.img -cdrom window_8_pro.iso -m 2048 -L pc-bios -cpu host. It will display the following error:

Your PC needs to restart. Please hold down the power button. Error Code: 0x005D Parameters: 0x03100A00 0x68747541 0x69746E65 0x444D4163

I also tried the newest rhel6 kernel version, 2.6.32-279.19.1; it brings out the same result. If I try the standard kernel 2.6.32, it can boot normally. Any suggestions? thanks.

You need the latest RHEL6.4 kernel.
Re: win8 installation iso can not boot on rhel6.2 kernel?
On Tue, Feb 05, 2013 at 04:55:57PM +0800, ya su wrote:

Gleb: would you please tell where to find the RHEL 6.4 kernel, as the current latest official release is 6.3? And could you figure out the root cause that produces the problem? thanks.

Through your RHEL subscription.

--
			Gleb.
RE: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
-----Original Message-----
From: Gleb Natapov [mailto:g...@redhat.com]
Sent: Tuesday, February 05, 2013 12:05 AM
To: Blue Swirl
Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote:

- Create an eventfd per vfio device assigned to a guest and register an event handler.
- This fd is passed to the vfio_pci driver through the SET_IRQ ioctl.
- When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked.
- In the handler, decide what action to take. The current action taken is to terminate the guest.

Usually this is not OK, but I guess this is not guest triggerable.

Still not OK. Why not stop the guest with an appropriate stop reason?

The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex, and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions?

Thanks

Vijay
Re: [PATCH v3] KVM: MMU: lazily drop large spte
On Tue, Feb 05, 2013 at 03:11:09PM +0800, Xiao Guangrong wrote:

Currently, kvm zaps the large spte if write-protection is needed, so a later read can fault on that spte. Actually, we can make the large spte readonly instead of making it un-present; the page fault caused by a read access can then be avoided.

The idea is from Avi:
| As I mentioned before, write-protecting a large spte is a good idea,
| since it moves some work from protect-time to fault-time, so it reduces
| jitter. This removes the need for the return value.

Signed-off-by: Xiao Guangrong xiaoguangr...@linux.vnet.ibm.com

Reviewed-by: Gleb Natapov g...@redhat.com

---
Changelog:
v3:
- address Gleb's comments; make the function return true if a flush is needed instead of returning it via a pointer to a variable
- improve the changelog

 arch/x86/kvm/mmu.c | 23 +++
 1 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 42ba85c..ff2fc80 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1106,8 +1106,7 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)

 /*
  * Write-protect on the specified @sptep, @pt_protect indicates whether
- * spte writ-protection is caused by protecting shadow page table.
- * @flush indicates whether tlb need be flushed.
+ * spte write-protection is caused by protecting shadow page table.
  *
  * Note: write protection is difference between drity logging and spte
  * protection:
@@ -1116,10 +1115,9 @@ static void drop_large_spte(struct kvm_vcpu *vcpu, u64 *sptep)
  * - for spte protection, the spte can be writable only after unsync-ing
  *   shadow page.
  *
- * Return true if the spte is dropped.
+ * Return true if tlb need be flushed.
  */
-static bool
-spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)
+static bool spte_write_protect(struct kvm *kvm, u64 *sptep, bool pt_protect)
 {
 	u64 spte = *sptep;
@@ -1129,17 +1127,11 @@ spte_write_protect(struct kvm *kvm, u64 *sptep, bool *flush, bool pt_protect)

 	rmap_printk("rmap_write_protect: spte %p %llx\n", sptep, *sptep);

-	if (__drop_large_spte(kvm, sptep)) {
-		*flush |= true;
-		return true;
-	}
-
 	if (pt_protect)
 		spte &= ~SPTE_MMU_WRITEABLE;
 	spte = spte & ~PT_WRITABLE_MASK;

-	*flush |= mmu_spte_update(sptep, spte);
-	return false;
+	return mmu_spte_update(sptep, spte);
 }

 static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
@@ -1151,11 +1143,8 @@ static bool __rmap_write_protect(struct kvm *kvm, unsigned long *rmapp,
 	for (sptep = rmap_get_first(*rmapp, &iter); sptep;) {
 		BUG_ON(!(*sptep & PT_PRESENT_MASK));

-		if (spte_write_protect(kvm, sptep, &flush, pt_protect)) {
-			sptep = rmap_get_first(*rmapp, &iter);
-			continue;
-		}
+		flush |= spte_write_protect(kvm, sptep, pt_protect);
 		sptep = rmap_get_next(&iter);
 	}
@@ -2611,6 +2600,8 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 		break;
 	}

+	drop_large_spte(vcpu, iterator.sptep);
+
 	if (!is_shadow_present_pte(*iterator.sptep)) {
 		u64 base_addr = iterator.addr;
--
1.7.7.6

--
			Gleb.
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote:

-----Original Message-----
From: Gleb Natapov [mailto:g...@redhat.com]
Sent: Tuesday, February 05, 2013 12:05 AM
To: Blue Swirl
Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org
Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices

On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote:

- Create an eventfd per vfio device assigned to a guest and register an event handler.
- This fd is passed to the vfio_pci driver through the SET_IRQ ioctl.
- When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked.
- In the handler, decide what action to take. The current action taken is to terminate the guest.

Usually this is not OK, but I guess this is not guest triggerable.

Still not OK. Why not stop the guest with an appropriate stop reason?

The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex, and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions?

I am thinking vm_stop(). Stopping an SMP guest (and a UP one too, in fact) involves sending IPIs to the other cpus running the guest's vcpus. Both exit() and vm_stop() will do it, but the former does it implicitly in the kernel and the latter explicitly in QEMU.

--
			Gleb.
RE: I may have found a bug(unlikely) with Qemu-kvm 1.1
From: verucasal...@hotmail.co.uk
To: arm...@redhat.com
CC: kvm@vger.kernel.org
Subject: RE: I may have found a bug(unlikely) with Qemu-kvm 1.1
Date: Tue, 5 Feb 2013 09:19:13 +

Thank you very much Markus. We were literally two minutes ahead: we've added a tty to -monitor so that a function key exposes it; actually, this is better than the original qemu monitor. So now it's (say) -monitor /dev/tty2 and a simple ctrl-alt-f2 exposes a monitor screen. Sorry for the false alarm, and once again, thanks a million.

Simon

From: arm...@redhat.com
To: verucasal...@hotmail.co.uk
CC: kvm@vger.kernel.org
Subject: Re: I may have found a bug(unlikely) with Qemu-kvm 1.1
Date: Tue, 5 Feb 2013 09:48:59 +0100

Veruca Salt verucasal...@hotmail.co.uk writes: We are upgrading, carefully, from 1.0 to 1.1.

Keep going; we're about to release 1.4 ;)

We welcome the improved AC97 support and the loss of the ehci warning message on startup. I am finding an issue getting through to the monitor, however. Neither ctrl-alt-shift-2 nor ctrl-alt-2 exposes the monitor.

I never use that one, but it should work.

I have even tried using the -mon option code examples to force a monitor to stdio, but the monitor stubbornly refuses to appear. Any advice would be greatly appreciated.

These all work for me:

* Shorthand syntax:
    -monitor stdio
* Equivalent longhand:
    -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline,default=on
* Same as config file snippet, for use with -readconfig:
    [chardev "mon0"]
      backend = "stdio"
    [mon]
      chardev = "mon0"
      mode = "readline"
      default = "on"

Also, sorry for top post. I know how you guys hate that.
RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. According the SDM, software should not touch the IRR when target vcpu is running. 
Instead, use a locked access to the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The only problem is step 6, but at that point there is already an interrupt pending in the IRR. This means the interrupt will be handled, not lost. And even on real hardware this case does exist, so I think it should not be a problem. This is not the problem we are trying to fix. Sometimes we need to make sure that each interrupt a device generates results in an interrupt handler invocation in the guest. If an interrupt is coalesced (meaning it will not correspond to a separate invocation of a guest interrupt handler) it needs to be re-injected. With PIR, detection of such a condition is broken. Best regards, Yang
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. 
According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked access to the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates the IRR directly on the apic page. The only problem is step 6, but at that point there is already an interrupt pending in the IRR. This means the interrupt will be handled, not lost. And even on real hardware this case does exist, so I think it should not be a problem. This is not the problem we are trying to fix. Sometimes we need to make sure that each interrupt a device generates results in an interrupt handler invocation in the guest. If an interrupt is coalesced (meaning it will not correspond to a separate invocation of a guest interrupt handler) it needs to be re-injected. With PIR, detection of such a condition is broken. Best regards, Yang -- Gleb.
RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. 
According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked access to the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates the IRR directly on the apic page. No, all the updates ensure the target vcpu is not running, so it's safe to touch the IRR. The only problem is step 6, but at that point there is already an interrupt pending in the IRR. This means the interrupt will be handled, not lost. And even on real hardware this case does exist, so I think it should not be a problem. This is not the problem we are trying to fix. Sometimes we need to make sure that each interrupt a device generates results in an interrupt handler invocation in the guest. If an interrupt is coalesced (meaning it will not correspond to a separate invocation of a guest interrupt handler) it needs to be re-injected. With PIR, detection of such a condition is broken. Best regards, Yang -- Gleb. Best regards, Yang
RE: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
-Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Usually this is not OK, but I guess this is not guest triggerable. Still not OK. Why not stop a guest with appropriate stop reason? The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions ? I am thinking vm_stop(). Stopping SMP guest (and UP too in fact) involves sending IPIs to other cpus running guest's vcpus. Both exit() and vm_stop() will do it, but former is implicitly in the kernel and later is explicitly in QEMU. 
I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, the guest ended up in a hang rather than exiting. There seems to be some cleanup work being done as part of vm_stop. In our case, we wanted the guest to exit immediately, so the use of hw_error() seemed appropriate. Thoughts ? Vijay -- Gleb.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). 
} Two or more concurrent set_irq calls can race with each other, though. Can either document the race or add a lock. According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked access to the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates the IRR directly on the apic page. No, all the updates ensure the target vcpu is not running, so it's safe to touch the IRR. Not at all. Read the code. -- Gleb.
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Usually this is not OK, but I guess this is not guest triggerable. Still not OK. Why not stop a guest with appropriate stop reason? The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions ? I am thinking vm_stop(). Stopping SMP guest (and UP too in fact) involves sending IPIs to other cpus running guest's vcpus. 
Both exit() and vm_stop() will do it, but the former is implicit in the kernel and the latter is explicit in QEMU. I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, the guest ended up in a hang rather than exiting. There seems to be some cleanup work being done as part of vm_stop. In our case, we wanted the guest to exit immediately, so the use of hw_error() seemed appropriate. What makes you think it hung? It stopped, which is precisely what it should do if you call vm_stop(). Now it is possible for the vm user to investigate what happened and even salvage some data from guest memory. -- Gleb.
Re: [PATCH v3 0/3] KVM: MMU: simple cleanups
On Tue, Feb 05, 2013 at 03:26:21PM +0800, Xiao Guangrong wrote: There are the simple cleanups for MMU, no function / logic changed. Marcelo, Gleb, please apply them after applying [PATCH v3] KVM: MMU: lazily drop large spte Changelog: no change, just split them from the previous patchset for good review. Thanks! Reviewed-by: Gleb Natapov g...@redhat.com -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
-Original Message- From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci- ow...@vger.kernel.org] On Behalf Of Gleb Natapov Sent: Tuesday, February 05, 2013 3:37 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Usually this is not OK, but I guess this is not guest triggerable. Still not OK. Why not stop a guest with appropriate stop reason? 
The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions ? I am thinking vm_stop(). Stopping SMP guest (and UP too in fact) involves sending IPIs to other cpus running guest's vcpus. Both exit() and vm_stop() will do it, but former is implicitly in the kernel and later is explicitly in QEMU. I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, guest ended up in a hang rather than exiting. There seems to some cleanup work which is being done as part of vm_stop. In our case, we wanted the guest to exit immediately. So use of hw_error() seemed appropriate. What makes you think it hang? It stopped, precisely what it should do if you call vm_stop(). Now it is possible for vm user to investigate what happened and even salvage some data from guest memory. That was ignorance on my part on the expected behavior of vm_stop(). So what you are suggesting is to stop the guest displaying an appropriate error/next-steps message and have the users do any data-collection/investigation and then manually kill the guest, if they so desire. Right ? Sounds reasonable. As long as the guest is not touching the device, it should be okay. Alex, Any comments ? Thanks Vijay -- Gleb. -- To unsubscribe from this list: send the line unsubscribe linux-pci in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Investigating abnormal stealtimes
On Tue, Feb 5, 2013 at 1:26 AM, Marcelo Tosatti mtosa...@redhat.com wrote: - 'Steal time' is the amount of time taken while vcpu is able to run but not runnable. Maybe 'vmexit latency' is a better name. You are right, 'vmexit latency' is a better name. - Perhaps it would be good to subtract the time the thread was involuntarily scheduled out due to 'timeslice' expiration. Otherwise, running a CPU-intensive task returns false positives (that is, long delays due to reschedules caused by the 'timeslice' being exhausted by guest CPU activity, not due to KVM or QEMU issues such as a voluntary schedule in pthread_mutex_lock). Alternatively you can raise the priority of the vcpu threads (to get rid of the false positives). I think this depends on the use-case. If the aim is to find out why the guest has poor response times then timeslice expiration is interesting. If the aim is to optimize QEMU or kvm.ko then timeslice expiration is a nuisance :). Your idea to raise the vcpu thread priority sounds good to me. - Idea: Would be handy to extract trace events in the offending 'latency above threshold' vmexit/vmentry region. Say that you enable other trace events (unrelated to kvm) which can help identify the culprit. Instead of scanning the file manually searching for 100466.1062486786, save one vmexit/vmentry cycle, along with other trace events in that period, in a separate file. Good idea. Stefan
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Tue, Feb 05, 2013 at 12:05:11PM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci- ow...@vger.kernel.org] On Behalf Of Gleb Natapov Sent: Tuesday, February 05, 2013 3:37 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. 
Usually this is not OK, but I guess this is not guest triggerable. Still not OK. Why not stop a guest with appropriate stop reason? The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions ? I am thinking vm_stop(). Stopping SMP guest (and UP too in fact) involves sending IPIs to other cpus running guest's vcpus. Both exit() and vm_stop() will do it, but former is implicitly in the kernel and later is explicitly in QEMU. I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, guest ended up in a hang rather than exiting. There seems to some cleanup work which is being done as part of vm_stop. In our case, we wanted the guest to exit immediately. So use of hw_error() seemed appropriate. What makes you think it hang? It stopped, precisely what it should do if you call vm_stop(). Now it is possible for vm user to investigate what happened and even salvage some data from guest memory. That was ignorance on my part on the expected behavior of vm_stop(). So what you are suggesting is to stop the guest displaying an appropriate error/next-steps message and have the users do any data-collection/investigation and then manually kill the guest, if they so desire. Right ? Yes. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/5] KVM: MMU: unify and cleanup the code of walking pte list
On Tue, Feb 05, 2013 at 04:52:35PM +0800, Xiao Guangrong wrote: Current code has two ways to walk a pte_list: one is pte_list_walk and the other is rmap_get_first and rmap_get_next; they have the same logic. This patchset unifies the code and also makes it tidier.

Patch 1: KVM: MMU: introduce mmu_spte_establish, which eliminates the difference between walking the parent pte list and the rmap, preparing for the later patches.
Patch 2: KVM: MMU: clarify the logic in kvm_set_pte_rmapp, which prepares for the next patch; no logic changed.
Patch 3: KVM: MMU: unify the code of walking pte list, unifying the walking code.
Patch 4: KVM: MMU: fix spte assertion, fixing a minor bug and removing duplicate code.
Patch 5: KVM: MMU: fast drop all sptes on the pte_list, optimizing the dropping of all sptes on an rmap and removing the goto-restart pattern introduced by Patch 3.

Marcelo, Gleb, please apply them after applying the patchset [PATCH v3 0/3] KVM: MMU: simple cleanups.

Changelog:
v3:
- address Gleb's comments, remove the remaining goto restart in kvm_set_pte_rmapp
- improve the changelog

Reviewed-by: Gleb Natapov g...@redhat.com

--
Gleb.
Re: How to limit upload bandwidth for a guest server?
On Sun, Feb 03, 2013 at 07:59:07PM -0600, Neil Aggarwal wrote: I have a CentOS server using KVM to host guest servers. I am trying to limit the bandwidth usable by a guest server. I tried to use tc, but that is only limiting the download bandwidth to a server. It does not seem to filter packets uploaded by the server. Is there a tool to limit upload traffic for a guest server? Consider using management tools like libvirt that handle tc and friends for you. The domain XML elements are <interface> <bandwidth> with <inbound> and <outbound>. Back to the question, are you looking for ingress qdiscs? http://www.lartc.org/howto/lartc.adv-qdisc.ingress.html This is a standard tc question, not related to virtualization. You may get better help from Linux networking mailing lists or IRC channels. Stefan
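The ingress-qdisc approach from the LARTC link above can be sketched as follows. This is an untested sketch, not a tuned recipe: the device name tap0 and the 1mbit/100k figures are placeholders to adapt. The point is that packets the guest uploads arrive on the host's tap device as ingress traffic, so egress shaping on tap0 only touches the download direction; an ingress policer on the tap device is what caps uploads.

```shell
# Placeholder device and rate; adjust to your setup.
DEV=tap0

# Attach an ingress qdisc to the guest's tap device on the host.
tc qdisc add dev "$DEV" handle ffff: ingress

# Police everything arriving from the guest (its uploads) to ~1 Mbit/s,
# dropping packets that exceed the rate.
tc filter add dev "$DEV" parent ffff: protocol ip prio 1 u32 \
    match ip src 0.0.0.0/0 \
    police rate 1mbit burst 100k drop flowid :1
```

If you manage guests with libvirt, the declarative equivalent is the <outbound> child of <interface><bandwidth> in the domain XML, which generates similar tc rules for you.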
Re: KVM call agenda for 2013-02-05
Juan Quintela quint...@redhat.com writes: Hi Please send in any agenda topics you are interested in. FYI, I have a conflict for today so I won't be able to attend. Regards, Anthony Liguori Later, Juan.
RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). 
} Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked way to access the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates IRR directly on the apic page. No, all the updates ensure the target vcpu is not running. So it's safe to touch the IRR. Not at all. Read the code. Sorry. I still cannot figure out which code is wrong. All the places that call sync_pir_to_irr() are on the target vcpu. Can you point out the code? Thanks. Best regards, Yang
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Tue, Feb 05, 2013 at 01:26:42PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. 
set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked way to access the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates IRR directly on the apic page. No, all the updates ensure the target vcpu is not running. So it's safe to touch the IRR. Not at all. Read the code. Sorry. I still cannot figure out which code is wrong. All the places that call sync_pir_to_irr() are on the target vcpu. Can you point out the code? Thanks. I am talking about the vapic patches which are already in, not the pir patches. -- Gleb.
Re: DMAR faults from unrelated device when vfio is used
On Monday, 2013-02-04 at 08:49 -0700, Alex Williamson wrote: Can you clarify what you mean by assign? Are you actually assigning the root ports to the qemu guest (1c.0 1c.6)? vfio will require they be owned by vfio-pci to make use of 3:00.0, but assigning them to the guest is not recommended. Can you provide your qemu command line? I did hand all of them to the guest OS. Removing 1c.0 1c.6 from the qemu command line seems to have done the trick. Thanks! Here's my working qemu command line:
qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
  -drive file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0 \
  -full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice tablet \
  -vga std -global VGA.vgamem_mb=256 \
  -netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
  -net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12 \
  -rtc base=localtime \
  -device vfio-pci,host=00:1b.0,id=audio \
  -device vfio-pci,host=00:1a.0,id=ehci1 \
  -device vfio-pci,host=00:1d.0,id=ehci2 \
  -device vfio-pci,host=03:00.0,id=xhci1 \
  -monitor tcp::,server,nowait
We need to re-visit how to handle pcieport devices with vfio-pci, perhaps white-listing it as a vfio compatible driver, but this still should not interfere with devices external to the group. The DMAR fault address looks pretty bogus unless you happen to have 100GB+ of ram in the system. Nope, definitely not. :) vfio makes use of the IOMMU API for programming DMA translations, so any reserved fields would have to be programmed by intel-iommu itself. We could of course be passing some kind of bogus data that intel-iommu isn't catching. If you're assigning the root ports to the guest, I'd start with that, don't do it. Attach them to vfio, but don't give them to the guest. Maybe that'll give us a hint. I also notice that your USB 3 controller is dead: 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev ff) (prog-if ff) !!!
Unknown header type 7f
We only see unknown header type 7f when the read from the device returns -1. This might have something to do with the root port above it (1c.6) being in state D3. Windows likes to put unused devices in D3, which leads me to suspect you are giving it to the guest. The error no longer occurs. lspci now shows this:
-- snip --
03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04) (prog-if 30 [XHCI])
        Subsystem: Intel Corporation Device 2008
        Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort+ SERR- PERR- INTx-
        Interrupt: pin A routed to IRQ 18
        Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] [size=8K]
        Capabilities: [50] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: Data:
        Capabilities: [90] MSI-X: Enable- Count=8 Masked-
                Vector table: BAR=0 offset=1000
                PBA: BAR=0 offset=1080
        Capabilities: [a0] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 4us, L1 unlimited
                        ClockPM+ Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
                        Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                        Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
                        EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote: -Original Message- From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci- ow...@vger.kernel.org] On Behalf Of Gleb Natapov Sent: Tuesday, February 05, 2013 3:37 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. Usually this is not OK, but I guess this is not guest triggerable. 
Still not OK. Why not stop a guest with an appropriate stop reason? The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions? I am thinking vm_stop(). Stopping an SMP guest (and a UP one too, in fact) involves sending IPIs to the other cpus running the guest's vcpus. Both exit() and vm_stop() will do it, but the former implicitly in the kernel and the latter explicitly in QEMU. I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, the guest ended up in a hang rather than exiting. There seems to be some cleanup work which is being done as part of vm_stop. In our case, we wanted the guest to exit immediately. So use of hw_error() seemed appropriate. What makes you think it hung? It stopped, precisely what it should do if you call vm_stop(). Now it is possible for the vm user to investigate what happened and even salvage some data from guest memory. That was ignorance on my part on the expected behavior of vm_stop(). So what you are suggesting is to stop the guest displaying an appropriate error/next-steps message and have the users do any data-collection/investigation and then manually kill the guest, if they so desire. Right? Sounds reasonable. As long as the guest is not touching the device, it should be okay. Alex, Any comments? What's the libvirt behavior when a guest goes to vm_stop? My only concern would be whether the user is going to be confused by a state where the vm is still up, but not running. I imagine they'll have to manually stop it and restart it to continue. Thanks, Alex
RE: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 01:26:42PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. 
Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked way to access the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates IRR directly on the apic page. No, all the updates ensure the target vcpu is not running. So it's safe to touch the IRR. Not at all. Read the code. Sorry. I still cannot figure out which code is wrong. All the places that call sync_pir_to_irr() are on the target vcpu. Can you point out the code? Thanks. I am talking about the vapic patches which are already in, not the pir patches. Yes, but the issue will be fixed with the pir patches. With posted interrupts, it will touch the PIR instead of the IRR, and accessing the PIR is allowed by HW. Best regards, Yang
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Tue, Feb 05, 2013 at 06:37:35AM -0700, Alex Williamson wrote: On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote: -Original Message- From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci- ow...@vger.kernel.org] On Behalf Of Gleb Natapov Sent: Tuesday, February 05, 2013 3:37 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. - In the handler decide what action to take. Current action taken is to terminate the guest. 
Usually this is not OK, but I guess this is not guest triggerable. Still not OK. Why not stop a guest with an appropriate stop reason? The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions? I am thinking vm_stop(). Stopping an SMP guest (and a UP one too, in fact) involves sending IPIs to the other cpus running the guest's vcpus. Both exit() and vm_stop() will do it, but the former implicitly in the kernel and the latter explicitly in QEMU. I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, the guest ended up in a hang rather than exiting. There seems to be some cleanup work which is being done as part of vm_stop. In our case, we wanted the guest to exit immediately. So use of hw_error() seemed appropriate. What makes you think it hung? It stopped, precisely what it should do if you call vm_stop(). Now it is possible for the vm user to investigate what happened and even salvage some data from guest memory. That was ignorance on my part on the expected behavior of vm_stop(). So what you are suggesting is to stop the guest displaying an appropriate error/next-steps message and have the users do any data-collection/investigation and then manually kill the guest, if they so desire. Right? Sounds reasonable. As long as the guest is not touching the device, it should be okay. Alex, Any comments? What's the libvirt behavior when a guest goes to vm_stop? My only concern would be whether the user is going to be confused by a state where the vm is still up, but not running. I imagine they'll have to manually stop it and restart it to continue. Thanks, vm_stop() is already the behaviour on KVM internal errors, so the management stack knows how to handle those. Of course vfio should use a meaningful stop reason and not just reuse an existing one. -- Gleb.
Re: [PATCH 2/2] x86, apicv: Add Posted Interrupt supporting
On Tue, Feb 05, 2013 at 01:40:44PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 01:26:42PM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:58:28AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 10:35:55AM +, Zhang, Yang Z wrote: Gleb Natapov wrote on 2013-02-05: On Tue, Feb 05, 2013 at 05:57:14AM +, Zhang, Yang Z wrote: Marcelo Tosatti wrote on 2013-02-05: On Mon, Feb 04, 2013 at 05:59:52PM -0200, Marcelo Tosatti wrote: On Mon, Feb 04, 2013 at 07:13:01PM +0200, Gleb Natapov wrote: On Mon, Feb 04, 2013 at 12:43:45PM -0200, Marcelo Tosatti wrote: Any example how software relies on such two-interrupts-queued-in-IRR/ISR behaviour? Don't know about guests, but KVM relies on it to detect interrupt coalescing. So if interrupt is set in IRR but not in PIR interrupt will not be reported as coalesced, but it will be coalesced during PIR-IRR merge. Yes, so: 1. IRR=1, ISR=0, PIR=0. Event: set_irq, coalesced=no. 2. IRR=0, ISR=1, PIR=0. Event: IRR-ISR transfer. 3. vcpu outside of guest mode. 4. IRR=1, ISR=1, PIR=0. Event: set_irq, coalesced=no. 5. vcpu enters guest mode. 6. IRR=1, ISR=1, PIR=1. Event: set_irq, coalesced=no. 7. HW transfers PIR into IRR. set_irq return value at 7 is incorrect, interrupt event was _not_ queued. Not sure I understand the flow of events in your description correctly. As I understand it at 4 set_irq() will return incorrect result. Basically when PIR is set to 1 while IRR has 1 for the vector the value of set_irq() will be incorrect. At 4 it has not been coalesced: it has been queued to IRR. At 6 it has been coalesced: PIR bit merged into IRR bit. Frankly I do not see how it can be fixed without any race with present HW PIR design. At kvm_accept_apic_interrupt, check IRR before setting PIR bit, if IRR already set, don't set PIR. Or: apic_accept_interrupt() { 1. Read ORIG_PIR=PIR, ORIG_IRR=IRR. 
Never set IRR when HWAPIC enabled, even if outside of guest mode. 2. Set PIR and let HW or SW VM-entry transfer it to IRR. 3. set_irq return value: (ORIG_PIR or ORIG_IRR set). } Two or more concurrent set_irq can race with each other, though. Can either document the race or add a lock. According to the SDM, software should not touch the IRR when the target vcpu is running. Instead, use a locked way to access the PIR. So your solution may be wrong. Then your apicv patches are broken, because they do exactly that. Which code is broken? The one that updates IRR directly on the apic page. No, all the updates ensure the target vcpu is not running. So it's safe to touch the IRR. Not at all. Read the code. Sorry. I still cannot figure out which code is wrong. All the places that call sync_pir_to_irr() are on the target vcpu. Can you point out the code? Thanks. I am talking about the vapic patches which are already in, not the pir patches. Yes, but the issue will be fixed with the pir patches. With posted interrupts, it will touch the PIR instead of the IRR, and accessing the PIR is allowed by HW. So the current code is broken? -- Gleb.
Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO-PCI devices
On Tue, 2013-02-05 at 15:42 +0200, Gleb Natapov wrote: On Tue, Feb 05, 2013 at 06:37:35AM -0700, Alex Williamson wrote: On Tue, 2013-02-05 at 12:05 +, Pandarathil, Vijaymohan R wrote: -Original Message- From: linux-pci-ow...@vger.kernel.org [mailto:linux-pci- ow...@vger.kernel.org] On Behalf Of Gleb Natapov Sent: Tuesday, February 05, 2013 3:37 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 10:59:41AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 1:21 AM To: Pandarathil, Vijaymohan R Cc: Blue Swirl; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux-...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Tue, Feb 05, 2013 at 09:05:19AM +, Pandarathil, Vijaymohan R wrote: -Original Message- From: Gleb Natapov [mailto:g...@redhat.com] Sent: Tuesday, February 05, 2013 12:05 AM To: Blue Swirl Cc: Pandarathil, Vijaymohan R; Alex Williamson; Bjorn Helgaas; Ortiz, Lance E; kvm@vger.kernel.org; qemu-de...@nongnu.org; linux- p...@vger.kernel.org; linux-ker...@vger.kernel.org Subject: Re: [PATCH v3 3/3] QEMU-AER: Qemu changes to support AER for VFIO- PCI devices On Sun, Feb 03, 2013 at 04:36:11PM +, Blue Swirl wrote: On Sun, Feb 3, 2013 at 2:10 PM, Pandarathil, Vijaymohan R vijaymohan.pandarat...@hp.com wrote: - Create eventfd per vfio device assigned to a guest and register an event handler - This fd is passed to the vfio_pci driver through the SET_IRQ ioctl - When the device encounters an error, the eventfd is signalled and the qemu eventfd handler gets invoked. 
- In the handler decide what action to take. Current action taken is to terminate the guest. Usually this is not OK, but I guess this is not guest triggerable. Still not OK. Why not stop a guest with an appropriate stop reason? The thinking was that since this is a hardware error, we would want to stop the guest at the earliest. The hw_error() routine which aborts the qemu process was suggested by Alex and that seemed appropriate. Earlier I was using qemu_system_shutdown_request(). Any suggestions? I am thinking vm_stop(). Stopping an SMP guest (and a UP one too, in fact) involves sending IPIs to the other cpus running the guest's vcpus. Both exit() and vm_stop() will do it, but the former implicitly in the kernel and the latter explicitly in QEMU. I had used vm_stop(RUN_STATE_SHUTDOWN) earlier in my code. But while testing, the guest ended up in a hang rather than exiting. There seems to be some cleanup work which is being done as part of vm_stop. In our case, we wanted the guest to exit immediately. So use of hw_error() seemed appropriate. What makes you think it hung? It stopped, precisely what it should do if you call vm_stop(). Now it is possible for the vm user to investigate what happened and even salvage some data from guest memory. That was ignorance on my part on the expected behavior of vm_stop(). So what you are suggesting is to stop the guest displaying an appropriate error/next-steps message and have the users do any data-collection/investigation and then manually kill the guest, if they so desire. Right? Sounds reasonable. As long as the guest is not touching the device, it should be okay. Alex, Any comments? What's the libvirt behavior when a guest goes to vm_stop? My only concern would be whether the user is going to be confused by a state where the vm is still up, but not running. I imagine they'll have to manually stop it and restart it to continue. Thanks, vm_stop() is already the behaviour on KVM internal errors, so the management stack knows how to handle those.
Of course vfio should use a meaningful stop reason and not just reuse an existing one. Ok, sounds like a good approach to me. Thanks, Alex
Re: KVM call agenda for 2013-02-05
Anthony Liguori anth...@codemonkey.ws wrote: Juan Quintela quint...@redhat.com writes: Hi Please send in any agenda topics you are interested in. FYI, I have a conflict for today so I won't be able to attend. As this was the only answer to the agenda, today's call is cancelled. Happy hacking, Juan.
Re: DMAR faults from unrelated device when vfio is used
On Tue, 2013-02-05 at 14:31 +0100, David Gstir wrote: On Monday, 2013-02-04 at 08:49 -0700, Alex Williamson wrote: Can you clarify what you mean by assign? Are you actually assigning the root ports to the qemu guest (1c.0 1c.6)? vfio will require they be owned by vfio-pci to make use of 3:00.0, but assigning them to the guest is not recommended. Can you provide your qemu command line? I did hand all of them to the guest OS. Removing 1c.0 1c.6 from the qemu command line seems to have done the trick. Thanks! Great, though I'm still not sure how we were generating those DMAR faults. Here's my working qemu command line:
qemu-kvm -no-reboot -enable-kvm -cpu host -smp 4 -m 6G \
  -drive file=/home/test/qemu/images/win7_base_updated.qcow2,if=virtio,cache=none,media=disk,format=qcow2,index=0 \
  -full-screen -no-quit -no-frame -display sdl -vnc :1 -k de -usbdevice tablet \
  -vga std -global VGA.vgamem_mb=256 \
  -netdev tap,id=guest0,ifname=tap0,script=no,downscript=no \
  -net nic,netdev=guest0,model=virtio,macaddr=00:16:35:BE:EF:12 \
  -rtc base=localtime \
  -device vfio-pci,host=00:1b.0,id=audio \
  -device vfio-pci,host=00:1a.0,id=ehci1 \
  -device vfio-pci,host=00:1d.0,id=ehci2 \
  -device vfio-pci,host=03:00.0,id=xhci1 \
  -monitor tcp::,server,nowait
We need to re-visit how to handle pcieport devices with vfio-pci, perhaps white-listing it as a vfio compatible driver, but this still should not interfere with devices external to the group. The DMAR fault address looks pretty bogus unless you happen to have 100GB+ of ram in the system. Nope, definitely not. :) vfio makes use of the IOMMU API for programming DMA translations, so any reserved fields would have to be programmed by intel-iommu itself. We could of course be passing some kind of bogus data that intel-iommu isn't catching. If you're assigning the root ports to the guest, I'd start with that, don't do it. Attach them to vfio, but don't give them to the guest. Maybe that'll give us a hint.
> > I also notice that your USB 3 controller is dead:
> >
> > 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev ff) (prog-if ff)
> > !!! Unknown header type 7f
> >
> > We only see unknown header type 7f when the read from the device
> > returns -1. This might have something to do with the root port above it
> > (1c.6) being in state D3. Windows likes to put unused devices in D3,
> > which leads me to suspect you are giving it to the guest.
>
> The error no longer occurs. lspci now shows this:
>
> -- snip --
> 03:00.0 USB controller: NEC Corporation uPD720200 USB 3.0 Host Controller (rev 04) (prog-if 30 [XHCI])
> 	Subsystem: Intel Corporation Device 2008
> 	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
> 	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast TAbort- TAbort- MAbort+ SERR- PERR- INTx-
> 	Interrupt: pin A routed to IRQ 18
> 	Region 0: Memory at fe50 (64-bit, non-prefetchable) [disabled] [size=8K]
> 	Capabilities: [50] Power Management version 3
> 		Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
> 		Status: D3 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
> 	Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
> 		Address:  Data:
> 	Capabilities: [90] MSI-X: Enable- Count=8 Masked-
> 		Vector table: BAR=0 offset=1000
> 		PBA: BAR=0 offset=1080
> 	Capabilities: [a0] Express (v2) Endpoint, MSI 00
> 		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
> 			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
> 		DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
> 			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
> 			MaxPayload 128 bytes, MaxReadReq 128 bytes
> 		DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr+ TransPend-
> 		LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Latency L0 <4us, L1 unlimited
> 			ClockPM+ Surprise- LLActRep- BwNot-
> 		LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> 			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
> 		LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 		DevCap2: Completion Timeout: Not Supported, TimeoutDis+
> 		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
> 		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
> 			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
> 			Compliance De-emphasis: -6dB
> 		LnkSta2: Current De-emphasis Level:
Q: Why not use struct mm_struct to manage guest physical addresses in a new port?
Hi,

I am starting to work on a port of KVM to an architecture that has a dual
TLB. The Guest Virtual Addresses (GVA) are translated to Guest Physical
Addresses (GPA) by the first TLB, then a second TLB translates the GPA to
a Root Physical Address (RPA). For the sake of this question, we will
ignore the GVA-GPA TLB and consider only the GPA-RPA TLB.

It seems that most existing ports have a bunch of custom code that manages
the GPA-RPA TLB and page tables. Here is what I would like to try:

Create an mm for the GPA-RPA mappings; each vma would have a fault handler
that calls gfn_to_pfn() to look up the proper page. In
kvm_arch_vcpu_ioctl_run() we would call switch_mm() to this new 'gva_mm'.
Upon exiting guest mode we would switch back to the original mm of the
controlling process.

For me the benefit of this approach is that all the code that manages the
TLB is already implemented and works well for struct mm_struct. The only
thing I need to do is write a vma fault handler. That is a lot easier and
less error prone than maintaining a parallel TLB management framework and
making sure it interacts properly with the existing TLB code for 'normal'
processes.

Q1: Am I crazy for wanting to try this?

Q2: Have others tried this and rejected it? What were the reasons?

Thanks in advance,
David Daney
Cavium, Inc.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
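The proposed design is easy to model in userspace. The sketch below is a hypothetical illustration, not code from any port: `fake_gfn_to_pfn()` stands in for KVM's gfn_to_pfn(), and a plain array stands in for the page table that the dedicated mm would maintain. It shows the role the vma fault handler would play: translations start empty and get installed on first touch.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define NPAGES     16

/* Hypothetical model: the GPA->RPA "page table" starts empty and is
 * filled on demand by the fault handler, the way handle_mm_fault()
 * would install a pte after the vma's .fault callback resolves it. */
static uint64_t gpa_to_rpa[NPAGES];           /* 0 == no translation yet */

/* Stand-in for gfn_to_pfn(): pick the root page backing a guest frame. */
static uint64_t fake_gfn_to_pfn(uint64_t gfn)
{
	return 0x100 + gfn;
}

/* The proposed vma fault handler: on a GPA->RPA miss, look up the
 * guest frame and install the translation; later accesses reuse it. */
static uint64_t gpa_fault(uint64_t gpa)
{
	uint64_t gfn = gpa >> PAGE_SHIFT;

	assert(gfn < NPAGES);
	if (!gpa_to_rpa[gfn])
		gpa_to_rpa[gfn] = fake_gfn_to_pfn(gfn);
	return (gpa_to_rpa[gfn] << PAGE_SHIFT) |
	       (gpa & ((1ul << PAGE_SHIFT) - 1));
}
```

Two accesses to the same guest page resolve to the same root page, with the page offset preserved; only the first access takes the "fault" path.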
Re: DMAR faults from unrelated device when vfio is used
On Tue, 05 Feb 2013 13:36:53 -0700, Alex Williamson
alex.william...@redhat.com wrote:

> Ugh, the infamous and useless error 10. It could be anything. I've got
> a system with onboard usb3, let me see what windows does with it here
> first. Thanks,
>
> Well, I've got an Etron USB3 HBA and (un)fortunately it works just fine
> with a Win7 guest. There's really nothing special about USB controllers
> from a PCI device assignment perspective. Have you tried the latest
> upstream qemu bits? Thanks,

We also tried qemu v1.4.0-rc0 (git as of today), without success. As the
next step we'll test Linux as the guest; maybe it is more chatty than
Windows regarding the issue. :-)

Thanks,
//richard
[PATCH 2/4] Expand the steal time msr to also contain the consigned time.
Expand the steal time msr to also contain the consigned time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/paravirt.h       |  4 ++--
 arch/x86/include/asm/paravirt_types.h |  2 +-
 arch/x86/kernel/kvm.c                 |  7 ++-
 kernel/sched/core.c                   | 10 +-
 kernel/sched/cputime.c                |  2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 5edd174..9b753ea 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline u64 paravirt_steal_clock(int cpu)
+static inline void paravirt_steal_clock(int cpu, u64 *steal)
 {
-	return PVOP_CALL1(u64, pv_time_ops.steal_clock, cpu);
+	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 142236e..5d4fc8b 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -95,7 +95,7 @@ struct pv_lazy_ops {
 struct pv_time_ops {
 	unsigned long long (*sched_clock)(void);
-	unsigned long long (*steal_clock)(int cpu);
+	void (*steal_clock)(int cpu, unsigned long long *steal);
 	unsigned long (*get_tsc_khz)(void);
 };
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index fe75a28..89e5468 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,9 +386,8 @@ static struct notifier_block kvm_pv_reboot_nb = {
 	.notifier_call = kvm_pv_reboot_notify,
 };
 
-static u64 kvm_steal_clock(int cpu)
+static void kvm_steal_clock(int cpu, u64 *steal)
 {
-	u64 steal;
 	struct kvm_steal_time *src;
 	int version;
@@ -396,11 +395,9 @@ static u64 kvm_steal_clock(int cpu)
 	do {
 		version = src->version;
 		rmb();
-		steal = src->steal;
+		*steal = src->steal;
 		rmb();
 	} while ((version & 1) || (version != src->version));
-
-	return steal;
 }
 
 void kvm_disable_steal_time(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 26058d0..efc2652 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -757,6 +757,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 	 */
 #if defined(CONFIG_IRQ_TIME_ACCOUNTING) || defined(CONFIG_PARAVIRT_TIME_ACCOUNTING)
 	s64 steal = 0, irq_delta = 0;
+	u64 consigned = 0;
 #endif
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
 	irq_delta = irq_time_read(cpu_of(rq)) - rq->prev_irq_time;
@@ -785,8 +786,15 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
 #ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
 	if (static_key_false((&paravirt_steal_rq_enabled))) {
 		u64 st;
+		u64 cs;
 
-		steal = paravirt_steal_clock(cpu_of(rq));
+		paravirt_steal_clock(cpu_of(rq), &steal, &consigned);
+		/*
+		 * since we are not assigning the steal time to cpustats
+		 * here, just combine the steal and consigned times to
+		 * do the rest of the calculations.
+		 */
+		steal += consigned;
 		steal -= rq->prev_steal_time_rq;
 
 		if (unlikely(steal > delta))
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 825a956..0b4f1ec 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -275,7 +275,7 @@ static __always_inline bool steal_account_process_tick(void)
 	if (static_key_false(&paravirt_steal_enabled)) {
 		u64 steal, st = 0;
 
-		steal = paravirt_steal_clock(smp_processor_id());
+		paravirt_steal_clock(smp_processor_id(), &steal);
 		steal -= this_rq()->prev_steal_time;
 
 		st = steal_ticks(steal);
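The retry loop in kvm_steal_clock() follows the kvm_steal_time version protocol: the host makes `version` odd while it updates the record and even when done, so the guest rereads until it sees a stable, even version. A userspace analogue (illustrative only — `__sync_synchronize()` stands in for the kernel's rmb(), and the struct here mirrors just the fields this series touches) looks like:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model of the shared steal-time record. */
struct steal_time {
	volatile uint32_t version;   /* odd while the host is updating */
	volatile uint64_t steal;
	volatile uint64_t consigned; /* added later in this series */
};

/* Retry until an even, unchanged version brackets the reads, so
 * steal and consigned are from the same host update. */
static void read_steal(struct steal_time *src,
		       uint64_t *steal, uint64_t *consigned)
{
	uint32_t version;

	do {
		version = src->version;
		__sync_synchronize();        /* rmb() stand-in */
		*steal = src->steal;
		*consigned = src->consigned;
		__sync_synchronize();        /* rmb() stand-in */
	} while ((version & 1) || (version != src->version));
}
```

This is the same reason the patch keeps both rmb() calls after converting the return value to an out parameter: the ordering requirement is unchanged, only the calling convention differs.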
[PATCH 3/4] Add the code to send the consigned time from the host to the guest
Change the paravirt calls that retrieve the steal-time information from
the host. Add to it getting the consigned value as well as the steal time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h      |  1 +
 arch/x86/include/asm/paravirt.h      |  4 ++--
 arch/x86/include/uapi/asm/kvm_para.h |  3 ++-
 arch/x86/kernel/kvm.c                |  3 ++-
 arch/x86/kernel/paravirt.c           |  4 ++--
 arch/x86/kvm/x86.c                   |  2 ++
 include/linux/kernel_stat.h          |  1 +
 kernel/sched/cputime.c               | 21 +++--
 kernel/sched/sched.h                 |  2 ++
 9 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index dc87b65..fe5a37b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -428,6 +428,7 @@ struct kvm_vcpu_arch {
 		u64 msr_val;
 		u64 last_steal;
 		u64 accum_steal;
+		u64 accum_consigned;
 		struct gfn_to_hva_cache stime;
 		struct kvm_steal_time steal;
 	} st;
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 9b753ea..77f05e7 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -196,9 +196,9 @@ struct static_key;
 extern struct static_key paravirt_steal_enabled;
 extern struct static_key paravirt_steal_rq_enabled;
 
-static inline void paravirt_steal_clock(int cpu, u64 *steal)
+static inline void paravirt_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-	PVOP_VCALL2(pv_time_ops.steal_clock, cpu, steal);
+	PVOP_VCALL3(pv_time_ops.steal_clock, cpu, steal, consigned);
 }
 
 static inline unsigned long long paravirt_read_pmc(int counter)
diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 06fdbd9..55d617f 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -42,9 +42,10 @@ struct kvm_steal_time {
 	__u64 steal;
+	__u64 consigned;
 	__u32 version;
 	__u32 flags;
-	__u32 pad[12];
+	__u32 pad[10];
 };
 
 #define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 89e5468..fb52f8a 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -386,7 +386,7 @@ static struct notifier_block kvm_pv_reboot_nb = {
 	.notifier_call = kvm_pv_reboot_notify,
 };
 
-static void kvm_steal_clock(int cpu, u64 *steal)
+static void kvm_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
 	struct kvm_steal_time *src;
 	int version;
@@ -396,6 +396,7 @@ static void kvm_steal_clock(int cpu, u64 *steal)
 		version = src->version;
 		rmb();
 		*steal = src->steal;
+		*consigned = src->consigned;
 		rmb();
 	} while ((version & 1) || (version != src->version));
 }
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 17fff18..3797683 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -207,9 +207,9 @@ static void native_flush_tlb_single(unsigned long addr)
 struct static_key paravirt_steal_enabled;
 struct static_key paravirt_steal_rq_enabled;
 
-static u64 native_steal_clock(int cpu)
+static void native_steal_clock(int cpu, u64 *steal, u64 *consigned)
 {
-	return 0;
+	*steal = *consigned = 0;
 }
 
 /* These are in entry.S */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c243b81..51b63d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1867,8 +1867,10 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
 		return;
 
 	vcpu->arch.st.steal.steal += vcpu->arch.st.accum_steal;
+	vcpu->arch.st.steal.consigned += vcpu->arch.st.accum_consigned;
 	vcpu->arch.st.steal.version += 2;
 	vcpu->arch.st.accum_steal = 0;
+	vcpu->arch.st.accum_consigned = 0;
 
 	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
 		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index e352052..f58ed0f 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -126,6 +126,7 @@ extern unsigned long long task_delta_exec(struct task_struct *);
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, cputime_t);
 extern void account_steal_time(cputime_t);
+extern void account_consigned_time(cputime_t);
 extern void account_idle_time(cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 0b4f1ec..2a2d4be 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,6 +244,18 @@ void account_system_time(struct task_struct *p, int hardirq_offset,
 }
 
 /*
[PATCH 4/4] Add a timer to allow the separation of consigned from steal time.
Add a helper routine to scheduler/core.c to allow the kvm module to
retrieve the cpu hardlimit settings. The values will be used to set up a
timer that is used to separate the consigned from the steal time.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 arch/x86/include/asm/kvm_host.h |  9 ++
 arch/x86/kvm/x86.c              | 62 ++-
 kernel/sched/core.c             | 20 +
 3 files changed, 90 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fe5a37b..9518613 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -355,6 +355,15 @@ struct kvm_vcpu_arch {
 	bool tpr_access_reporting;
 
 	/*
+	 * timer used to determine if the time should be counted as
+	 * steal time or consigned time.
+	 */
+	struct hrtimer steal_timer;
+	u64 current_consigned;
+	s64 consigned_quota;
+	s64 consigned_period;
+
+	/*
 	 * Paging state of the vcpu
 	 *
 	 * If the vcpu runs in guest mode with two level paging this still saves
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 51b63d1..79d144d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1848,13 +1848,32 @@ static void kvmclock_reset(struct kvm_vcpu *vcpu)
 static void accumulate_steal_time(struct kvm_vcpu *vcpu)
 {
 	u64 delta;
+	u64 steal_delta;
+	u64 consigned_delta;
 
 	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
 		return;
 
 	delta = current->sched_info.run_delay - vcpu->arch.st.last_steal;
 	vcpu->arch.st.last_steal = current->sched_info.run_delay;
-	vcpu->arch.st.accum_steal = delta;
+
+	/* split the delta into steal and consigned */
+	if (vcpu->arch.current_consigned < vcpu->arch.consigned_quota) {
+		vcpu->arch.current_consigned += delta;
+		if (vcpu->arch.current_consigned > vcpu->arch.consigned_quota) {
+			steal_delta = vcpu->arch.current_consigned
+					- vcpu->arch.consigned_quota;
+			consigned_delta = delta - steal_delta;
+		} else {
+			consigned_delta = delta;
+			steal_delta = 0;
+		}
+	} else {
+		consigned_delta = 0;
+		steal_delta = delta;
+	}
+	vcpu->arch.st.accum_steal = steal_delta;
+	vcpu->arch.st.accum_consigned = consigned_delta;
 }
 
 static void record_steal_time(struct kvm_vcpu *vcpu)
@@ -2629,8 +2648,35 @@ static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
 		!(vcpu->kvm->arch.iommu_flags & KVM_IOMMU_CACHE_COHERENCY);
 }
 
+extern int sched_use_hard_capping(int cpuid, int num_vcpus, s64 *quota,
+		s64 *period);
+enum hrtimer_restart steal_timer_fn(struct hrtimer *data)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm *kvm;
+	int num_vcpus;
+	ktime_t now;
+
+	vcpu = container_of(data, struct kvm_vcpu, arch.steal_timer);
+	kvm = vcpu->kvm;
+	num_vcpus = atomic_read(&kvm->online_vcpus);
+	sched_use_hard_capping(vcpu->cpu, num_vcpus,
+			&vcpu->arch.consigned_quota,
+			&vcpu->arch.consigned_period);
+	vcpu->arch.current_consigned = 0;
+	now = ktime_get();
+	hrtimer_forward(&vcpu->arch.steal_timer, now,
+			ktime_set(0, vcpu->arch.consigned_period));
+
+	return HRTIMER_RESTART;
+}
+
 void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
+	struct kvm *kvm;
+	int num_vcpus;
+	ktime_t ktime;
+
 	/* Address WBINVD may be executed by guest */
 	if (need_emulate_wbinvd(vcpu)) {
 		if (kvm_x86_ops->has_wbinvd_exit())
@@ -2670,6 +2716,18 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		kvm_migrate_timers(vcpu);
 		vcpu->cpu = cpu;
 	}
+	/* Initialize and start a timer to capture steal and consigned time */
+	kvm = vcpu->kvm;
+	num_vcpus = atomic_read(&kvm->online_vcpus);
+	num_vcpus = (num_vcpus == 0) ? 1 : num_vcpus;
+	sched_use_hard_capping(vcpu->cpu, num_vcpus,
+			&vcpu->arch.consigned_quota,
+			&vcpu->arch.consigned_period);
+	hrtimer_init(&vcpu->arch.steal_timer, CLOCK_MONOTONIC,
+			HRTIMER_MODE_REL);
+	vcpu->arch.steal_timer.function = steal_timer_fn;
+	ktime = ktime_set(0, vcpu->arch.consigned_period);
+	hrtimer_start(&vcpu->arch.steal_timer, ktime, HRTIMER_MODE_REL);
 
 	accumulate_steal_time(vcpu);
 	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
@@ -2680,6 +2738,7 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 	kvm_x86_ops->vcpu_put(vcpu);
 	kvm_put_guest_fpu(vcpu);
 	vcpu->arch.last_host_tsc
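The delta-splitting arithmetic in accumulate_steal_time() is self-contained enough to model directly. The sketch below mirrors the patch's logic with hypothetical userspace names (it is not the kernel code): run_delay accrued inside the cfs quota counts as consigned time, and anything beyond the quota counts as real steal, including the case where one delta straddles the quota boundary.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-vcpu accounting state mirroring the patch's fields. */
struct vcpu_acct {
	int64_t  consigned_quota;   /* from the cfs bandwidth settings */
	int64_t  current_consigned; /* reset each consigned_period by the timer */
	uint64_t accum_steal;
	uint64_t accum_consigned;
};

/* Split one run_delay delta into consigned vs. steal. */
static void split_delta(struct vcpu_acct *a, uint64_t delta)
{
	uint64_t steal_delta, consigned_delta;

	if (a->current_consigned < a->consigned_quota) {
		a->current_consigned += delta;
		if (a->current_consigned > a->consigned_quota) {
			/* delta straddles the quota boundary: divide it */
			steal_delta = a->current_consigned - a->consigned_quota;
			consigned_delta = delta - steal_delta;
		} else {
			consigned_delta = delta;
			steal_delta = 0;
		}
	} else {
		/* quota already used up this period: all of it is steal */
		consigned_delta = 0;
		steal_delta = delta;
	}
	a->accum_steal = steal_delta;
	a->accum_consigned = consigned_delta;
}
```

With a quota of 100: a first delta of 60 is all consigned; a second delta of 60 splits into 40 consigned and 20 steal; after that, every further delta is pure steal until the period timer resets current_consigned.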
[PATCH 0/4] Alter steal-time reporting in the guest
In the case where you have a system that is running in a capped or
overcommitted environment, the user may see steal time being reported in
accounting tools such as top or vmstat. This can cause confusion for the
end user. To ease the confusion this patch set adds the idea of consigned
(expected steal) time. The host will separate the consigned time from the
steal time. The steal time will only be altered if hard limits (cfs
bandwidth control) are used. The period and the quota used to separate the
consigned time (expected steal) from the steal time are taken from the cfs
bandwidth control settings. Any other steal time accruing during that
period will show as the traditional steal time.

Changes from V2:
* Dropped the ioctl that allowed qemu to send the entitlement value to the guest.
* Added code to get the entitlement period and quota from cfs bandwidth.

Changes from V1:
* Removed the steal time allowed percentage from the guest.
* Moved the separation of consigned (expected steal) and steal time to the host.
* No longer include a sysctl interface.

---

Michael Wolf (4):
      Alter the amount of steal time reported by the guest.
      Expand the steal time msr to also contain the consigned time.
      Add the code to send the consigned time from the host to the guest
      Add a timer to allow the separation of consigned from steal time.

 arch/x86/include/asm/kvm_host.h       | 10 +
 arch/x86/include/asm/paravirt.h       |  4 +-
 arch/x86/include/asm/paravirt_types.h |  2 +
 arch/x86/include/uapi/asm/kvm_para.h  |  3 +-
 arch/x86/kernel/kvm.c                 |  8 ++--
 arch/x86/kernel/paravirt.c            |  4 +-
 arch/x86/kvm/x86.c                    | 64 -
 fs/proc/stat.c                        |  9 -
 include/linux/kernel_stat.h           |  2 +
 kernel/sched/core.c                   | 30 +++
 kernel/sched/cputime.c                | 21 ++-
 kernel/sched/sched.h                  |  2 +
 12 files changed, 142 insertions(+), 17 deletions(-)

-- 
Signature
[PATCH 1/4] Alter the amount of steal time reported by the guest.
Modify the amount of stealtime that the kernel reports via the /proc
interface. Steal time will now be broken down into steal_time and
consigned_time. Consigned_time will represent the amount of time that is
expected to be lost due to overcommitment of the physical cpu or by using
cpu hard capping.

Signed-off-by: Michael Wolf m...@linux.vnet.ibm.com
---
 fs/proc/stat.c              | 9 +++--
 include/linux/kernel_stat.h | 1 +
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index e296572..cb7fe80 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -82,7 +82,7 @@ static int show_stat(struct seq_file *p, void *v)
 	int i, j;
 	unsigned long jif;
 	u64 user, nice, system, idle, iowait, irq, softirq, steal;
-	u64 guest, guest_nice;
+	u64 guest, guest_nice, consign;
 	u64 sum = 0;
 	u64 sum_softirq = 0;
 	unsigned int per_softirq_sums[NR_SOFTIRQS] = {0};
@@ -90,10 +90,11 @@ static int show_stat(struct seq_file *p, void *v)
 
 	user = nice = system = idle = iowait = irq = softirq = steal = 0;
-	guest = guest_nice = 0;
+	guest = guest_nice = consign = 0;
 	getboottime(&boottime);
 	jif = boottime.tv_sec;
 
+
 	for_each_possible_cpu(i) {
 		user += kcpustat_cpu(i).cpustat[CPUTIME_USER];
 		nice += kcpustat_cpu(i).cpustat[CPUTIME_NICE];
@@ -105,6 +106,7 @@ static int show_stat(struct seq_file *p, void *v)
 		steal += kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest += kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice += kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		consign += kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
 		sum += kstat_cpu_irqs_sum(i);
 		sum += arch_irq_stat_cpu(i);
@@ -128,6 +130,7 @@ static int show_stat(struct seq_file *p, void *v)
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+	seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
 	seq_putc(p, '\n');
 
 	for_each_online_cpu(i) {
@@ -142,6 +145,7 @@ static int show_stat(struct seq_file *p, void *v)
 		steal = kcpustat_cpu(i).cpustat[CPUTIME_STEAL];
 		guest = kcpustat_cpu(i).cpustat[CPUTIME_GUEST];
 		guest_nice = kcpustat_cpu(i).cpustat[CPUTIME_GUEST_NICE];
+		consign = kcpustat_cpu(i).cpustat[CPUTIME_CONSIGN];
 		seq_printf(p, "cpu%d", i);
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(user));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(nice));
@@ -153,6 +157,7 @@ static int show_stat(struct seq_file *p, void *v)
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(steal));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest));
 		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(guest_nice));
+		seq_put_decimal_ull(p, ' ', cputime64_to_clock_t(consign));
 		seq_putc(p, '\n');
 	}
 	seq_printf(p, "intr %llu", (unsigned long long)sum);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 66b7078..e352052 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -28,6 +28,7 @@ enum cpu_usage_stat {
 	CPUTIME_STEAL,
 	CPUTIME_GUEST,
 	CPUTIME_GUEST_NICE,
+	CPUTIME_CONSIGN,
 	NR_STATS,
 };
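With this patch, each cpu line in /proc/stat grows a new column after guest_nice. A consumer that wants "real" steal as a share of total time could do something like the sketch below — a hedged illustration only: the struct just mirrors the patched column order, and `steal_pct()` is a hypothetical helper, not part of the patch. (Guest time is left out of the total here since the kernel already folds it into user time.)

```c
#include <assert.h>

/* Ticks from one /proc/stat cpu line, in the patched column order:
 * user nice system idle iowait irq softirq steal guest guest_nice consign */
struct cpu_ticks {
	unsigned long long user, nice, system, idle, iowait,
			   irq, softirq, steal, guest, guest_nice, consign;
};

/* Real (unexpected) steal as a percentage of total accounted time.
 * consign is counted in the total but not in the numerator, so capped
 * guests no longer show their expected throttling as steal. */
static double steal_pct(const struct cpu_ticks *t)
{
	unsigned long long total = t->user + t->nice + t->system + t->idle +
				   t->iowait + t->irq + t->softirq +
				   t->steal + t->consign;
	return total ? 100.0 * (double)t->steal / (double)total : 0.0;
}
```

For a capped guest the intent of the series is visible here: time throttled by cfs bandwidth lands in `consign` and stops inflating the steal percentage.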
Re: [PATCH v2] KVM: VMX: disable SMEP feature when guest is in non-paging mode
On Mon, Feb 04, 2013 at 11:50:43AM +0800, Dongxiao Xu wrote:

> Changes from v1 to v2:
> - Modify commit message and comments according to Gleb's suggestions.
>
> SMEP is disabled if the CPU is in non-paging mode in hardware. However,
> KVM always uses paging mode to emulate guest non-paging mode with TDP.
> To emulate this behavior, SMEP needs to be manually disabled when the
> guest switches to non-paging mode.
>
> We met an issue where an SMP Linux guest with a recent kernel (with SMEP
> support enabled, for example 3.5.3) would crash with a triple fault when
> unrestricted_guest=0 is set. This is because KVM uses an identity-mapping
> page table to emulate non-paging mode, and that page table is set with
> the USER flag. If SMEP is still enabled in this case, the guest will hit
> an unhandleable page fault and then crash.
>
> Signed-off-by: Dongxiao Xu dongxiao...@intel.com
> Signed-off-by: Xiantao Zhang xiantao.zh...@intel.com
> ---
>  arch/x86/kvm/vmx.c | 8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)

Applied, thanks.
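The behavior being emulated can be summarized with a small helper. This is an illustrative sketch, not the vmx.c code itself: `effective_guest_cr4()` is a hypothetical name; the point is that hardware ignores SMEP when paging is off, so the SMEP bit must be masked out of the CR4 value loaded for an unpaged TDP guest running on the USER-flagged identity map.

```c
#include <assert.h>
#include <stdint.h>

#define X86_CR4_SMEP (1ull << 20)  /* CR4.SMEP is bit 20 */

/* Compute the CR4 value to actually load for the guest: with paging
 * off, SMEP must not be enforced, otherwise instruction fetches from
 * the USER-flagged identity-map pages fault unrecoverably. */
static uint64_t effective_guest_cr4(uint64_t cr4, int guest_paging_on)
{
	if (!guest_paging_on)
		cr4 &= ~X86_CR4_SMEP;
	return cr4;
}
```

With paging on, the guest's SMEP setting passes through unchanged; with paging off, it is transparently dropped, matching what bare hardware would do.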
Re: [PATCH] KVM: VMX: add missing exit names to VMX_EXIT_REASONS array
Applied, thanks.
Re: [PATCH] KVM: Remove duplicate text in api.txt
On Thu, Jan 31, 2013 at 12:06:08PM -0800, Geoff Levand wrote:

> Signed-off-by: Geoff Levand ge...@infradead.org
> ---
> Saw this in v3.8-rc5, please apply.
>
>  Documentation/virtual/kvm/api.txt | 13 -
>  1 file changed, 13 deletions(-)

Applied, thanks.
[PATCH] tcm_vhost: Multi-queue support
This adds virtio-scsi multi-queue support to tcm_vhost.

Guest side virtio-scsi multi-queue support can be found here:
https://lkml.org/lkml/2012/12/18/166

Some initial perf numbers:

1 queue, 4 targets, 1 lun per target
4K request size, 50% randread + 50% randwrite: 127K/127K IOPS

4 queues, 4 targets, 1 lun per target
4K request size, 50% randread + 50% randwrite: 181K/181K IOPS

Signed-off-by: Asias He as...@redhat.com
---
 drivers/vhost/tcm_vhost.c | 46 +-
 drivers/vhost/tcm_vhost.h |  2 ++
 2 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/tcm_vhost.c b/drivers/vhost/tcm_vhost.c
index 81ecda5..9951297 100644
--- a/drivers/vhost/tcm_vhost.c
+++ b/drivers/vhost/tcm_vhost.c
@@ -48,6 +48,7 @@
 #include <linux/virtio_net.h> /* TODO vhost.h currently depends on this */
 #include <linux/virtio_scsi.h>
 #include <linux/llist.h>
+#include <linux/bitmap.h>
 
 #include "vhost.c"
 #include "vhost.h"
@@ -59,7 +60,8 @@ enum {
 	VHOST_SCSI_VQ_IO = 2,
 };
 
-#define VHOST_SCSI_MAX_TARGET 256
+#define VHOST_SCSI_MAX_TARGET	256
+#define VHOST_SCSI_MAX_VQ	128
 
 struct vhost_scsi {
 	/* Protected by vhost_scsi->dev.mutex */
@@ -68,7 +70,7 @@ struct vhost_scsi {
 	bool vs_endpoint;
 
 	struct vhost_dev dev;
-	struct vhost_virtqueue vqs[3];
+	struct vhost_virtqueue vqs[VHOST_SCSI_MAX_VQ];
 
 	struct vhost_work vs_completion_work; /* cmd completion work item */
 	struct llist_head vs_completion_list; /* cmd completion queue */
@@ -366,12 +368,14 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
 	struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
 					vs_completion_work);
+	DECLARE_BITMAP(signal, VHOST_SCSI_MAX_VQ);
 	struct virtio_scsi_cmd_resp v_rsp;
 	struct tcm_vhost_cmd *tv_cmd;
 	struct llist_node *llnode;
 	struct se_cmd *se_cmd;
-	int ret;
+	int ret, vq;
 
+	bitmap_zero(signal, VHOST_SCSI_MAX_VQ);
 	llnode = llist_del_all(&vs->vs_completion_list);
 	while (llnode) {
 		tv_cmd = llist_entry(llnode, struct tcm_vhost_cmd,
@@ -390,15 +394,20 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 		memcpy(v_rsp.sense, tv_cmd->tvc_sense_buf, v_rsp.sense_len);
 		ret = copy_to_user(tv_cmd->tvc_resp, &v_rsp, sizeof(v_rsp));
-		if (likely(ret == 0))
-			vhost_add_used(&vs->vqs[2], tv_cmd->tvc_vq_desc, 0);
-		else
+		if (likely(ret == 0)) {
+			vhost_add_used(tv_cmd->tvc_vq, tv_cmd->tvc_vq_desc, 0);
+			vq = tv_cmd->tvc_vq - vs->vqs;
+			__set_bit(vq, signal);
+		} else
 			pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
 		vhost_scsi_free_cmd(tv_cmd);
 	}
 
-	vhost_signal(&vs->dev, &vs->vqs[2]);
+	vq = -1;
+	while ((vq = find_next_bit(signal, VHOST_SCSI_MAX_VQ, vq + 1))
+		< VHOST_SCSI_MAX_VQ)
+		vhost_signal(&vs->dev, &vs->vqs[vq]);
 }
 
 static struct tcm_vhost_cmd *vhost_scsi_allocate_cmd(
@@ -561,9 +570,9 @@ static void tcm_vhost_submission_work(struct work_struct *work)
 	}
 }
 
-static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
+static void vhost_scsi_handle_vq(struct vhost_scsi *vs,
+	struct vhost_virtqueue *vq)
 {
-	struct vhost_virtqueue *vq = &vs->vqs[2];
 	struct virtio_scsi_cmd_req v_req;
 	struct tcm_vhost_tpg *tv_tpg;
 	struct tcm_vhost_cmd *tv_cmd;
@@ -656,7 +665,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 			ret = __copy_to_user(resp, &rsp, sizeof(rsp));
 			if (!ret)
 				vhost_add_used_and_signal(&vs->dev,
-							  &vs->vqs[2], head, 0);
+							  vq, head, 0);
 			else
 				pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
@@ -678,6 +687,7 @@ static void vhost_scsi_handle_vq(struct vhost_scsi *vs)
 			": %d\n", tv_cmd, exp_data_len, data_direction);
 
 		tv_cmd->tvc_vhost = vs;
+		tv_cmd->tvc_vq = vq;
 
 		if (unlikely(vq->iov[out].iov_len !=
 				sizeof(struct virtio_scsi_cmd_resp))) {
@@ -758,7 +768,7 @@ static void vhost_scsi_handle_kick(struct vhost_work *work)
 						poll.work);
 	struct vhost_scsi *vs = container_of(vq->dev, struct vhost_scsi, dev);
 
-	vhost_scsi_handle_vq(vs);
+	vhost_scsi_handle_vq(vs, vq);
 }
 
 /*
@@ -879,7 +889,7 @@ err:
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
 	struct vhost_scsi *s;
-	int r;
+	int r, i;
 
 	s =
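The completion-path change — collect the virtqueues of finished commands in a bitmap, then signal each marked vq exactly once instead of unconditionally signalling the single I/O vq — can be modeled in userspace C. All names below are hypothetical stand-ins for DECLARE_BITMAP/__set_bit, the find_next_bit() scan, and vhost_signal():

```c
#include <assert.h>
#include <stdint.h>

#define MAX_VQ 128

/* One bit per virtqueue with at least one completed command. */
static uint64_t signal_map[MAX_VQ / 64];

/* Stand-in for __set_bit(vq, signal): mark a vq as needing a signal. */
static void mark_vq(int vq)
{
	signal_map[vq / 64] |= 1ull << (vq % 64);
}

/* Stand-in for the find_next_bit() loop: "signal" each marked vq once
 * and return how many signals were sent. */
static int signal_marked_vqs(void)
{
	int vq, sent = 0;

	for (vq = 0; vq < MAX_VQ; vq++)
		if (signal_map[vq / 64] >> (vq % 64) & 1)
			sent++;  /* real code: vhost_signal(&vs->dev, &vs->vqs[vq]) */
	return sent;
}
```

The point of the bitmap is coalescing: however many completions land on one virtqueue in a single work-item run, its guest is interrupted at most once, while untouched virtqueues are not signalled at all.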
Re: [PATCH] kvm tools: Fix powerpc build after kvm__dump_mem() change
On Mon, Feb 4, 2013 at 9:37 PM, Will Deacon <will.dea...@arm.com> wrote:
> On Mon, Feb 04, 2013 at 12:45:53PM +0000, Pekka Enberg wrote:
>> On Mon, Feb 4, 2013 at 2:17 PM, Michael Ellerman <mich...@ellerman.id.au> wrote:
>>> Commit 21692d1 ("Beautify debug output") broke the powerpc build
>>> because it changed the signature for kvm__dump_mem() but didn't
>>> update all callers.
>>>
>>> Signed-off-by: Michael Ellerman <mich...@ellerman.id.au>
>>
>> Applied, thanks Michael!
>
> D'oh! I was about to post a patch to fix this (arm is also broken)...
> I've dropped the powerpc hunk from my patch (see below).

Thanks, Michael and Will!

> Cheers,
>
> Will
>
> ---8<---
>
> From 9cdc8ecfdd2a1ec3075c0a129e934433f94c29dc Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.dea...@arm.com>
> Date: Sun, 3 Feb 2013 15:37:48 +0000
> Subject: [PATCH] kvm tools: arm: fix fallout from debug_fd refactoring
>
> Commit 21692d19f744 ("kvm tools: Beautify debug output") changed the
> kvm__dump_mem prototype but only fixed up calls from x86. This patch
> fixes arm to pass the debug_fd as required.
>
> Signed-off-by: Will Deacon <will.dea...@arm.com>
> ---
>  tools/kvm/arm/aarch32/kvm-cpu.c | 11 +--
>  tools/kvm/arm/aarch64/kvm-cpu.c | 11 +--
>  2 files changed, 10 insertions(+), 12 deletions(-)
>
> diff --git a/tools/kvm/arm/aarch32/kvm-cpu.c b/tools/kvm/arm/aarch32/kvm-cpu.c
> index a528789..6a012db 100644
> --- a/tools/kvm/arm/aarch32/kvm-cpu.c
> +++ b/tools/kvm/arm/aarch32/kvm-cpu.c
> @@ -54,25 +54,24 @@ void kvm_cpu__show_code(struct kvm_cpu *vcpu)
>  {
>  	struct kvm_one_reg reg;
>  	u32 data;
> +	int debug_fd = kvm_cpu__get_debug_fd();
>
>  	reg.addr = (u64)(unsigned long)&data;
>
> -	printf("*pc:\n");
> +	dprintf(debug_fd, "\n*pc:\n");
>  	reg.id = ARM_CORE_REG(usr_regs.ARM_pc);
>  	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
>  		die("KVM_GET_ONE_REG failed (show_code @ PC)");
>
> -	kvm__dump_mem(vcpu->kvm, data, 32);
> -	printf("\n");
> +	kvm__dump_mem(vcpu->kvm, data, 32, debug_fd);
>
> -	printf("*lr (svc):\n");
> +	dprintf(debug_fd, "\n*lr (svc):\n");
>  	reg.id = ARM_CORE_REG(svc_regs[1]);
>  	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
>  		die("KVM_GET_ONE_REG failed (show_code @ LR_svc)");
>  	data &= ~0x1;
>
> -	kvm__dump_mem(vcpu->kvm, data, 32);
> -	printf("\n");
> +	kvm__dump_mem(vcpu->kvm, data, 32, debug_fd);
>  }
>
>  void kvm_cpu__show_registers(struct kvm_cpu *vcpu)
> diff --git a/tools/kvm/arm/aarch64/kvm-cpu.c b/tools/kvm/arm/aarch64/kvm-cpu.c
> index 2eb06ea..7cdcb70 100644
> --- a/tools/kvm/arm/aarch64/kvm-cpu.c
> +++ b/tools/kvm/arm/aarch64/kvm-cpu.c
> @@ -109,24 +109,23 @@ void kvm_cpu__show_code(struct kvm_cpu *vcpu)
>  {
>  	struct kvm_one_reg reg;
>  	unsigned long data;
> +	int debug_fd = kvm_cpu__get_debug_fd();
>
>  	reg.addr = (u64)&data;
>
> -	printf("*pc:\n");
> +	dprintf(debug_fd, "\n*pc:\n");
>  	reg.id = ARM64_CORE_REG(regs.pc);
>  	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
>  		die("KVM_GET_ONE_REG failed (show_code @ PC)");
>
> -	kvm__dump_mem(vcpu->kvm, data, 32);
> -	printf("\n");
> +	kvm__dump_mem(vcpu->kvm, data, 32, debug_fd);
>
> -	printf("*lr:\n");
> +	dprintf(debug_fd, "\n*lr:\n");
>  	reg.id = ARM64_CORE_REG(regs.regs[30]);
>  	if (ioctl(vcpu->vcpu_fd, KVM_GET_ONE_REG, &reg) < 0)
>  		die("KVM_GET_ONE_REG failed (show_code @ LR)");
>
> -	kvm__dump_mem(vcpu->kvm, data, 32);
> -	printf("\n");
> +	kvm__dump_mem(vcpu->kvm, data, 32, debug_fd);
>  }
>
>  void kvm_cpu__show_registers(struct kvm_cpu *vcpu)
> --
> 1.8.0
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] tcm_vhost: Multi-queue support
On Wed, 2013-02-06 at 13:20 +0800, Asias He wrote:
> This adds virtio-scsi multi-queue support to tcm_vhost. Guest side
> virtio-scsi multi-queue support can be found here:
>
>   https://lkml.org/lkml/2012/12/18/166
>
> Some initial perf numbers:
>
> 1 queue, 4 targets, 1 lun per target
> 4K request size, 50% randread + 50% randwrite: 127K/127k IOPS
>
> 4 queues, 4 targets, 1 lun per target
> 4K request size, 50% randread + 50% randwrite: 181K/181k IOPS

Nice single LUN small block random I/O improvement here with 4x
vqueues.

Curious to see how virtio-scsi small block performance looks with
SCSI-core to multi-LUN tcm_vhost endpoints as well.. 8-)

Btw, this does not apply atop current target-pending.git/for-next with
your other pending vhost patch series, and AFAICT this patch is
supposed to apply on top of your last PATCH-v3, no..?

--nab

> Signed-off-by: Asias He <as...@redhat.com>
Re: [PATCH] tcm_vhost: Multi-queue support
On 02/06/2013 02:45 PM, Nicholas A. Bellinger wrote:
> On Wed, 2013-02-06 at 13:20 +0800, Asias He wrote:
>> This adds virtio-scsi multi-queue support to tcm_vhost. Guest side
>> virtio-scsi multi-queue support can be found here:
>>
>>   https://lkml.org/lkml/2012/12/18/166
>>
>> Some initial perf numbers:
>>
>> 1 queue, 4 targets, 1 lun per target
>> 4K request size, 50% randread + 50% randwrite: 127K/127k IOPS
>>
>> 4 queues, 4 targets, 1 lun per target
>> 4K request size, 50% randread + 50% randwrite: 181K/181k IOPS
>
> Nice single LUN small block random I/O improvement here with 4x
> vqueues.
>
> Curious to see how virtio-scsi small block performance looks with
> SCSI-core to multi-LUN tcm_vhost endpoints as well.. 8-)

Do you mean something like this?

1 queue, 2 targets, 2 lun per target
4 queues, 2 targets, 2 lun per target

> Btw, this does not apply atop current target-pending.git/for-next with
> your other pending vhost patch series, and AFAICT this patch is
> supposed to apply on top of your last PATCH-v3, no..?

Ah, this applies on top of mst's "tcm_vhost: fix pr_err on early kick"
patch plus my last v3 of "tcm_vhost: Multi-target support".