Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On 02/12/2015 12:00 PM, Frederic Weisbecker wrote:
> On Thu, Feb 12, 2015 at 10:47:10AM -0500, Rik van Riel wrote:
>> On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
>>> On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
>>>> If exception_enter happens when already in IN_KERNEL state, the
>>>> code still calls context_tracking_exit, which ends up in
>>>> rcu_eqs_exit_common, which explodes with a WARN_ON when it is
>>>> called in a situation where dynticks are not enabled.
>>>
>>> Fortunately context_tracking_exit() already has a
>>> current_state == IN_KERNEL check, so this shouldn't be a problem.
>>
>> No, it had a hard-coded current_state == IN_USER check, which is very
>> close, but ... I replaced that with a state argument, and forgot to
>> ensure that it never gets called with state == IN_KERNEL. This patch
>> fixes that.
>
> Ah, that's right! Well, I'm going to merge this patch into 1/5 then, to
> avoid breaking bisection.

Thank you, Frederic!

--
All rights reversed
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[Bug 92291] kvm/guest crashes when smp 1 with AMD FX8300; with host kernel oops from abrt as well
https://bugzilla.kernel.org/show_bug.cgi?id=92291

--- Comment #10 from Mark kernelbugzilla.org.mark...@dfgh.net ---
Well, thanks very much guys. I'll cautiously say that a patched guest kernel seems to resolve it :-)

The bug seemed to appear even when the host is untainted; the serial log says

[ 375.989736] divide error: [#1] SMP

That's a Kubuntu partition that manages without the proprietary drivers, while Fedora won't seem to give graphics properly with nouveau. Not entirely satisfactory of course, given that I've now had to modify a virtual machine that it was important to try to avoid modifying for testing purposes; although definitely better than crashing :-)

Do we keep the bug report open so that the host kvm handling bug gets fixed too / is this the right place to report that?

--
You are receiving this mail because: You are watching the assignee of the bug.
[PATCH v2 3/4] KVM: x86: avoid logical_map when it is invalid
We want to support mixed modes and the easiest solution is to avoid optimizing those weird and unlikely scenarios.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
v2
- optimize and name a check for valid map [Paolo]
- don't use cluster id with invalid map [Paolo]
- define KVM_APIC_MODE_* closer to kvm_apic_map (personal preference)

 arch/x86/include/asm/kvm_host.h |  5 +
 arch/x86/kvm/lapic.c            | 24 +++-
 2 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index ea673e18b20f..63a4e03fbdcf 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -550,8 +550,13 @@ struct kvm_arch_memory_slot {
 	struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+#define KVM_APIC_MODE_XAPIC_CLUSTER	4
+#define KVM_APIC_MODE_XAPIC_FLAT	8
+#define KVM_APIC_MODE_X2APIC		16
+
 struct kvm_apic_map {
 	struct rcu_head rcu;
+	u8 mode;
 	u8 ldr_bits;
 	/* fields bellow are used to decode ldr values in different modes */
 	u32 cid_shift, cid_mask, lid_mask;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 729bd7714790..88e2bf3be235 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -133,6 +133,14 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
 	return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+/* The logical map is definitely wrong if we have multiple
+ * modes at the same time.  (Physical map is always right.)
+ */
+static inline bool kvm_apic_logical_map_valid(struct kvm_apic_map *map)
+{
+	return !(map->mode & (map->mode - 1));
+}
+
 static void recalculate_apic_map(struct kvm *kvm)
 {
 	struct kvm_apic_map *new, *old = NULL;
@@ -162,16 +170,19 @@ static void recalculate_apic_map(struct kvm *kvm)
 			new->ldr_bits = 32;
 			new->cid_shift = 16;
 			new->cid_mask = new->lid_mask = 0xffff;
+			new->mode |= KVM_APIC_MODE_X2APIC;
 		} else if (kvm_apic_get_reg(apic, APIC_LDR)) {
 			if (kvm_apic_get_reg(apic, APIC_DFR) ==
 					APIC_DFR_CLUSTER) {
 				new->cid_shift = 4;
 				new->cid_mask = 0xf;
 				new->lid_mask = 0xf;
+				new->mode |= KVM_APIC_MODE_XAPIC_CLUSTER;
 			} else {
 				new->cid_shift = 8;
 				new->cid_mask = 0;
 				new->lid_mask = 0xff;
+				new->mode |= KVM_APIC_MODE_XAPIC_FLAT;
 			}
 		}
@@ -198,6 +209,10 @@ static void recalculate_apic_map(struct kvm *kvm)
 		if (aid < ARRAY_SIZE(new->phys_map))
 			new->phys_map[aid] = apic;
+
+		if (!kvm_apic_logical_map_valid(new))
+			continue;
+
 		if (lid && cid < ARRAY_SIZE(new->logical_map))
 			new->logical_map[cid][ffs(lid) - 1] = apic;
 	}
@@ -715,7 +730,14 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 		dst = &map->phys_map[irq->dest_id];
 	} else {
 		u32 mda = irq->dest_id << (32 - map->ldr_bits);
-		u16 cid = apic_cluster_id(map, mda);
+		u16 cid;
+
+		if (!kvm_apic_logical_map_valid(map)) {
+			ret = false;
+			goto out;
+		}
+
+		cid = apic_cluster_id(map, mda);
 
 		if (cid >= ARRAY_SIZE(map->logical_map))
 			goto out;
-- 
2.3.0
[PATCH v2 1/4] KVM: x86: use MDA for interrupt matching
In mixed modes, we mustn't deliver xAPIC IPIs like x2APIC and vice versa. Instead of preserving the information in apic_send_ipi(), we regain it by converting all destinations into the correct MDA in the slow path. This allows easier reasoning about subsequent matching.

Our kvm_apic_broadcast() had an interesting design decision: it didn't consider IOxAPIC 0xff as broadcast in x2APIC mode ... everything worked because the IOxAPIC can't set that in physical mode, and logical mode considered it a message for the first 8 VCPUs. This patch interprets IOxAPIC 0xff as an x2APIC broadcast.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
v2
- accept two 'struct lapic *' in kvm_apic_mda() for nicer code [Paolo]
- removed XXX comment about checking if x2APIC broadcast is accepted
  (it is, and the code also accepts any x2APIC message starting with 0xff,
  like v1, for simplicity.)

 arch/x86/kvm/lapic.c | 40 +---
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 86609c15726f..4812bde5090c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -585,15 +585,23 @@ static void apic_set_tpr(struct kvm_lapic *apic, u32 tpr)
 	apic_update_ppr(apic);
 }
 
-static bool kvm_apic_broadcast(struct kvm_lapic *apic, u32 dest)
+static bool kvm_apic_broadcast(struct kvm_lapic *apic, u32 mda)
 {
-	return dest == (apic_x2apic_mode(apic) ?
-			X2APIC_BROADCAST : APIC_BROADCAST);
+	if (apic_x2apic_mode(apic))
+		return mda == X2APIC_BROADCAST;
+
+	return GET_APIC_DEST_FIELD(mda) == APIC_BROADCAST;
 }
 
-static bool kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 dest)
+static bool kvm_apic_match_physical_addr(struct kvm_lapic *apic, u32 mda)
 {
-	return kvm_apic_id(apic) == dest || kvm_apic_broadcast(apic, dest);
+	if (kvm_apic_broadcast(apic, mda))
+		return true;
+
+	if (apic_x2apic_mode(apic))
+		return mda == kvm_apic_id(apic);
+
+	return mda == SET_APIC_DEST_FIELD(kvm_apic_id(apic));
 }
 
 static bool kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
@@ -610,6 +618,7 @@ static bool kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
 		       (logical_id & mda & 0xffff) != 0;
 
 	logical_id = GET_APIC_LOGICAL_ID(logical_id);
+	mda = GET_APIC_DEST_FIELD(mda);
 
 	switch (kvm_apic_get_reg(apic, APIC_DFR)) {
 	case APIC_DFR_FLAT:
@@ -624,10 +633,27 @@ static bool kvm_apic_match_logical_addr(struct kvm_lapic *apic, u32 mda)
 	}
 }
 
+/* KVM APIC implementation has two quirks
+ * - dest always begins at 0 while xAPIC MDA has offset 24,
+ * - IOxAPIC messages have to be delivered (directly) to x2APIC.
+ */
+static u32 kvm_apic_mda(unsigned int dest_id, struct kvm_lapic *source,
+			struct kvm_lapic *target)
+{
+	bool ipi = source != NULL;
+	bool x2apic_mda = apic_x2apic_mode(ipi ? source : target);
+
+	if (!ipi && dest_id == APIC_BROADCAST && x2apic_mda)
+		return X2APIC_BROADCAST;
+
+	return x2apic_mda ? dest_id : SET_APIC_DEST_FIELD(dest_id);
+}
+
 bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 			 int short_hand, unsigned int dest, int dest_mode)
 {
 	struct kvm_lapic *target = vcpu->arch.apic;
+	u32 mda = kvm_apic_mda(dest, source, target);
 
 	apic_debug("target %p, source %p, dest 0x%x, "
 		   "dest_mode 0x%x, short_hand 0x%x\n",
@@ -637,9 +663,9 @@ bool kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
 	switch (short_hand) {
 	case APIC_DEST_NOSHORT:
 		if (dest_mode == APIC_DEST_PHYSICAL)
-			return kvm_apic_match_physical_addr(target, dest);
+			return kvm_apic_match_physical_addr(target, mda);
 		else
-			return kvm_apic_match_logical_addr(target, dest);
+			return kvm_apic_match_logical_addr(target, mda);
 	case APIC_DEST_SELF:
 		return target == source;
 	case APIC_DEST_ALLINC:
-- 
2.3.0
[PATCH v2 4/4] KVM: x86: simplify kvm_apic_map
recalculate_apic_map() uses two passes over all VCPUs. This is a relic from the time when we selected a global mode in the first pass and set up the optimized table in the second pass (to have a consistent mode).

Recent changes made mixed mode unoptimized and we can do it in one pass. The format of the logical MDA is a function of the mode, so we encode it in apic_logical_id() and drop the obsoleted variables from the struct.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
v2
- use different optimization to shorten apic_logical_id [Paolo]
- move apic_logical_id to lapic.c [Paolo]
- keep apic map validity check separate from apic_logical_id
  (It can be done now as it's optimized in a different way.)
- rename apic_logical_id parameter from ldr to dest_id
  (To avoid hinting that it has anything to do with the real APIC.)

 arch/x86/include/asm/kvm_host.h |  3 --
 arch/x86/kvm/lapic.c            | 75 ++---
 arch/x86/kvm/lapic.h            | 15 -
 3 files changed, 25 insertions(+), 68 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63a4e03fbdcf..c55dcaa5aead 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -557,9 +557,6 @@ struct kvm_arch_memory_slot {
 struct kvm_apic_map {
 	struct rcu_head rcu;
 	u8 mode;
-	u8 ldr_bits;
-	/* fields bellow are used to decode ldr values in different modes */
-	u32 cid_shift, cid_mask, lid_mask;
 	struct kvm_lapic *phys_map[256];
 	/* first index is cluster id second is cpu id in a cluster */
 	struct kvm_lapic *logical_map[16][16];
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 88e2bf3be235..3661f79642c9 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -141,6 +141,17 @@ static inline bool kvm_apic_logical_map_valid(struct kvm_apic_map *map)
 	return !(map->mode & (map->mode - 1));
 }
 
+static inline void
+apic_logical_id(struct kvm_apic_map *map, u32 dest_id, u16 *cid, u16 *lid)
+{
+	BUILD_BUG_ON(KVM_APIC_MODE_XAPIC_CLUSTER != 4);
+	BUILD_BUG_ON(KVM_APIC_MODE_XAPIC_FLAT    != 8);
+	BUILD_BUG_ON(KVM_APIC_MODE_X2APIC       != 16);
+
+	*cid = dest_id >> map->mode;
+	*lid = dest_id & ((1 << map->mode) - 1);
+}
+
 static void recalculate_apic_map(struct kvm *kvm)
 {
 	struct kvm_apic_map *new, *old = NULL;
@@ -154,49 +165,6 @@ static void recalculate_apic_map(struct kvm *kvm)
 	if (!new)
 		goto out;
 
-	new->ldr_bits = 8;
-	/* flat mode is default */
-	new->cid_shift = 8;
-	new->cid_mask = 0;
-	new->lid_mask = 0xff;
-
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		struct kvm_lapic *apic = vcpu->arch.apic;
-
-		if (!kvm_apic_present(vcpu))
-			continue;
-
-		if (apic_x2apic_mode(apic)) {
-			new->ldr_bits = 32;
-			new->cid_shift = 16;
-			new->cid_mask = new->lid_mask = 0xffff;
-			new->mode |= KVM_APIC_MODE_X2APIC;
-		} else if (kvm_apic_get_reg(apic, APIC_LDR)) {
-			if (kvm_apic_get_reg(apic, APIC_DFR) ==
-					APIC_DFR_CLUSTER) {
-				new->cid_shift = 4;
-				new->cid_mask = 0xf;
-				new->lid_mask = 0xf;
-				new->mode |= KVM_APIC_MODE_XAPIC_CLUSTER;
-			} else {
-				new->cid_shift = 8;
-				new->cid_mask = 0;
-				new->lid_mask = 0xff;
-				new->mode |= KVM_APIC_MODE_XAPIC_FLAT;
-			}
-		}
-
-		/*
-		 * All APICs have to be configured in the same mode by an OS.
-		 * We take advatage of this while building logical id loockup
-		 * table. After reset APICs are in software disabled mode, so if
-		 * we find apic with different setting we assume this is the mode
-		 * OS wants all apics to be in; build lookup table accordingly.
-		 */
-		if (kvm_apic_sw_enabled(apic))
-			break;
-	}
-
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		struct kvm_lapic *apic = vcpu->arch.apic;
 		u16 cid, lid;
@@ -204,15 +172,25 @@ static void recalculate_apic_map(struct kvm *kvm)
 		aid = kvm_apic_id(apic);
 		ldr = kvm_apic_get_reg(apic, APIC_LDR);
-		cid = apic_cluster_id(new, ldr);
-		lid = apic_logical_id(new, ldr);
 
 		if (aid < ARRAY_SIZE(new->phys_map))
 			new->phys_map[aid] = apic;
 
-		if (!kvm_apic_logical_map_valid(new))
+		if (apic_x2apic_mode(apic)) {
+
[PATCH v2 2/4] KVM: x86: fix mixed APIC mode broadcast
Broadcast allowed only one global APIC mode, but mixed modes are theoretically possible. An x2APIC IPI doesn't mean 0xff as broadcast; the rest does.

x2APIC broadcasts are accepted by xAPIC. If we take the SDM to be logical, even addresses beginning with 0xff should be accepted, but real hardware disagrees. This patch aims for simple code by considering most of the real behavior as undefined.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
v2
- don't deliver all phys messages beginning with 0xff to xAPIC

 arch/x86/include/asm/kvm_host.h | 2 +-
 arch/x86/kvm/lapic.c            | 9 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a236e39cc385..ea673e18b20f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -554,7 +554,7 @@ struct kvm_apic_map {
 	struct rcu_head rcu;
 	u8 ldr_bits;
 	/* fields bellow are used to decode ldr values in different modes */
-	u32 cid_shift, cid_mask, lid_mask, broadcast;
+	u32 cid_shift, cid_mask, lid_mask;
 	struct kvm_lapic *phys_map[256];
 	/* first index is cluster id second is cpu id in a cluster */
 	struct kvm_lapic *logical_map[16][16];
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 4812bde5090c..729bd7714790 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -151,7 +151,6 @@ static void recalculate_apic_map(struct kvm *kvm)
 	new->cid_shift = 8;
 	new->cid_mask = 0;
 	new->lid_mask = 0xff;
-	new->broadcast = APIC_BROADCAST;
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		struct kvm_lapic *apic = vcpu->arch.apic;
@@ -163,7 +162,6 @@ static void recalculate_apic_map(struct kvm *kvm)
 			new->ldr_bits = 32;
 			new->cid_shift = 16;
 			new->cid_mask = new->lid_mask = 0xffff;
-			new->broadcast = X2APIC_BROADCAST;
 		} else if (kvm_apic_get_reg(apic, APIC_LDR)) {
 			if (kvm_apic_get_reg(apic, APIC_DFR) ==
 					APIC_DFR_CLUSTER) {
@@ -687,6 +685,7 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 	struct kvm_lapic **dst;
 	int i;
 	bool ret = false;
+	bool x2apic_ipi = src && apic_x2apic_mode(src);
 
 	*r = -1;
@@ -698,15 +697,15 @@ bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
 	if (irq->shorthand)
 		return false;
 
+	if (irq->dest_id == (x2apic_ipi ? X2APIC_BROADCAST : APIC_BROADCAST))
+		return false;
+
 	rcu_read_lock();
 	map = rcu_dereference(kvm->arch.apic_map);
 
 	if (!map)
 		goto out;
 
-	if (irq->dest_id == map->broadcast)
-		goto out;
-
 	ret = true;
 
 	if (irq->dest_mode == APIC_DEST_PHYSICAL) {
-- 
2.3.0
[PATCH v2 0/4] KVM: APIC improvements (with bonus mixed mode)
Each patch has a diff from v1; here is only a prologue on the mythical mixed xAPIC and x2APIC mode.

There is one interesting alias in the xAPIC and x2APIC ICR destination, the 0xff00, which is a broadcast in xAPIC and either a physical message to a high x2APIC ID or a message to an empty set in x2APIC cluster mode. This corner case in mixed mode can be triggered by a weird message:
1) when we have no x2APIC ID 0xff00, but send an x2APIC message there, or
2) after something that isn't forbidden in the SDM, most likely because they didn't think that anyone would ever consider it: we have x2APIC ID 0xff00 and reset the other x2APICs into xAPIC mode.

Current KVM doesn't need to consider (2), so there is only a slim chance that some hobbyist OS pushes the specification to its limits.

The problem is that the SDM for xAPIC doesn't believably specify[1] whether all messages beginning with 0xff are considered broadcasts. No volunteer came to clear this up, so I hacked Linux to have one x2APIC core between xAPIC ones. A physical IPI to 0xff00 got delivered only to CPU0, like most other destinations; a logical IPI to 0xff00 got dropped, and only 0x worked as a broadcast in both modes. I think it is because Intel never considered mixed mode to be valid, and the delivery we saw might be an emergent feature of QPI routing. Luckily, broadcast from xAPIC isn't delivered to x2APIC.

Real exploration would require greater modifications to Linux (ideally writing a custom kernel), so this series implements something that makes some sense and isn't too far from reality.

[1] 10.6.2.1: "A broadcast IPI (bits 28-31 of the MDA are 1's)"
Radim Krčmář (4):
  KVM: x86: use MDA for interrupt matching
  KVM: x86: fix mixed APIC mode broadcast
  KVM: x86: avoid logical_map when it is invalid
  KVM: x86: simplify kvm_apic_map

 arch/x86/include/asm/kvm_host.h |   8 ++-
 arch/x86/kvm/lapic.c            | 138 +++-
 arch/x86/kvm/lapic.h            |  15 -
 3 files changed, 85 insertions(+), 76 deletions(-)

-- 
2.3.0
[PATCH V3] x86 spinlock: Fix memory corruption on completing completions
Paravirt spinlock clears the slowpath flag after doing unlock. As explained by Linus, currently it does:

	prev = *lock;
	add_smp(&lock->tickets.head, TICKET_LOCK_INC);

	/* add_smp() is a full mb() */

	if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
		__ticket_unlock_slowpath(lock, prev);

"which is *exactly* the kind of thing you cannot do with spinlocks, because after you've done the add_smp() and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure."

Linus suggested that we should not do any writes to lock after unlock(), and that we can move the slowpath clearing to fastpath lock.

So this patch implements the fix with:
1. Moving the slowpath flag to head (Oleg).
2. Using xadd to avoid read/write after unlock that checks the need for unlock_kick (Linus).

Result:
setup: 16 core (32 cpu +ht sandy bridge 8GB 16vcpu guest)

benchmark   overcommit  %improve
kernbench   1x          -0.13
kernbench   2x           0.02
dbench      1x          -1.77
dbench      2x          -0.63

[Jeremy: hinted missing TICKET_LOCK_INC for kick]
[Oleg: Moving slowpath flag to head, ticket_equals idea]

Reported-by: Sasha Levin sasha.le...@oracle.com
Suggested-by: Linus Torvalds torva...@linux-foundation.org
Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com
---
 arch/x86/include/asm/spinlock.h | 87 -
 arch/x86/kernel/kvm.c           |  4 +-
 2 files changed, 45 insertions(+), 46 deletions(-)

potential TODO:
* Should the whole patch be split into: 1. move slowpath flag, 2. fix memory corruption in completion problem?
* Maybe we could directly pass inc for __ticket_check_and_clear_slowpath, but I hope the current code is more readable.

Changes since V2:
- Move the slowpath flag to head; this enables xadd usage in unlock code, and in turn we can get rid of read/write after unlock (Oleg)
- usage of ticket_equals (Oleg)

Changes since V1:
- Add missing TICKET_LOCK_INC before unlock kick (fixes hang in overcommit: Jeremy).
- Remove SLOWPATH_FLAG clearing in fast lock. (Jeremy)
- clear SLOWPATH_FLAG in arch_spin_value_unlocked during comparison.

Note: The current implementation is still based on avoiding writes after unlock. We could still have a potential invalid memory read. (Sasha)

Result:
setup: 16 core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
base = 3_19_rc7, patched = 3_19_rc7_spinfix_v3

kernbench (Time taken in sec, lower is better)
      base     %stdev   patched  %stdev   %improve
1x    54.2300  3.0652   54.3008  4.0366   -0.13056
2x    90.1883  5.5509   90.1650  6.4336    0.02583

dbench (Throughput, higher is better)
      base       %stdev    patched    %stdev    %improve
1x    7029.9188  2.5952    6905.0712  4.4737    -1.77595
2x    3254.2075  14.8291   3233.7137  26.8784   -0.62976

(here is the result I got from the patches; I believe there may be some small overhead from xadd etc., but overall it looks fine, though a thorough test may be needed)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 625660f..9697b45 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -46,7 +46,7 @@ static __always_inline bool static_key_false(struct static_key *key);
 
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 {
-	set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
+	set_bit(0, (volatile unsigned long *)&lock->tickets.head);
 }
 
 #else  /* !CONFIG_PARAVIRT_SPINLOCKS */
@@ -60,10 +60,30 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock,
 }
 
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
 
+static inline int __tickets_equal(__ticket_t one, __ticket_t two)
+{
+	return !((one ^ two) & ~TICKET_SLOWPATH_FLAG);
+}
+
+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
+							__ticket_t head)
+{
+	if (head & TICKET_SLOWPATH_FLAG) {
+		arch_spinlock_t old, new;
+
+		old.tickets.head = head;
+		new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
+		old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
+		new.tickets.tail = old.tickets.tail;
+
+		/* try to clear slowpath flag when there are no
[PATCH kvm-unit-test] tscdeadline_latency: add option to exit on max latency
Add option to exit in case latency exceeds a given maximum. It's useful to test host latency without the complexity of a guest.

Example command line:

echo 1 > /sys/kernel/debug/tracing/tracing_on; taskset -c 3 qemu-kvm -enable-kvm -device testdev,chardev=testlog -chardev file,id=testlog,path=msr.out -display none -serial stdio -kernel x86/tscdeadline_latency.flat -cpu host -append "20 1 120050" > /tmp/a; echo 0 > /sys/kernel/debug/tracing/tracing_on

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/x86/tscdeadline_latency.c b/x86/tscdeadline_latency.c
index 9f255c0..14ab8c8 100644
--- a/x86/tscdeadline_latency.c
+++ b/x86/tscdeadline_latency.c
@@ -9,6 +9,17 @@
  * hist(latency, 50)
  */
 
+/*
+ * for host tracing of breakmax option:
+ *
+ * # cd /sys/kernel/debug/tracing/
+ * # echo x86-tsc > trace_clock
+ * # echo "kvm_exit kvm_entry kvm_msr" > set_event
+ * # echo "sched_switch $extratracepoints" >> set_event
+ * # echo apic_timer_fn > set_ftrace_filter
+ * # echo function > current_tracer
+ */
+
 #include "libcflat.h"
 #include "apic.h"
 #include "vm.h"
@@ -37,6 +48,8 @@ int delta;
 #define TABLE_SIZE 1
 u64 table[TABLE_SIZE];
 volatile int table_idx;
+volatile int hitmax = 0;
+int breakmax = 0;
 
 static void tsc_deadline_timer_isr(isr_regs_t *regs)
 {
@@ -46,6 +59,12 @@ static void tsc_deadline_timer_isr(isr_regs_t *regs)
     if (table_idx < TABLE_SIZE && tdt_count > 1)
         table[table_idx++] = now - exptime;
 
+    if (breakmax && tdt_count > 1 && (now - exptime) > breakmax) {
+        hitmax = 1;
+        apic_write(APIC_EOI, 0);
+        return;
+    }
+
     exptime = now+delta;
     wrmsr(MSR_IA32_TSCDEADLINE, now+delta);
     apic_write(APIC_EOI, 0);
@@ -97,16 +116,21 @@ int main(int argc, char **argv)
     mask_pic_interrupts();
 
     delta = argc <= 1 ? 20 : atol(argv[1]);
-    size = argc <= 2 ? TABLE_SIZE : atol(argv[1]);
+    size = argc <= 2 ? TABLE_SIZE : atol(argv[2]);
+    breakmax = argc <= 3 ? 0 : atol(argv[3]);
     printf("breakmax=%d\n", breakmax);
     test_tsc_deadline_timer();
     irq_enable();
 
     do {
         asm volatile("hlt");
-    } while (table_idx < size);
+    } while (!hitmax && table_idx < size);
 
-    for (i = 0; i < size; i++)
+    for (i = 0; i < table_idx; i++) {
+        if (hitmax && i == table_idx-1)
+            printf("hit max: %d ", breakmax);
         printf("latency: %d\n", table[i]);
+    }
 
     return report_summary();
 }
[PATCH kvm-unit-test v2] tscdeadline_latency: add option to exit on max latency
v2: fix typo
---
Add option to exit in case latency exceeds a given maximum. It's useful to test host latency without the complexity of a guest.

Example command line:

echo 1 > /sys/kernel/debug/tracing/tracing_on; taskset -c 3 qemu-kvm -enable-kvm -device testdev,chardev=testlog -chardev file,id=testlog,path=msr.out -display none -serial stdio -kernel x86/tscdeadline_latency.flat -cpu host -append "20 1 120050" > /tmp/a; echo 0 > /sys/kernel/debug/tracing/tracing_on

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/x86/tscdeadline_latency.c b/x86/tscdeadline_latency.c
index 9f255c0..14ab8c8 100644
--- a/x86/tscdeadline_latency.c
+++ b/x86/tscdeadline_latency.c
@@ -9,6 +9,17 @@
  * hist(latency, 50)
  */
 
+/*
+ * for host tracing of breakmax option:
+ *
+ * # cd /sys/kernel/debug/tracing/
+ * # echo x86-tsc > trace_clock
+ * # echo "kvm_exit kvm_entry kvm_msr" > set_event
+ * # echo "sched_switch $extratracepoints" >> set_event
+ * # echo apic_timer_fn > set_ftrace_filter
+ * # echo function > current_tracer
+ */
+
 #include "libcflat.h"
 #include "apic.h"
 #include "vm.h"
@@ -37,6 +48,8 @@ int delta;
 #define TABLE_SIZE 1
 u64 table[TABLE_SIZE];
 volatile int table_idx;
+volatile int hitmax = 0;
+int breakmax = 0;
 
 static void tsc_deadline_timer_isr(isr_regs_t *regs)
 {
@@ -46,6 +59,12 @@ static void tsc_deadline_timer_isr(isr_regs_t *regs)
     if (table_idx < TABLE_SIZE && tdt_count > 1)
         table[table_idx++] = now - exptime;
 
+    if (breakmax && tdt_count > 1 && (now - exptime) > breakmax) {
+        hitmax = 1;
+        apic_write(APIC_EOI, 0);
+        return;
+    }
+
     exptime = now+delta;
     wrmsr(MSR_IA32_TSCDEADLINE, now+delta);
     apic_write(APIC_EOI, 0);
@@ -97,16 +116,21 @@ int main(int argc, char **argv)
     mask_pic_interrupts();
 
     delta = argc <= 1 ? 20 : atol(argv[1]);
-    size = argc <= 2 ? TABLE_SIZE : atol(argv[1]);
+    size = argc <= 2 ? TABLE_SIZE : atol(argv[2]);
+    breakmax = argc <= 3 ? 0 : atol(argv[3]);
+    printf("breakmax=%d\n", breakmax);
     test_tsc_deadline_timer();
     irq_enable();
 
     do {
         asm volatile("hlt");
-    } while (table_idx < size);
+    } while (!hitmax && table_idx < size);
 
-    for (i = 0; i < size; i++)
+    for (i = 0; i < table_idx; i++) {
+        if (hitmax && i == table_idx-1)
+            printf("hit max: %d ", breakmax);
         printf("latency: %d\n", table[i]);
+    }
 
     return report_summary();
 }
Google Summer of Code application submitted
Hi,

I have submitted our Google Summer of Code application for QEMU, libvirt, and KVM. Accepted organizations will be announced on March 2nd at 19:00 UTC on http://google-melange.com/.

You can still add project ideas if you wish to mentor:
http://qemu-project.org/Google_Summer_of_Code_2015

Mentors must be active contributors and willing to commit 5 hours per week during the summer.

Please let me know if you have any questions.

Stefan
VFIO iommu page size masking
Hi Alex,

While trying to get VFIO-PCI working on AArch64 (with 64k page size), I stumbled over the following piece of code:

static unsigned long vfio_pgsize_bitmap(struct vfio_iommu *iommu)
{
	struct vfio_domain *domain;
	unsigned long bitmap = PAGE_MASK;

	mutex_lock(&iommu->lock);
	list_for_each_entry(domain, &iommu->domain_list, next)
		bitmap &= domain->domain->ops->pgsize_bitmap;
	mutex_unlock(&iommu->lock);

	return bitmap;
}

The SMMU page mask is

[3.054302] arm-smmu e0a0.smmu: Supported page sizes: 0x40201000

but after this function we end up supporting only 2MB pages and above. The reason for that is simple: you restrict the bitmap to PAGE_MASK and above.

Now the big question is why you're doing that. I don't see why it would be a problem if the IOMMU maps a page in smaller chunks. So I tried to patch the code above with s/PAGE_MASK/1UL/ and everything seems to run fine. But maybe we're now lacking some sanity checks?

Alex
Re: arm: warning at virt/kvm/arm/vgic.c:1468
Hi Jan,

On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote:
> Hi,
>
> after fixing the VM_BUG_ON, my QEMU guest on the Jetson TK1 generally
> refuses to boot. Once in a while it does, but quickly gets stuck again.
> In one case I found this in the kernel log (never happened again so far):
>
> [ 762.022874] WARNING: CPU: 1 PID: 972 at ../arch/arm/kvm/../../../virt/kvm/arm/vgic.c:1468 kvm_vgic_sync_hwstate+0x314/0x344()
> [ 762.022884] Modules linked in:
> [ 762.022902] CPU: 1 PID: 972 Comm: qemu-system-arm Not tainted 3.19.0-rc7-00221-gfd7a168-dirty #13
> [ 762.022911] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> [ 762.022937] [<c0025d80>] (unwind_backtrace) from [<c00215d8>] (show_stack+0x10/0x14)
> [ 762.022958] [<c00215d8>] (show_stack) from [<c065f494>] (dump_stack+0x98/0xd8)
> [ 762.022976] [<c065f494>] (dump_stack) from [<c0035f38>] (warn_slowpath_common+0x80/0xb0)
> [ 762.022991] [<c0035f38>] (warn_slowpath_common) from [<c0036004>] (warn_slowpath_null+0x1c/0x24)
> [ 762.023007] [<c0036004>] (warn_slowpath_null) from [<c001c3c4>] (kvm_vgic_sync_hwstate+0x314/0x344)
> [ 762.023024] [<c001c3c4>] (kvm_vgic_sync_hwstate) from [<c00147c0>] (kvm_arch_vcpu_ioctl_run+0x210/0x400)
> [ 762.023041] [<c00147c0>] (kvm_arch_vcpu_ioctl_run) from [<c001063c>] (kvm_vcpu_ioctl+0x2e4/0x6ec)
> [ 762.023059] [<c001063c>] (kvm_vcpu_ioctl) from [<c01098a4>] (do_vfs_ioctl+0x40c/0x600)
> [ 762.023076] [<c01098a4>] (do_vfs_ioctl) from [<c0109acc>] (SyS_ioctl+0x34/0x5c)
> [ 762.023091] [<c0109acc>] (SyS_ioctl) from [<c001e320>] (ret_fast_syscall+0x0/0x34)

so this means your guest caused a maintenance interrupt and the bit is set in the GICH_EISR for the LR in question, but the link register state is not 0, which is in direct violation of the GIC spec. Hmm. You're not doing any IRQ forwarding stuff or device passthrough here, are you?

> BTW, KVM tracing support on ARM seems like it requires some care. E.g.:
> kvm_exit does not report an exit reason. The in-kernel vgic also seems to
> lack instrumentation. Unfortunate. Tracing is usually the first stop when
> KVM is stuck on a guest.

I know, the exit reason is on my todo list, and Alex B is sitting on trace patches for the gic. Coming soon to a git repo near you.

-Christoffer
Re: [PATCH] ARM: KVM: Fix size check in __coherent_cache_guest_page
On Sat, Feb 07, 2015 at 10:21:20PM +0100, Jan Kiszka wrote:
> From: Jan Kiszka jan.kis...@siemens.com
>
> The check is supposed to catch page-unaligned sizes, not the inverse.
>
> Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Thanks, applied.

-Christoffer
Re: arm: warning at virt/kvm/arm/vgic.c:1468
Hi Christoffer, On 2015-02-13 05:46, Christoffer Dall wrote: Hi Jan, On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote: Hi, after fixing the VM_BUG_ON, my QEMU guest on the Jetson TK1 generally refuses to boot. Once in a while it does, but quickly gets stuck again. In one case I found this in the kernel log (never happened again so far): [ 762.022874] WARNING: CPU: 1 PID: 972 at ../arch/arm/kvm/../../../virt/kvm/arm/vgic.c:1468 kvm_vgic_sync_hwstate+0x314/0x344() [ 762.022884] Modules linked in: [ 762.022902] CPU: 1 PID: 972 Comm: qemu-system-arm Not tainted 3.19.0-rc7-00221-gfd7a168-dirty #13 [ 762.022911] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree) [ 762.022937] [c0025d80] (unwind_backtrace) from [c00215d8] (show_stack+0x10/0x14) [ 762.022958] [c00215d8] (show_stack) from [c065f494] (dump_stack+0x98/0xd8) [ 762.022976] [c065f494] (dump_stack) from [c0035f38] (warn_slowpath_common+0x80/0xb0) [ 762.022991] [c0035f38] (warn_slowpath_common) from [c0036004] (warn_slowpath_null+0x1c/0x24) [ 762.023007] [c0036004] (warn_slowpath_null) from [c001c3c4] (kvm_vgic_sync_hwstate+0x314/0x344) [ 762.023024] [c001c3c4] (kvm_vgic_sync_hwstate) from [c00147c0] (kvm_arch_vcpu_ioctl_run+0x210/0x400) [ 762.023041] [c00147c0] (kvm_arch_vcpu_ioctl_run) from [c001063c] (kvm_vcpu_ioctl+0x2e4/0x6ec) [ 762.023059] [c001063c] (kvm_vcpu_ioctl) from [c01098a4] (do_vfs_ioctl+0x40c/0x600) [ 762.023076] [c01098a4] (do_vfs_ioctl) from [c0109acc] (SyS_ioctl+0x34/0x5c) [ 762.023091] [c0109acc] (SyS_ioctl) from [c001e320] (ret_fast_syscall+0x0/0x34) so this means your guest caused a maintenance interrupt and the bit is set in the GICH_EISR for the LR in question but the link register state is not 0, which is in direct violation of the GIC spec. H. You're not doing any IRQ forwarding stuff or device passthrough here are you? No, just boring emulation. 
The command line is qemu-system-arm -machine vexpress-a15 -kernel zImage -serial mon:stdio -append 'console=ttyAMA0 root=/dev/mmcblk0 rw' -snapshot -sd OpenSuse13-1_arm.img -dtb vexpress-v2p-ca15-tc1.dtb -s -enable-kvm BTW, KVM tracing support on ARM seems like it requires some care. E.g.: kvm_exit does not report an exit reason. The in-kernel vgic also seems to lack instrumentation. Unfortunate. Tracing is usually the first stop when KVM is stuck on a guest. I know, the exit reason is on my todo list, and Alex B is sitting on trace patches for the gic. Coming soon to a git repo near you. Cool, looking forward. Next thing I noticed is that guest debugging via qemu causes trouble in kvm mode. For some reason, qemu is unable to write soft-breakpoints, thus not even a single-step works. Also known? Jan
Re: arm: warning at virt/kvm/arm/vgic.c:1468
Alex Bennée alex.ben...@linaro.org writes: Christoffer Dall christoffer.d...@linaro.org writes: snip On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote: snip BTW, KVM tracing support on ARM seems like it requires some care. E.g.: kvm_exit does not report an exit reason. The in-kernel vgic also seems to lack instrumentation. Unfortunate. Tracing is usually the first stop when KVM is stuck on a guest. I know, the exit reason is on my todo list, and Alex B is sitting on trace patches for the gic. Coming soon to a git repo near you. For the impatient the raw patches are in: git.linaro.org/people/alex.bennee/linux.git migration/v3.19-rc7-improve-tracing OK try tracing/kvm-exit-entry for something cleaner. -- Alex Bennée -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V4] x86 spinlock: Fix memory corruption on completing completions
Paravirt spinlock clears the slowpath flag after doing unlock. As explained by Linus, currently it does:

	prev = *lock;
	add_smp(&lock->tickets.head, TICKET_LOCK_INC);

	/* add_smp() is a full mb() */

	if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
		__ticket_unlock_slowpath(lock, prev);

which is *exactly* the kind of thing you cannot do with spinlocks, because after you've done the add_smp() and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure. Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to the fastpath lock. So this patch implements the fix with: 1. Moving the slowpath flag to head (Oleg): Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, and clear it again on the first (try)lock -- this removes the write after unlock. Note that keeping the slowpath flag set would result in unnecessary kicks. By moving the slowpath flag from the tail to the head ticket we also avoid the need to access both the head and tail tickets on unlock. 2. Use xadd to avoid the read/write after unlock that checks the need for unlock_kick (Linus): We further avoid the need for a read-after-release by using xadd; the prev head value will include the slowpath flag and indicate if we need to do PV kicking of suspended spinners -- on modern chips xadd isn't (much) more expensive than an add + load.
Result:
setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)

benchmark   overcommit   %improve
kernbench   1x           -0.13
kernbench   2x            0.02
dbench      1x           -1.77
dbench      2x           -0.63

[Jeremy: hinted missing TICKET_LOCK_INC for kick] [Oleg: Moving slowpath flag to head, ticket_equals idea] [PeterZ: Detailed changelog] Reported-by: Sasha Levin sasha.le...@oracle.com Suggested-by: Linus Torvalds torva...@linux-foundation.org Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- arch/x86/include/asm/spinlock.h | 91 - arch/x86/kernel/kvm.c | 10 +++-- arch/x86/xen/spinlock.c | 10 +++-- 3 files changed, 56 insertions(+), 55 deletions(-)

potential TODO:
* Should the whole patch be split into: 1. move slowpath flag, 2. fix memory corruption in completion problem?

Changes since V3:
- Detailed changelog (PeterZ)
- Replace ACCESS_ONCE with READ_ONCE (Oleg)
- Add xen changes (Oleg)
- Correct break logic in unlock_wait() (Oleg)

Changes since V2:
- Move the slowpath flag to head; this enables xadd usage in the unlock code and in turn we can get rid of the read/write after unlock (Oleg)
- usage of ticket_equals (Oleg)

Changes since V1:
- Add missing TICKET_LOCK_INC before unlock kick (fixes hang in overcommit: Jeremy).
- Remove SLOWPATH_FLAG clearing in fast lock. (Jeremy)
- clear SLOWPATH_FLAG in arch_spin_value_unlocked during comparison.

Note: The current implementation is still based on avoiding writes after unlock; we could still have a potential invalid memory read.
(Sasha)

Result:
setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
base = 3_19_rc7, patched = 3_19_rc7_spinfix_v3

kernbench (time taken in sec, lower is better)
     base       %stdev    patched     %stdev    %improve
1x   54.2300    3.0652    54.3008     4.0366    -0.13056
2x   90.1883    5.5509    90.1650     6.4336     0.02583

dbench (throughput, higher is better)
     base       %stdev    patched     %stdev    %improve
1x   7029.9188   2.5952   6905.0712    4.4737   -1.77595
2x   3254.2075  14.8291   3233.7137   26.8784   -0.62976

(here is the result I got from the patches; I believe there may be some small overhead from xadd etc., but overall it looks fine -- a thorough test may be needed)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 625660f..646a1a3 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -46,7 +46,7 @@ static __always_inline bool static_key_false(struct static_key *key);
 static inline void __ticket_enter_slowpath(arch_spinlock_t *lock)
 {
-	set_bit(0, (volatile unsigned long *)&lock->tickets.tail);
+	set_bit(0, (volatile unsigned long *)&lock->tickets.head);
 }
 #else /* !CONFIG_PARAVIRT_SPINLOCKS */
@@ -60,10 +60,30 @@ static inline
Re: arm: warning at virt/kvm/arm/vgic.c:1468
Christoffer Dall christoffer.d...@linaro.org writes: Hi Jan, On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote: Hi, after fixing the VM_BUG_ON, my QEMU guest on the Jetson TK1 generally refuses to boot. Once in a while it does, but quickly gets stuck again. In one case I found this in the kernel log (never happened again so far): [ 762.022874] WARNING: CPU: 1 PID: 972 at ../arch/arm/kvm/../../../virt/kvm/arm/vgic.c:1468 kvm_vgic_sync_hwstate+0x314/0x344() [ 762.022884] Modules linked in: [ 762.022902] CPU: 1 PID: 972 Comm: qemu-system-arm Not tainted 3.19.0-rc7-00221-gfd7a168-dirty #13 [ 762.022911] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree) [ 762.022937] [c0025d80] (unwind_backtrace) from [c00215d8] (show_stack+0x10/0x14) [ 762.022958] [c00215d8] (show_stack) from [c065f494] (dump_stack+0x98/0xd8) [ 762.022976] [c065f494] (dump_stack) from [c0035f38] (warn_slowpath_common+0x80/0xb0) [ 762.022991] [c0035f38] (warn_slowpath_common) from [c0036004] (warn_slowpath_null+0x1c/0x24) [ 762.023007] [c0036004] (warn_slowpath_null) from [c001c3c4] (kvm_vgic_sync_hwstate+0x314/0x344) [ 762.023024] [c001c3c4] (kvm_vgic_sync_hwstate) from [c00147c0] (kvm_arch_vcpu_ioctl_run+0x210/0x400) [ 762.023041] [c00147c0] (kvm_arch_vcpu_ioctl_run) from [c001063c] (kvm_vcpu_ioctl+0x2e4/0x6ec) [ 762.023059] [c001063c] (kvm_vcpu_ioctl) from [c01098a4] (do_vfs_ioctl+0x40c/0x600) [ 762.023076] [c01098a4] (do_vfs_ioctl) from [c0109acc] (SyS_ioctl+0x34/0x5c) [ 762.023091] [c0109acc] (SyS_ioctl) from [c001e320] (ret_fast_syscall+0x0/0x34) so this means your guest caused a maintenance interrupt and the bit is set in the GICH_EISR for the LR in question but the link register state is not 0, which is in direct violation of the GIC spec. H. You're not doing any IRQ forwarding stuff or device passthrough here are you? BTW, KVM tracing support on ARM seems like it requires some care. E.g.: kvm_exit does not report an exit reason. 
The in-kernel vgic also seems to lack instrumentation. Unfortunate. Tracing is usually the first stop when KVM is stuck on a guest. I know, the exit reason is on my todo list, and Alex B is sitting on trace patches for the gic. Coming soon to a git repo near you. For the impatient the raw patches are in: git.linaro.org/people/alex.bennee/linux.git migration/v3.19-rc7-improve-tracing But I'll be cleaning the tracing ones up and separating them from the rest over the next few days. -Christoffer ___ kvmarm mailing list kvm...@lists.cs.columbia.edu https://lists.cs.columbia.edu/mailman/listinfo/kvmarm -- Alex Bennée
Re: arm: warning at virt/kvm/arm/vgic.c:1468
On Fri, Feb 13, 2015 at 07:21:20AM +0100, Jan Kiszka wrote: Hi Christoffer, On 2015-02-13 05:46, Christoffer Dall wrote: Hi Jan, On Sun, Feb 08, 2015 at 08:48:09AM +0100, Jan Kiszka wrote: Hi, after fixing the VM_BUG_ON, my QEMU guest on the Jetson TK1 generally refuses to boot. Once in a while it does, but quickly gets stuck again. In one case I found this in the kernel log (never happened again so far): [ 762.022874] WARNING: CPU: 1 PID: 972 at ../arch/arm/kvm/../../../virt/kvm/arm/vgic.c:1468 kvm_vgic_sync_hwstate+0x314/0x344() [ 762.022884] Modules linked in: [ 762.022902] CPU: 1 PID: 972 Comm: qemu-system-arm Not tainted 3.19.0-rc7-00221-gfd7a168-dirty #13 [ 762.022911] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree) [ 762.022937] [c0025d80] (unwind_backtrace) from [c00215d8] (show_stack+0x10/0x14) [ 762.022958] [c00215d8] (show_stack) from [c065f494] (dump_stack+0x98/0xd8) [ 762.022976] [c065f494] (dump_stack) from [c0035f38] (warn_slowpath_common+0x80/0xb0) [ 762.022991] [c0035f38] (warn_slowpath_common) from [c0036004] (warn_slowpath_null+0x1c/0x24) [ 762.023007] [c0036004] (warn_slowpath_null) from [c001c3c4] (kvm_vgic_sync_hwstate+0x314/0x344) [ 762.023024] [c001c3c4] (kvm_vgic_sync_hwstate) from [c00147c0] (kvm_arch_vcpu_ioctl_run+0x210/0x400) [ 762.023041] [c00147c0] (kvm_arch_vcpu_ioctl_run) from [c001063c] (kvm_vcpu_ioctl+0x2e4/0x6ec) [ 762.023059] [c001063c] (kvm_vcpu_ioctl) from [c01098a4] (do_vfs_ioctl+0x40c/0x600) [ 762.023076] [c01098a4] (do_vfs_ioctl) from [c0109acc] (SyS_ioctl+0x34/0x5c) [ 762.023091] [c0109acc] (SyS_ioctl) from [c001e320] (ret_fast_syscall+0x0/0x34) so this means your guest caused a maintenance interrupt and the bit is set in the GICH_EISR for the LR in question but the link register state is not 0, which is in direct violation of the GIC spec. H. You're not doing any IRQ forwarding stuff or device passthrough here are you? No, just boring emulation. 
The command line is qemu-system-arm -machine vexpress-a15 -kernel zImage -serial mon:stdio -append 'console=ttyAMA0 root=/dev/mmcblk0 rw' -snapshot -sd OpenSuse13-1_arm.img -dtb vexpress-v2p-ca15-tc1.dtb -s -enable-kvm BTW, KVM tracing support on ARM seems like it requires some care. E.g.: kvm_exit does not report an exit reason. The in-kernel vgic also seems to lack instrumentation. Unfortunate. Tracing is usually the first stop when KVM is stuck on a guest. I know, the exit reason is on my todo list, and Alex B is sitting on trace patches for the gic. Coming soon to a git repo near you. Cool, looking forward. Next thing I noticed is that guest debugging via qemu causes trouble in kvm mode. For some reason, qemu is unable to write soft-breakpoints, thus not even a single-step works. Also known? Yes, Alex Bennee is working on this. -Christoffer
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On 02/12, Raghavendra K T wrote:

@@ -191,8 +189,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 	 * We need to check unlocked in a loop, tmp.head == head
 	 * can be false positive because of overflow.
 	 */
-	if (tmp.head == (tmp.tail & ~TICKET_SLOWPATH_FLAG) ||
-	    tmp.head != head)
+	if (__tickets_equal(tmp.head, tmp.tail) || tmp.head != head)
 		break;

Ah, it seems that tmp.head != head should be turned into !__tickets_equal(), no? Suppose that TICKET_SLOWPATH_FLAG is set after the first ACCESS_ONCE(head), then tmp.head != head will be true before the first unlock we are waiting for. And perhaps you can turn these ACCESS_ONCE into READ_ONCE as well. Oleg.
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On 02/12/2015 07:20 PM, Oleg Nesterov wrote: On 02/12, Raghavendra K T wrote:

@@ -191,8 +189,7 @@ static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 	 * We need to check unlocked in a loop, tmp.head == head
 	 * can be false positive because of overflow.
 	 */
-	if (tmp.head == (tmp.tail & ~TICKET_SLOWPATH_FLAG) ||
-	    tmp.head != head)
+	if (__tickets_equal(tmp.head, tmp.tail) || tmp.head != head)
 		break;

Ah, it seems that tmp.head != head should be turned into !__tickets_equal(), no? Suppose that TICKET_SLOWPATH_FLAG is set after the first ACCESS_ONCE(head), then tmp.head != head will be true before the first unlock we are waiting for. Good catch. Otherwise we would wrongly break out even when somebody does a halt wait. And perhaps you can turn these ACCESS_ONCE into READ_ONCE as well. Yes again :)
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
Damn, sorry for noise, forgot to mention... On 02/12, Raghavendra K T wrote:

+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
+						     __ticket_t head)
+{
+	if (head & TICKET_SLOWPATH_FLAG) {
+		arch_spinlock_t old, new;
+
+		old.tickets.head = head;
+		new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
+		old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
+		new.tickets.tail = old.tickets.tail;
+
+		/* try to clear slowpath flag when there are no contenders */
+		cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
+	}
+}
...
+clear_slowpath:
+	if (TICKET_SLOWPATH_FLAG)
+		__ticket_check_and_clear_slowpath(lock, inc.head);

I think you can remove this if (TICKET_SLOWPATH_FLAG) check. If it is defined as zero, gcc should optimize out the code under if (head & 0). But this is purely cosmetic and subjective, feel free to ignore. Oleg.
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On 02/12/2015 07:07 PM, Oleg Nesterov wrote: On 02/12, Raghavendra K T wrote:

@@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	 * check again make sure it didn't become free while
 	 * we weren't looking.
 	 */
-	if (ACCESS_ONCE(lock->tickets.head) == want) {
+	head = ACCESS_ONCE(lock->tickets.head);
+	if (__tickets_equal(head, want)) {
 		add_stats(TAKEN_SLOW_PICKUP, 1);

While at it, perhaps it makes sense to s/ACCESS_ONCE/READ_ONCE/ but this is cosmetic. yes.. will do. We also need to change another user of enter_slow_path, xen_lock_spinning() in arch/x86/xen/spinlock.c. Had in mind but forgot before sending. Will update and resend. Other than that looks correct at first glance... but this is up to maintainers. Thanks
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On 02/12/2015 07:32 PM, Oleg Nesterov wrote: Damn, sorry for noise, forgot to mention... On 02/12, Raghavendra K T wrote:

+static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock,
+						     __ticket_t head)
+{
+	if (head & TICKET_SLOWPATH_FLAG) {
+		arch_spinlock_t old, new;
+
+		old.tickets.head = head;
+		new.tickets.head = head & ~TICKET_SLOWPATH_FLAG;
+		old.tickets.tail = new.tickets.head + TICKET_LOCK_INC;
+		new.tickets.tail = old.tickets.tail;
+
+		/* try to clear slowpath flag when there are no contenders */
+		cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
+	}
+}
...
+clear_slowpath:
+	if (TICKET_SLOWPATH_FLAG)
+		__ticket_check_and_clear_slowpath(lock, inc.head);

I think you can remove this if (TICKET_SLOWPATH_FLAG) check. If it is defined as zero, gcc should optimize out the code under if (head & 0). Right, the above if () is unnecessary; though we would have the same code at the end, getting rid of it makes the code cleaner.
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On 02/12, Raghavendra K T wrote:

@@ -772,7 +773,8 @@ __visible void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want)
 	 * check again make sure it didn't become free while
 	 * we weren't looking.
 	 */
-	if (ACCESS_ONCE(lock->tickets.head) == want) {
+	head = ACCESS_ONCE(lock->tickets.head);
+	if (__tickets_equal(head, want)) {
 		add_stats(TAKEN_SLOW_PICKUP, 1);

While at it, perhaps it makes sense to s/ACCESS_ONCE/READ_ONCE/ but this is cosmetic. We also need to change another user of enter_slow_path, xen_lock_spinning() in arch/x86/xen/spinlock.c. Other than that looks correct at first glance... but this is up to maintainers. Oleg.
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On Thu, Feb 12, 2015 at 05:17:27PM +0530, Raghavendra K T wrote: Paravirt spinlock clears slowpath flag after doing unlock. As explained by Linus currently it does:

	prev = *lock;
	add_smp(&lock->tickets.head, TICKET_LOCK_INC);

	/* add_smp() is a full mb() */

	if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
		__ticket_unlock_slowpath(lock, prev);

which is *exactly* the kind of thing you cannot do with spinlocks, because after you've done the add_smp() and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure. Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to fastpath lock. So this patch implements the fix with: 1. Moving slowpath flag to head (Oleg). 2. Use xadd to avoid read/write after unlock that checks the need for unlock_kick (Linus). Maybe spend a few more words explaining these things; something like: Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, as long as we clear it again on the first (try)lock -- this removes the write after unlock. By moving the slowpath flag from the tail to the head ticket we avoid the need to access both the head and tail tickets on unlock. We can further avoid the need for a read-after-release by using xadd; the prev head value will include the slowpath flag and indicate if we need to do PV kicking of suspended spinners -- on modern chips xadd isn't (much) more expensive than an add + load. It's 'obvious' now, but maybe not so much after we've all not looked at this for a few months.
Re: [PATCH V3] x86 spinlock: Fix memory corruption on completing completions
On 02/12/2015 08:30 PM, Peter Zijlstra wrote: On Thu, Feb 12, 2015 at 05:17:27PM +0530, Raghavendra K T wrote: [...] Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to fastpath lock. So this patch implements the fix with: 1. Moving slowpath flag to head (Oleg). 2. Use xadd to avoid read/write after unlock that checks the need for unlock_kick (Linus). Maybe spend a few more words explaining these things; something like: Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, as long as we clear it again on the first (try)lock -- this removes the write after unlock. Nit: I'll reword a bit here, since the slowpath flag would result in an unnecessary kick but is otherwise harmless IMO. Something like: Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, and clear it again on the first (try)lock -- this removes the write after unlock. Note that keeping the slowpath flag would result in unnecessary kicks. By moving the slowpath flag from the tail to the head ticket we avoid the need to access both the head and tail tickets on unlock. We can further avoid the need for a read-after-release by using xadd; the prev head value will include the slowpath flag and indicate if we need to do PV kicking of suspended spinners -- on modern chips xadd isn't (much) more expensive than an add + load. It's 'obvious' now, but maybe not so much after we've all not looked at this for a few months. Absolutely correct. Thanks Peter for the detailed and very helpful writeup.
Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote: If exception_enter happens when already in IN_KERNEL state, the code still calls context_tracking_exit, which ends up in rcu_eqs_exit_common, which explodes with a WARN_ON when it is called in a situation where dynticks are not enabled. Fortunately context_tracking_exit() already has a current_state == IN_KERNEL check so this shouldn't be a problem. Meanwhile I'll still take the patch, it's better to handle that from the caller. Thanks. This can be avoided by having exception_enter only switch to IN_KERNEL state if the current state is not already IN_KERNEL. Signed-off-by: Rik van Riel r...@redhat.com Reported-by: Luiz Capitulino lcapitul...@redhat.com --- Frederic, you will want this bonus patch, too :) Thanks to Luiz for finding this one. Whatever I was running did not trigger this issue... include/linux/context_tracking.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index b65fd1420e53..9da230406e8c 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -37,7 +37,8 @@ static inline enum ctx_state exception_enter(void)
 		return 0;
 	prev_ctx = this_cpu_read(context_tracking.state);
-	context_tracking_exit(prev_ctx);
+	if (prev_ctx != IN_KERNEL)
+		context_tracking_exit(prev_ctx);
 	return prev_ctx;
 }
Re: [PATCH -v5 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest
On Tue, Feb 10, 2015 at 03:27:49PM -0500, r...@redhat.com wrote: When running a KVM guest on a system with NOHZ_FULL enabled, and the KVM guest running with idle=poll mode, we still get wakeups of the rcuos/N threads. This problem has already been solved for user space by telling the RCU subsystem that the CPU is in an extended quiescent state while running user space code. This patch series extends that code a little bit to make it usable to track KVM guest space, too. I tested the code by booting a KVM guest with idle=poll, on a system with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken up several times, while the rcuos thread on the CPU running the bound and always running VCPU thread never got woken up once. Thanks to Christian Borntraeger, Paul McKenney, Paolo Bonzini, Frederic Weisbecker, and Will Deacon for reviewing and improving earlier versions of this patch series. So the patchset looks good, thanks everyone. I'm applying the series and will see if anything explodes. I'll wait until the end of the merge window to push it to Ingo. Thanks.
[GIT PULL] KVM changes for 3.20, except PPC
Linus, The following changes since commit 056bb5f51c357ee00046fde4929a03468ff45e7a: arm64: kvm: decode ESR_ELx.EC when reporting exceptions (2015-01-15 12:24:52 +) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus for you to fetch changes up to 6557bada461afeaa920a189fae2cff7c8fdce39f: KVM: ia64: drop kvm.h from installed user headers (2015-02-10 09:02:10 +0100) I placed the correct conflict resolution in branch merge-resolution-3.20 of the KVM repository (based on commit 8cc748aa76c9 from your tree). In hindsight, I probably should have just delayed the last batch of ARM/ARM64 patches to the 3.20 merge window. The KVM/PPC changes will come in through the PPC maintainers, because I haven't received them yet and I might end up being offline for some part of next week. Fairly small update, but there are some interesting new features.

Common: Optional support for adding a small amount of polling on each HLT instruction executed in the guest (or equivalent for other architectures). This can improve latency up to 50% on some scenarios (e.g. O_DSYNC writes or TCP_RR netperf tests). This also has to be enabled manually for now, but the plan is to auto-tune this in the future.

ARM/ARM64: the highlights are support for GICv3 emulation and dirty page tracking.

s390: several optimizations and bugfixes. Also a first: a feature exposed by KVM (UUID and long guest name in /proc/sysinfo) before it is available in IBM's hypervisor! :)

MIPS: Bugfixes.

x86: Support for PML (page modification logging, a new feature in Broadwell Xeons that speeds up dirty page tracking), nested virtualization improvements (nested APICv---a nice optimization), usual round of emulation fixes. There is also a new option to reduce latency of the TSC deadline timer in the guest; this needs to be tuned manually.

Some commits are common between this pull and Catalin's; I see you have already included his tree.
ARM has other conflicts where functions are added in the same place by 3.19-rc and 3.20 patches. These are not large though, and entirely within KVM. Andre Przywara (21): ARM: KVM: extend WFI tracepoint to differentiate between wfi and wfe arm/arm64: KVM: rework MPIDR assignment and add accessors arm/arm64: KVM: pass down user space provided GIC type into vGIC code arm/arm64: KVM: refactor vgic_handle_mmio() function arm/arm64: KVM: wrap 64 bit MMIO accesses with two 32 bit ones arm/arm64: KVM: introduce per-VM ops arm/arm64: KVM: move kvm_register_device_ops() into vGIC probing arm/arm64: KVM: dont rely on a valid GICH base address arm/arm64: KVM: make the maximum number of vCPUs a per-VM value arm/arm64: KVM: make the value of ICC_SRE_EL1 a per-VM variable arm/arm64: KVM: refactor MMIO accessors arm/arm64: KVM: refactor/wrap vgic_set/get_attr() arm/arm64: KVM: add vgic.h header file arm/arm64: KVM: split GICv2 specific emulation code from vgic.c arm/arm64: KVM: add opaque private pointer to MMIO data arm/arm64: KVM: add virtual GICv3 distributor emulation arm64: GICv3: introduce symbolic names for GICv3 ICC_SGI1R_EL1 fields arm64: KVM: add SGI generation register emulation arm/arm64: KVM: enable kernel side of GICv3 emulation arm/arm64: KVM: allow userland to request a virtual GICv3 arm/arm64: KVM: force alignment of VGIC dist/CPU/redist addresses Borislav Petkov (1): kvm: Fix CR3_PCID_INVD type on 32-bit Christian Borntraeger (4): KVM: s390: make local function static KVM: s390: no need to hold the kvm-mutex for floating interrupts KVM: s390: reenable LPP facility KVM: Disable compat ioctl for s390 Christoffer Dall (2): arm/arm64: KVM: Fixup incorrect config symbol in comment KVM: Remove unused config symbol David Hildenbrand (13): KVM: s390: prevent sleep duration underflows in handle_wait() KVM: s390: base hrtimer on a monotonic clock KVM: s390: forward hrtimer if guest ckc not pending yet KVM: s390: new parameter for SIGP STOP irqs KVM: s390: handle stop irqs 
without action_bits KVM: s390: a VCPU may only stop when no interrupts are left pending KVM: s390: SIGP SET PREFIX cleanup s390/sclp: introduce check for the SIGP Interpretation Facility KVM: s390: only one external call may be pending at a time KVM: s390: clear the pfault queue if user space sets the invalid token KVM: s390: forward most SIGP orders to user space KVM: s390: avoid memory leaks if __inject_vm() fails KVM: s390: floating irqs: fix user triggerable endless loop Dominik Dingel (3): KVM: remove unneeded return value of vcpu_postcreate KVM: s390: move vcpu specific initalization to a later point
Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On 02/12/2015 10:42 AM, Frederic Weisbecker wrote: On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote: If exception_enter happens when already in IN_KERNEL state, the code still calls context_tracking_exit, which ends up in rcu_eqs_exit_common, which explodes with a WARN_ON when it is called in a situation where dynticks are not enabled. Fortunately context_tracking_exit() already has a current_state == IN_KERNEL check so this shouldn't be a problem. No, it had a hard-coded current_state == IN_USER check, which is very close, but ... ... I replaced that with a state argument, and forgot to ensure that it never gets called with state == IN_KERNEL. This patch fixes that. Meanwhile I'll still take the patch, it's better to handle that from the caller. Thanks. -- All rights reversed
Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
On 02/11, Jeremy Fitzhardinge wrote:
> On 02/11/2015 09:24 AM, Oleg Nesterov wrote:
>> I agree, and I have to admit I am not sure I fully understand why
>> unlock uses the locked add. Except we need a barrier to avoid the race
>> with the enter_slowpath() users, of course. Perhaps this is the only
>> reason?
>
> Right now it needs to be a locked operation to prevent read-reordering.
> x86 memory ordering rules state that all writes are seen in a globally
> consistent order, and are globally ordered wrt reads *on the same
> addresses*, but reads to different addresses can be reordered wrt to
> writes.
>
> So, if the unlocking add were not a locked operation:
>
>         __add(&lock->tickets.head, TICKET_LOCK_INC);    /* not locked */
>
>         if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
>                 __ticket_unlock_slowpath(lock, prev);
>
> then the read of lock->tickets.tail can be reordered before the unlock,
> which introduces a race:

Yes, yes, thanks, but this is what I meant. We need a barrier. Even if
"every store is a release" as Linus mentioned.

> This *might* be OK, but I think it's on dubious ground:
>
>         __add(&lock->tickets.head, TICKET_LOCK_INC);    /* not locked */
>
>         /* read overlaps write, and so is ordered */
>         if (unlikely(lock->head_tail & (TICKET_SLOWPATH_FLAG << TICKET_SHIFT)))
>                 __ticket_unlock_slowpath(lock, prev);
>
> because I think Intel and AMD differed in interpretation about how
> overlapping but different-sized reads / writes are ordered (or it
> simply isn't architecturally defined).

I can't comment, I simply do not know how the hardware works.

> If the slowpath flag is moved to head, then it would always have to be
> locked anyway, because it needs to be atomic against other CPUs' RMW
> operations setting the flag.

Yes, this is true. But again, if we want to avoid the read-after-unlock,
we need to update this lock and read SLOWPATH atomically, it seems that
we can't avoid the locked insn.

Oleg.
Re: [PATCH -v5 6/5] context_tracking: fix exception_enter when already in IN_KERNEL
On Thu, Feb 12, 2015 at 10:47:10AM -0500, Rik van Riel wrote:
> On 02/12/2015 10:42 AM, Frederic Weisbecker wrote:
>> On Wed, Feb 11, 2015 at 02:43:19PM -0500, Rik van Riel wrote:
>>> If exception_enter happens when already in IN_KERNEL state, the code
>>> still calls context_tracking_exit, which ends up in
>>> rcu_eqs_exit_common, which explodes with a WARN_ON when it is called
>>> in a situation where dynticks are not enabled.
>>
>> Fortunately context_tracking_exit() already has a current_state ==
>> IN_KERNEL check so this shouldn't be a problem.
>
> No, it had a hard-coded current_state == IN_USER check, which is very
> close, but ...
>
> ... I replaced that with a state argument, and forgot to ensure that it
> never gets called with state == IN_KERNEL. This patch fixes that.

Ah that's right! Well I'm going to merge this patch to 1/5 then to avoid
breaking bisection.

Thanks.