Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 05/03/2015 10:55 PM, Juergen Gross wrote: I did a small measurement of the pure locking functions on bare metal without and with my patches. spin_lock() for the first time (lock and code not in cache) dropped from about 600 to 500 cycles. spin_unlock() for first time dropped from 145 to 87 cycles. spin_lock() in a loop dropped from 48 to 45 cycles. spin_unlock() in the same loop dropped from 24 to 22 cycles. Did you isolate icache hot/cold from dcache hot/cold? It seems to me the main difference will be whether the branch predictor is warmed up rather than whether the lock itself is in dcache, but it's much more likely that the lock code is in icache if the code is lock-intensive, making the cold case moot. But that's pure speculation. Could you see any differences in workloads beyond microbenchmarks? Not that it's my call at all, but I think we'd need to see some concrete improvements in real workloads before adding the complexity of more pvops. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 04/30/2015 03:53 AM, Juergen Gross wrote: Paravirtualized spinlocks produce some overhead even if the kernel is running on bare metal. The main reasons are the more complex locking and unlocking functions. Especially unlocking is no longer just one instruction but so complex that it is no longer inlined. This patch series addresses this issue by adding two more pvops functions to reduce the size of the inlined spinlock functions. When running on bare metal unlocking is again basically one instruction. Out of curiosity, is there a measurable difference? J
Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
On 02/11/2015 09:24 AM, Oleg Nesterov wrote: I agree, and I have to admit I am not sure I fully understand why unlock uses the locked add. Except we need a barrier to avoid the race with the enter_slowpath() users, of course. Perhaps this is the only reason? Right now it needs to be a locked operation to prevent read-reordering. x86 memory ordering rules state that all writes are seen in a globally consistent order, and are globally ordered wrt reads *on the same addresses*, but reads to different addresses can be reordered wrt writes. So, if the unlocking add were not a locked operation: __add(&lock->tickets.head, TICKET_LOCK_INC); /* not locked */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); Then the read of lock->tickets.tail can be reordered before the unlock, which introduces a race: /* read reordered here */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) /* false */ /* ... */; /* other CPU sets SLOWPATH and blocks */ __add(&lock->tickets.head, TICKET_LOCK_INC); /* not locked */ /* other CPU hung */ So it doesn't *have* to be a locked operation. This should also work: __add(&lock->tickets.head, TICKET_LOCK_INC); /* not locked */ lfence(); /* prevent read reordering */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); but in practice a locked add is cheaper than an lfence (or at least was). This *might* be OK, but I think it's on dubious ground: __add(&lock->tickets.head, TICKET_LOCK_INC); /* not locked */ /* read overlaps write, and so is ordered */ if (unlikely(lock->head_tail & (TICKET_SLOWPATH_FLAG << TICKET_SHIFT))) __ticket_unlock_slowpath(lock, prev); because I think Intel and AMD differed in interpretation about how overlapping but different-sized reads and writes are ordered (or it simply isn't architecturally defined).
If the slowpath flag is moved to head, then it would always have to be locked anyway, because it needs to be atomic against other CPUs' RMW operations setting the flag. J
Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
On 02/11/2015 03:28 PM, Linus Torvalds wrote: On Feb 11, 2015 3:15 PM, Jeremy Fitzhardinge jer...@goop.org wrote: Right now it needs to be a locked operation to prevent read-reordering. x86 memory ordering rules state that all writes are seen in a globally consistent order, and are globally ordered wrt reads *on the same addresses*, but reads to different addresses can be reordered wrt writes. The modern x86 rules are actually much tighter than that. Every store is a release, and every load is an acquire. So a non-atomic store is actually a perfectly fine unlock. All preceding stores will be seen by other cpus before the unlock, and while reads can pass stores, they only pass *earlier* stores. Right, so in this particular instance, the read of the SLOWPATH flag *can* pass the previous unlock store, hence the need for an atomic unlock or some other mechanism to prevent the read from being reordered. J
Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
On 02/10/2015 05:26 AM, Oleg Nesterov wrote: On 02/10, Raghavendra K T wrote: On 02/10/2015 06:23 AM, Linus Torvalds wrote: add_smp(&lock->tickets.head, TICKET_LOCK_INC); if (READ_ONCE(lock->tickets.tail) & TICKET_SLOWPATH_FLAG) .. into something like val = xadd(&lock->ticket.head_tail, TICKET_LOCK_INC << TICKET_SHIFT); if (unlikely(val & TICKET_SLOWPATH_FLAG)) ... would be the right thing to do. Somebody should just check that I got that shift right, and that the tail is in the high bytes (head really needs to be high to work, if it's in the low byte(s) the xadd would overflow from head into tail which would be wrong). Unfortunately xadd could result in head overflow as tail is high. The other option was repeated cmpxchg which is bad I believe. Any suggestions? Stupid question... what if we simply move SLOWPATH from .tail to .head? In this case arch_spin_unlock() could do xadd(tickets.head) and check the result. Well, right now, tail is manipulated by locked instructions by CPUs who are contending for the ticketlock, but head can be manipulated unlocked by the CPU which currently owns the ticketlock. If SLOWPATH moved into head, then non-owner CPUs would be touching head, requiring everyone to use locked instructions on it. That's the theory, but I don't see much (any?) code which depends on that. Ideally we could find a way so that pv ticketlocks could use a plain unlocked add for the unlock like the non-pv case, but I just don't see a way to do it. In this case __ticket_check_and_clear_slowpath() really needs to cmpxchg the whole .head_tail. Plus obviously more boring changes. This needs a separate patch even _if_ this can work. Definitely. BTW. If we move clear slowpath into the lock path, then probably trylock should be changed too? Something like below, we just need to clear SLOWPATH before cmpxchg. How important / widely used is trylock these days? J Oleg.
--- x/arch/x86/include/asm/spinlock.h
+++ x/arch/x86/include/asm/spinlock.h
@@ -109,7 +109,8 @@ static __always_inline int arch_spin_try
 	if (old.tickets.head != (old.tickets.tail & ~TICKET_SLOWPATH_FLAG))
 		return 0;
-	new.head_tail = old.head_tail + (TICKET_LOCK_INC << TICKET_SHIFT);
+	new.tickets.head = old.tickets.head;
+	new.tickets.tail = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) + TICKET_LOCK_INC;
 	/* cmpxchg is a full barrier, so nothing can move before it */
 	return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail;
Re: [PATCH] x86 spinlock: Fix memory corruption on completing completions
On 02/06/2015 06:49 AM, Raghavendra K T wrote: Paravirt spinlock clears slowpath flag after doing unlock. As explained by Linus currently it does: prev = *lock; add_smp(&lock->tickets.head, TICKET_LOCK_INC); /* add_smp() is a full mb() */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); which is *exactly* the kind of things you cannot do with spinlocks, because after you've done the add_smp() and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure. Yeah, that's an embarrassingly obvious bug in retrospect. Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to fastpath lock. Yep, that seems like a sound approach. However it brings an additional case to be handled, viz., slowpath still could be set when somebody does arch_trylock. Handle that too by ignoring slowpath flag during lock availability check. Reported-by: Sasha Levin sasha.le...@oracle.com Suggested-by: Linus Torvalds torva...@linux-foundation.org Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- arch/x86/include/asm/spinlock.h | 70 - 1 file changed, 34 insertions(+), 36 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 625660f..0829f86 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -49,6 +49,23 @@ static inline void __ticket_enter_slowpath(arch_spinlock_t *lock) set_bit(0, (volatile unsigned long *)&lock->tickets.tail); } +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock) +{ + arch_spinlock_t old, new; + __ticket_t diff; + + old.tickets = READ_ONCE(lock->tickets); Couldn't the caller pass in the lock state that it read rather than re-reading it?
+ diff = (old.tickets.tail & ~TICKET_SLOWPATH_FLAG) - old.tickets.head; + + /* try to clear slowpath flag when there are no contenders */ + if ((old.tickets.tail & TICKET_SLOWPATH_FLAG) && (diff == TICKET_LOCK_INC)) { + new = old; + new.tickets.tail &= ~TICKET_SLOWPATH_FLAG; + cmpxchg(&lock->head_tail, old.head_tail, new.head_tail); + } +} + #else /* !CONFIG_PARAVIRT_SPINLOCKS */ static __always_inline void __ticket_lock_spinning(arch_spinlock_t *lock, __ticket_t ticket) @@ -59,6 +76,10 @@ static inline void __ticket_unlock_kick(arch_spinlock_t *lock, { } +static inline void __ticket_check_and_clear_slowpath(arch_spinlock_t *lock) +{ +} + #endif /* CONFIG_PARAVIRT_SPINLOCKS */ static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock) @@ -84,7 +105,7 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock) register struct __raw_tickets inc = { .tail = TICKET_LOCK_INC }; inc = xadd(&lock->tickets, inc); - if (likely(inc.head == inc.tail)) + if (likely(inc.head == (inc.tail & ~TICKET_SLOWPATH_FLAG))) The intent of this conditional was to be the quickest possible path when taking a fastpath lock, with the code below being used for all slowpath locks (free or taken). So I don't think masking out SLOWPATH_FLAG is necessary here. goto out; inc.tail &= ~TICKET_SLOWPATH_FLAG; @@ -98,7 +119,10 @@ static __always_inline void arch_spin_lock(arch_spinlock_t *lock) } while (--count); __ticket_lock_spinning(lock, inc.tail); } -out: barrier(); /* make sure nothing creeps before the lock is taken */ +out: + __ticket_check_and_clear_slowpath(lock); + + barrier(); /* make sure nothing creeps before the lock is taken */ Which means that if the goto out path is only ever used for fastpath locks, you can limit calling __ticket_check_and_clear_slowpath() to the slowpath case.
} static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) @@ -115,47 +139,21 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) return cmpxchg(&lock->head_tail, old.head_tail, new.head_tail) == old.head_tail; } -static inline void __ticket_unlock_slowpath(arch_spinlock_t *lock, - arch_spinlock_t old) -{ - arch_spinlock_t new; - - BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS); - - /* Perform the unlock on the before copy */ - old.tickets.head += TICKET_LOCK_INC; NB (see below) - - /* Clear the slowpath flag */ - new.head_tail = old.head_tail & ~(TICKET_SLOWPATH_FLAG << TICKET_SHIFT); - - /* - * If the lock is uncontended, clear the flag - use cmpxchg in - * case it changes behind our back
Re: [PATCH delta V13 14/14] kvm : Paravirtual ticketlocks support for linux guests running on KVM hypervisor
On 08/13/2013 01:02 PM, Raghavendra K T wrote: * Ingo Molnar mi...@kernel.org [2013-08-13 18:55:52]: Would be nice to have a delta fix patch against tip:x86/spinlocks, which I'll then backmerge into that series via rebasing it. There was a namespace collision of the PER_CPU lock_waiting variable when we have both Xen and KVM enabled. Perhaps this week wasn't for me. Had run 100 times randconfig in a loop for the fix sent earlier :(. Ingo, below delta patch should fix it. IIRC, I hope you will be folding this back into patch 14/14 itself. Else please let me know. I have already run allnoconfig, allyesconfig, randconfig with the below patch. But will test again. This should apply on top of tip:x86/spinlocks. ---8--- From: Raghavendra K T raghavendra...@linux.vnet.ibm.com Fix namespace collision for lock_waiting Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index d442471..b8ef630 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -673,7 +673,7 @@ struct kvm_lock_waiting { static cpumask_t waiting_cpus; /* Track spinlock on which a cpu is waiting */ -static DEFINE_PER_CPU(struct kvm_lock_waiting, lock_waiting); +static DEFINE_PER_CPU(struct kvm_lock_waiting, klock_waiting); Has static stopped meaning static?
J static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want) { @@ -685,7 +685,7 @@ static void kvm_lock_spinning(struct arch_spinlock *lock, __ticket_t want) if (in_nmi()) return; - w = &__get_cpu_var(lock_waiting); + w = &__get_cpu_var(klock_waiting); cpu = smp_processor_id(); start = spin_time_start(); @@ -756,7 +756,7 @@ static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) add_stats(RELEASED_SLOW, 1); for_each_cpu(cpu, waiting_cpus) { - const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu); + const struct kvm_lock_waiting *w = &per_cpu(klock_waiting, cpu); if (ACCESS_ONCE(w->lock) == lock && ACCESS_ONCE(w->want) == ticket) { add_stats(RELEASED_SLOW_KICKED, 1);
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On 06/01/2013 01:14 PM, Andi Kleen wrote: FWIW I use the paravirt spinlock ops for adding lock elision to the spinlocks. Does lock elision still use the ticketlock algorithm/structure, or are they different? If they're still basically ticketlocks, then it seems to me that they're complementary - hle handles the fastpath, and pv the slowpath. This needs to be done at the top level (so the level you're removing) However I don't like the pv mechanism very much and would be fine with using a static key hook in the main path like I do for all the other lock types. Right. J
Re: [PATCH RFC V9 1/19] x86/spinlock: Replace pv spinlocks with pv ticketlocks
On 06/01/2013 12:21 PM, Raghavendra K T wrote: x86/spinlock: Replace pv spinlocks with pv ticketlocks From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com I'm not sure what the etiquette is here; I did the work while at Citrix, but jer...@goop.org is my canonical email address. The Citrix address is dead and bounces, so is useless for anything. Probably best to change it. J Rather than outright replacing the entire spinlock implementation in order to paravirtualize it, keep the ticket lock implementation but add a couple of pvops hooks on the slow path (long spin on lock, unlocking a contended lock). Ticket locks have a number of nice properties, but they also have some surprising behaviours in virtual environments. They enforce a strict FIFO ordering on cpus trying to take a lock; however, if the hypervisor scheduler does not schedule the cpus in the correct order, the system can waste a huge amount of time spinning until the next cpu can take the lock. (See Thomas Friebel's talk Prevent Guests from Spinning Around http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.) To address this, we add two hooks: - __ticket_lock_spinning, which is called after the cpu has been spinning on the lock for a significant number of iterations but has failed to take the lock (presumably because the cpu holding the lock has been descheduled). The lock_spinning pvop is expected to block the cpu until it has been kicked by the current lock holder. - __ticket_unlock_kick, which on releasing a contended lock (there are more cpus with tail tickets), looks to see if the next cpu is blocked and wakes it if so. When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub functions causes all the extra code to go away.
Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Reviewed-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com Tested-by: Attilio Rao attilio@citrix.com [ Raghavendra: Changed SPIN_THRESHOLD ] Signed-off-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com --- arch/x86/include/asm/paravirt.h | 32 arch/x86/include/asm/paravirt_types.h | 10 ++ arch/x86/include/asm/spinlock.h | 53 +++-- arch/x86/include/asm/spinlock_types.h |4 -- arch/x86/kernel/paravirt-spinlocks.c | 15 + arch/x86/xen/spinlock.c |8 - 6 files changed, 61 insertions(+), 61 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index cfdc9ee..040e72d 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -712,36 +712,16 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) -static inline int arch_spin_is_locked(struct arch_spinlock *lock) +static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, + __ticket_t ticket) { - return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock); + PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket); } -static inline int arch_spin_is_contended(struct arch_spinlock *lock) +static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock, + __ticket_t ticket) { - return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock); -} -#define arch_spin_is_contended arch_spin_is_contended - -static __always_inline void arch_spin_lock(struct arch_spinlock *lock) -{ - PVOP_VCALL1(pv_lock_ops.spin_lock, lock); -} - -static __always_inline void arch_spin_lock_flags(struct arch_spinlock *lock, - unsigned long flags) -{ - PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags); -} - -static __always_inline int arch_spin_trylock(struct arch_spinlock *lock) -{ - return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock); -} - -static __always_inline void arch_spin_unlock(struct arch_spinlock *lock) -{ - 
PVOP_VCALL1(pv_lock_ops.spin_unlock, lock); + PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } #endif diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 0db1fca..d5deb6d 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -327,13 +327,11 @@ struct pv_mmu_ops { }; struct arch_spinlock; +#include asm/spinlock_types.h + struct pv_lock_ops { - int (*spin_is_locked)(struct arch_spinlock *lock); - int (*spin_is_contended)(struct arch_spinlock *lock); - void (*spin_lock)(struct arch_spinlock *lock); - void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long flags); - int (*spin_trylock)(struct arch_spinlock *lock); - void (*spin_unlock
Re: [PATCH 09/14] xen: events: Remove redundant check on unsigned variable
To be honest I'd nack this kind of patch. The test is only redundant in the most trivial sense that the compiler can easily optimise it away. The point of the test is to make sure that the range is OK even if the type subsequently becomes signed (to hold a -ve error, for example). J Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Fri, Nov 16, 2012 at 12:20:41PM +0530, Tushar Behera wrote: No need to check whether an unsigned variable is less than 0. CC: Konrad Rzeszutek Wilk konrad.w...@oracle.com Acked-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com CC: Jeremy Fitzhardinge jer...@goop.org CC: xen-de...@lists.xensource.com CC: virtualization@lists.linux-foundation.org Signed-off-by: Tushar Behera tushar.beh...@linaro.org --- drivers/xen/events.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/xen/events.c b/drivers/xen/events.c index 4293c57..cadd7d1 100644 --- a/drivers/xen/events.c +++ b/drivers/xen/events.c @@ -216,7 +216,7 @@ static void xen_irq_info_pirq_init(unsigned irq, */ static unsigned int evtchn_from_irq(unsigned irq) { -if (unlikely(WARN(irq < 0 || irq >= nr_irqs, "Invalid irq %d!\n", irq))) +if (unlikely(WARN(irq >= nr_irqs, "Invalid irq %d!\n", irq))) return 0; return info_for_irq(irq)->evtchn; -- 1.7.4.1 -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/13/2012 11:45 AM, Raghavendra K T wrote: On 05/07/2012 08:22 PM, Avi Kivity wrote: I could not come up with pv-flush results (also Nikunj had clarified that the result was on non-PLE) I'd like to see those numbers, then. Ingo, please hold on the kvm-specific patches, meanwhile. 3 guests 8GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20), others for cpuhog (shell script with while true do hackbench). 1x: no hogs 2x: 8 hogs in one guest 3x: 8 hogs each in two guests. kernbench on PLE: Machine: IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU with 32 cores, with 8 online cpus and 4*64GB RAM. The average is taken over 4 iterations with 3 runs each (4*3=12), and stdev is calculated over the mean reported in each run.

A): 8 vcpu guest
             BASE                   BASE+patch              %improvement w.r.t.
             mean (sd)              mean (sd)               patched kernel time
case 1*1x:   61.7075 (1.17872)      60.93 (1.475625)        1.27605
case 1*2x:   107.2125 (1.3821349)   97.506675 (1.3461878)   9.95401
case 1*3x:   144.3515 (1.8203927)   138.9525 (0.58309319)   3.8855

B): 16 vcpu guest
             BASE                   BASE+patch              %improvement w.r.t.
             mean (sd)              mean (sd)               patched kernel time
case 2*1x:   70.524 (1.5941395)     69.68866 (1.9392529)    1.19867
case 2*2x:   133.0738 (1.4558653)   124.8568 (1.4544986)    6.58114
case 2*3x:   206.0094 (1.3437359)   181.4712 (2.9134116)    13.5218

C): 32 vcpu guest
             BASE                   BASE+patch              %improvement w.r.t.
             mean (sd)              mean (sd)               patched kernel time
case 4*1x:   100.61046 (2.7603485)  85.48734 (2.6035035)    17.6905

What does the 4*1x notation mean? Do these workloads have overcommit of the PCPU resources? When I measured it, even quite small amounts of overcommit led to large performance drops with non-pv ticket locks (on the order of 10% improvements when there were 5 busy VCPUs on a 4 cpu system). I never tested it on larger machines, but I guess that represents around 25% overcommit, or 40 busy VCPUs on a 32-PCPU system. J
Re: [PATCH RFC V8 0/17] Paravirtualized ticket spinlocks
On 05/07/2012 06:49 AM, Avi Kivity wrote: On 05/07/2012 04:46 PM, Srivatsa Vaddagiri wrote: * Raghavendra K T raghavendra...@linux.vnet.ibm.com [2012-05-07 19:08:51]: I'll get hold of a PLE mc and come up with the numbers soon. but I'll expect the improvement around 1-3% as it was in last version. Deferring preemption (when vcpu is holding lock) may give us better than 1-3% results on PLE hardware. Something worth trying IMHO. Is the improvement so low because PLE is interfering with the patch, or because PLE already does a good job? How does PLE help with ticket scheduling on unlock? I thought it would just help with the actual spin loops. J
Re: [Xen-devel] [PATCH RFC V6 0/11] Paravirtualized ticketlocks
On 04/16/2012 09:36 AM, Ian Campbell wrote: On Mon, 2012-04-16 at 16:44 +0100, Konrad Rzeszutek Wilk wrote: On Sat, Mar 31, 2012 at 09:37:45AM +0530, Srivatsa Vaddagiri wrote: * Thomas Gleixner t...@linutronix.de [2012-03-31 00:07:58]: I know that Peter is going to go berserk on me, but if we are running a paravirt guest then it's simple to provide a mechanism which allows the host (aka hypervisor) to check that in the guest just by looking at some global state. So if a guest exits due to an external event it's easy to inspect the state of that guest and avoid to schedule away when it was interrupted in a spinlock held section. That guest/host shared state needs to be modified to indicate the guest to invoke an exit when the last nested lock has been released. I had attempted something like that long back: http://lkml.org/lkml/2010/6/3/4 The issue is with ticketlocks though. VCPUs could go into a spin w/o a lock being held by anybody. Say VCPUs 1-99 try to grab a lock in that order (on a host with one cpu). VCPU1 wins (after VCPU0 releases it) and releases the lock. VCPU1 is next eligible to take the lock. If that is not scheduled early enough by host, then remaining vcpus would keep spinning (even though lock is technically not held by anybody) w/o making forward progress. In that situation, what we really need is for the guest to hint to host scheduler to schedule VCPU1 early (via yield_to or something similar). The current pv-spinlock patches however does not track which vcpu is spinning at what head of the ticketlock. I suppose we can consider that optimization in future and see how much benefit it provides (over plain yield/sleep the way its done now). Right. I think Jeremy played around with this some time? 5/11 xen/pvticketlock: Xen implementation for PV ticket locks tracks which vcpus are waiting for a lock in cpumask_t waiting_cpus and tracks which lock each is waiting for in per-cpu lock_waiting. This is used in xen_unlock_kick to kick the right CPU. 
There's a loop over only the waiting cpus to figure out who to kick. Yes, and AFAIK the KVM pv-ticketlock patches do the same thing. If a (V)CPU is asleep, then sending it a kick is pretty much equivalent to a yield to (not precisely, but it should get scheduled soon enough, and it won't be competing with a pile of VCPUs with no useful work to do). J
Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
On 01/17/2012 10:33 PM, Srivatsa Vaddagiri wrote: * Marcelo Tosatti mtosa...@redhat.com [2012-01-17 09:02:11]: +/* Kick vcpu waiting on @lock->head to reach value @ticket */ +static void kvm_unlock_kick(struct arch_spinlock *lock, __ticket_t ticket) +{ + int cpu; + int apicid; + + add_stats(RELEASED_SLOW, 1); + + for_each_cpu(cpu, waiting_cpus) { + const struct kvm_lock_waiting *w = &per_cpu(lock_waiting, cpu); + if (ACCESS_ONCE(w->lock) == lock && + ACCESS_ONCE(w->want) == ticket) { + add_stats(RELEASED_SLOW_KICKED, 1); + apicid = per_cpu(x86_cpu_to_apicid, cpu); + kvm_kick_cpu(apicid); + break; + } + } What prevents a kick from being lost here, if say, the waiter is at local_irq_save in kvm_lock_spinning, before the lock/want assignments? The waiter does check for the lock becoming available before actually sleeping: + /* + * check again make sure it didn't become free while + * we weren't looking. + */ + if (ACCESS_ONCE(lock->tickets.head) == want) { + add_stats(TAKEN_SLOW_PICKUP, 1); + goto out; + } That logic relies on the kick being level-triggered, so that a kick before block will cause the block to fall out immediately. If you're using hlt as the block and it has the usual edge-triggered behaviour, what stops a kick-before-hlt from losing the kick? J
Re: [PATCH RFC V4 4/5] kvm : pv-ticketlocks support for linux guests running on KVM hypervisor
On 01/17/2012 01:47 AM, Avi Kivity wrote: On 01/16/2012 04:13 PM, Raghavendra K T wrote: Please drop all of these and replace with tracepoints in the appropriate spots. Everything else (including the histogram) can be reconstructed from the tracepoints in userspace. I think Jeremy pointed out that tracepoints use spinlocks and hence debugfs is the option.. no? Yeah, I think you're right. What a pity. Yes, I went through the same thought process. J
Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
On 01/16/2012 09:24 PM, Alexander Graf wrote: This is true in case you're spinning. If on overcommit spinlocks would instead of spin just yield(), we wouldn't have any vcpu running that's just waiting for a late ticket. Yes, but the reality is that most spinlocks are held for a short period of time and there's a low likelihood of being preempted while within a spinlock critical section. Therefore if someone else tries to get the spinlock and there's contention, it's always worth spinning for a little while because the lock will likely become free soon. At least that's the case if the lock has low contention (shallow queue depth and not in slow state). Again, maybe it makes sense to never spin for deep queues or already slowstate locks. We still have an issue finding the point in time when a vcpu could run again, which is what this whole series is about. My point above was that instead of doing a count loop, we could just do the normal spin dance and set the threshold to when we enable the magic to have another spin lock notify us in the CPU. That way we * don't change the uncontended case I don't follow you. What do you mean by the normal spin dance? What do you mean by have another spinlock notify us in the CPU? Don't change which uncontended case? Do you mean in the locking path? Or the unlock path? Or both? * can set the threshold on the host, which knows how contended the system is Hm, I'm not convinced that knowing how contended the system is is all that useful overall. What's important is how contended a particular lock is, and what state the current holder is in. If it's not currently running, then knowing the overall system contention would give you some idea about how long you need to wait for it to be rescheduled, but that's getting pretty indirect. I think the slowpath if preempted while spinning idea I mentioned in the other mail is probably worth following up, since that give specific actionable information to the guest from the hypervisor. 
But lots of caveats.

[[ A possible mechanism:
* register ranges of [er]ips with the hypervisor
* each range is paired with a resched handler block
* if a vcpu is preempted within such a range, make sure it is rescheduled in the resched handler block

This is obviously akin to the exception mechanism, but it is partially implemented by the hypervisor. It allows the spinlock code to be unchanged from native, but make use of a resched rather than an explicit counter to determine when to slowpath the lock. And it's a nice general mechanism that could be potentially useful elsewhere. Unfortunately, it doesn't change the unlock path at all; it still needs to explicitly test if a VCPU needs to be kicked on unlock. ]]

And since we control what spin locks look like, we can for example always keep the pointer to it in a specific register so that we can handle pv_lock_ops.lock_spinning() inside there and fetch all the information we need from our pt_regs. You've left a pile of parts of an idea lying around, but I'm not sure what shape you intend it to be. Speaking of which - have you benchmarked performance degradation of pv ticket locks on bare metal? Last time I checked, enabling all the PV ops did incur significant slowdown which is why I went through the work to split the individual pv ops features up to only enable a few for KVM guests. The whole point of the pv-ticketlock work is to keep the pvops hooks out of the locking fast path, so that the calls are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a slow state. In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native. You're still changing a tight loop that does nothing (CPU detects it and saves power) into something that performs calculations. It still has a pause instruction in that loop, so that CPU mechanism will still come into play.
pause doesn't directly save power; it's more about making sure that memory dependence cycles are broken and that two competing threads will make similar progress. Besides I'm not sure adding a dec+test to a loop that's already got a memory read and compare in it is adding much in the way of calculations. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH RFC V4 0/5] kvm : Paravirt-spinlock support for KVM guests
are only made on the slow path - that is, when spinning too long on a contended lock, and when releasing a lock that's in a slow state. In the fast path case of no contention, there are no pvops, and the executed code path is almost identical to native. But as I mentioned above, I'd like to see some benchmarks to prove that's the case. J Changes in V4: - rebased to 3.2.0 pre. - use APIC ID for kicking the vcpu and use kvm_apic_match_dest for matching. (Avi) - fold vcpu->kicked flag into vcpu->requests (KVM_REQ_PVLOCK_KICK) and related changes for UNHALT path to make pv ticket spinlock migration friendly. (Avi, Marcello) - Added documentation for CPUID, hypercall (KVM_HC_KICK_CPU) and capability (KVM_CAP_PVLOCK_KICK) (Avi) - Remove unneeded kvm_arch_vcpu_ioctl_set_mpstate call. (Marcello) - cumulative variable type changed (int == u32) in add_stat (Konrad) - remove unneeded kvm_guest_init for !CONFIG_KVM_GUEST case Changes in V3: - rebased to 3.2-rc1 - use halt() instead of wait for kick hypercall. - modify kick hypercall to do wakeup halted vcpu. - hook kvm_spinlock_init to smp_prepare_cpus call (moved the call out of head##.c). - fix the potential race when zero_stat is read. - export debugfs_create_32 and add documentation to API. - use static inline and enum instead of ADDSTAT macro. - add barrier() after setting kick_vcpu. - empty static inline function for kvm_spinlock_init. - combine patches one and two to reduce overhead. - make KVM_DEBUGFS depend on DEBUGFS. - include debugfs header unconditionally. 
Changes in V2: - rebased patches to -rc9 - synchronization related changes based on Jeremy's changes (Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com) pointed out by Stephan Diestelhorst stephan.diestelho...@amd.com - enabling 32 bit guests - split patches into two more chunks Srivatsa Vaddagiri, Suzuki Poulose, Raghavendra K T (5): Add debugfs support to print u32-arrays in debugfs Add a hypercall to KVM hypervisor to support pv-ticketlocks Added configuration support to enable debug information for KVM Guests pv-ticketlocks support for linux guests running on KVM hypervisor Add documentation on Hypercalls and features used for PV spinlock Test Set up : The BASE patch is pre 3.2.0 + Jeremy's following patches: xadd (https://lkml.org/lkml/2011/10/4/328) x86/ticketlocklock (https://lkml.org/lkml/2011/10/12/496). Kernel for host/guest : 3.2.0 + Jeremy's xadd, pv spinlock patches as BASE (Note: locked add change is not taken yet) Results: The performance gain is mainly because of reduced busy-wait time. From the results we can see that patched kernel performance is similar to BASE when there is no lock contention. But once we start seeing more contention, the patched kernel outperforms BASE (non-PLE). On the PLE machine we do not see greater performance improvement because of PLE complementing halt(). 3 guests with 8 VCPUs, 4GB RAM, 1 used for kernbench (kernbench -f -H -M -o 20), others for cpuhog (shell script: while true, with an instruction) scenario A: unpinned 1x: no hogs 2x: 8 hogs in one guest 3x: 8 hogs each in two guests scenario B: unpinned, run kernbench on all the guests, no hogs. Dbench on PLE machine: dbench run on all the guests simultaneously with dbench --warmup=30 -t 120 with NRCLIENTS=(8/16/32). 
Result for Non-PLE machine :
Machine : IBM xSeries with Intel(R) Xeon(R) X5570 2.93GHz CPU, 8 cores, 64GB RAM

            BASE                 BASE+patch           %improvement
            mean (sd)            mean (sd)
Scenario A:
case 1x:    164.233 (16.5506)    163.584 (15.4598)    0.39517
case 2x:    897.654 (543.993)    328.63  (103.771)    63.3901
case 3x:    2855.73 (2201.41)    315.029 (111.854)    88.9685

Dbench: Throughput is in MB/sec
NRCLIENTS   BASE                 BASE+patch           %improvement
            mean (sd)            mean (sd)
8           1.774307 (0.061361)  1.725667 (0.034644)  -2.74135
16          1.445967 (0.044805)  1.463173 (0.094399)  1.18993
32          2.136667 (0.105717)  2.193792 (0.129357)  2.67356

Result for PLE machine:
==
Machine : IBM xSeries with Intel(R) Xeon(R) X7560 2.27GHz CPU, 32/64 cores, with 8 online cores and 4*64GB RAM

Kernbench:
            BASE                 BASE+patch           %improvement
            mean (sd)            mean (sd)
Scenario A:
case 1x:    161.263 (56.518)     159.635 (40.5621)    1.00953
case 2x:    190.748 (61.2745)    190.606 (54.4766)    0.0744438
case 3x:    227.378 (100.215)    225.442 (92.0809)    0.851446
Scenario B: 446.104 (58.54)      433.12733 (54.476)   2.91

Dbench: Throughput is in MB/sec
NRCLIENTS   BASE                 BASE+patch
Re: [Xen-devel] [PATCH] xen: remove CONFIG_XEN_DOM0 compile option
On 01/10/2012 05:44 AM, Stefano Stabellini wrote: On Mon, 9 Jan 2012, Andrew Jones wrote: I guess if we did the s/XEN_DOM0/LOCAL_APIC && IO_APIC && ACPI/ in arch/x86/pci/xen.c it would be pretty easy to review for equivalence. Then keep CONFIG_PRIVILEGED, but drop XEN_DOM0 from everywhere else and compile in the 3 files. I don't think it makes much sense to do it though. XEN_DOM0 keeps things tidier now and might be useful later. We can keep things clean with the following: #if defined(CONFIG_LOCAL_APIC) && defined(CONFIG_IO_APIC) && defined(CONFIG_ACPI) && defined(CONFIG_PCI_XEN) #define XEN_DOM0 #endif in include/xen/xen.h. So in the source files we can still '#ifdef XEN_DOM0', but at the same time we can get rid of the build symbol: everybody wins. No, really, I think this is silly. This kind of dependency information should be encoded in the Kconfig system, and not reimplemented in an ad-hoc way. If there were a clean way to do what Andrew wants then I'd support it, but this thread has descended into a spiral of madness, which makes me think it's a lost cause. If the root complaint is that customers think that anything set in .config is a supported feature, then the solutions are to support all the features in .config, re-educate the customers that they're wrong, or maintain a local patch to do this stuff. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 0/6] Collected vdso/vsyscall fixes for 3.1
On 08/03/2011 06:59 AM, Andrew Lutomirski wrote: On Wed, Aug 3, 2011 at 9:56 AM, Konrad Rzeszutek Wilk konrad.w...@oracle.com wrote: On Wed, Aug 03, 2011 at 09:53:22AM -0400, Konrad Rzeszutek Wilk wrote: On Wed, Aug 03, 2011 at 09:31:48AM -0400, Andy Lutomirski wrote: This fixes various problems that cropped up with the vdso patches. - Patch 1 fixes an information leak to userspace. - Patches 2 and 3 fix the kernel build on gold. - Patches 4 and 5 fix Xen (I hope). - Patch 6 (optional) adds a trace event to vsyscall emulation. It will make it easier to handle performance regression reports :) Hm, you seemed to have the x86 maintainers on your email.. I definitely need some coffee. What I meant was that you were missing putting the x86 maintainers on this patchset. They are the ones that will handle this. I put them on the To list for you. Are you sure about that coffee? I'm pretty sure I had x...@kernel.org in there. What's the state of this series? Has tip.git picked it up, or does it need another go-around? I just booted a Linus tree and got a backtrace that looks like this issue, though I haven't looked into it in detail yet. Thanks, J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: linux-next: Tree for July 25 (xen)
On 08/08/2011 06:01 AM, Steven Rostedt wrote: On Sat, 2011-08-06 at 11:22 -0700, Jeremy Fitzhardinge wrote: On 08/05/2011 06:58 PM, Konrad Rzeszutek Wilk wrote: On Fri, Aug 05, 2011 at 11:14:23PM +0200, Ingo Molnar wrote: * Ingo Molnar mi...@elte.hu wrote: * Randy Dunlap rdun...@xenotime.net wrote: On 08/04/11 15:40, Konrad Rzeszutek Wilk wrote: On Thu, Aug 04, 2011 at 06:32:59PM -0400, Konrad Rzeszutek Wilk wrote: These build failures are still triggering upstream: arch/x86/xen/trace.c:44:2: error: array index in initializer not of integer type arch/x86/xen/trace.c:44:2: error: (near initialization for ‘xen_hypercall_names’) arch/x86/xen/trace.c:45:1: error: ‘__HYPERVISOR_arch_4’ undeclared here (not in a function) arch/x86/xen/trace.c:45:2: error: array index in initializer not of integer type arch/x86/xen/trace.c:45:2: error: (near initialization for ‘xen_hypercall_names’) Oh, that I haven't seen. Can you send me the .config for that please. You can't be trying very hard then. I see lots of these (but no, Ah, I am getting it now. Thanks for reporting it. This should do the trick: Acked-by: Randy Dunlap rdun...@xenotime.net That patch did the trick here too: Acked-by: Ingo Molnar mi...@elte.hu Except that i'm still seeing the occasional build failure - see the error log below. Config attached. Much appreciate for the report. I believe this fix (which I think hit linux-next yesterday?) should do that: commit 1e9ea2656b656edd3c8de98675bbc0340211b5bd Author: Jeremy Fitzhardinge jer...@goop.org Date: Wed Aug 3 09:43:44 2011 -0700 xen/tracing: it looks like we wanted CONFIG_FTRACE Apparently we wanted CONFIG_FTRACE rather the CONFIG_FUNCTION_TRACER. 
Reported-by: Sander Eikelenboom li...@eikelenboom.it
Tested-by: Sander Eikelenboom li...@eikelenboom.it
Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 45e94ac..3326204 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -15,7 +15,7 @@ obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \
 			grant-table.o suspend.o platform-pci-unplug.o \
 			p2m.o
-obj-$(CONFIG_FUNCTION_TRACER) += trace.o
+obj-$(CONFIG_FTRACE) += trace.o

I'm not sure this is correct either. Maybe it should be CONFIG_TRACEPOINTS? Steven? Actually, I believe the correct answer is: CONFIG_EVENT_TRACING OK, thanks. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: linux-next: Tree for July 25 (xen)
On 08/05/2011 06:58 PM, Konrad Rzeszutek Wilk wrote: On Fri, Aug 05, 2011 at 11:14:23PM +0200, Ingo Molnar wrote: * Ingo Molnar mi...@elte.hu wrote: * Randy Dunlap rdun...@xenotime.net wrote: On 08/04/11 15:40, Konrad Rzeszutek Wilk wrote: On Thu, Aug 04, 2011 at 06:32:59PM -0400, Konrad Rzeszutek Wilk wrote: These build failures are still triggering upstream: arch/x86/xen/trace.c:44:2: error: array index in initializer not of integer type arch/x86/xen/trace.c:44:2: error: (near initialization for ‘xen_hypercall_names’) arch/x86/xen/trace.c:45:1: error: ‘__HYPERVISOR_arch_4’ undeclared here (not in a function) arch/x86/xen/trace.c:45:2: error: array index in initializer not of integer type arch/x86/xen/trace.c:45:2: error: (near initialization for ‘xen_hypercall_names’) Oh, that I haven't seen. Can you send me the .config for that please. You can't be trying very hard then. I see lots of these (but no, Ah, I am getting it now. Thanks for reporting it. This should do the trick: Acked-by: Randy Dunlap rdun...@xenotime.net That patch did the trick here too: Acked-by: Ingo Molnar mi...@elte.hu Except that i'm still seeing the occasional build failure - see the error log below. Config attached. Much appreciate for the report. I believe this fix (which I think hit linux-next yesterday?) should do that: commit 1e9ea2656b656edd3c8de98675bbc0340211b5bd Author: Jeremy Fitzhardinge jer...@goop.org Date: Wed Aug 3 09:43:44 2011 -0700 xen/tracing: it looks like we wanted CONFIG_FTRACE Apparently we wanted CONFIG_FTRACE rather the CONFIG_FUNCTION_TRACER. 
Reported-by: Sander Eikelenboom li...@eikelenboom.it
Tested-by: Sander Eikelenboom li...@eikelenboom.it
Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com
Signed-off-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com

diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 45e94ac..3326204 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -15,7 +15,7 @@ obj-y := enlighten.o setup.o multicalls.o mmu.o irq.o \
 			grant-table.o suspend.o platform-pci-unplug.o \
 			p2m.o
-obj-$(CONFIG_FUNCTION_TRACER) += trace.o
+obj-$(CONFIG_FTRACE) += trace.o

I'm not sure this is correct either. Maybe it should be CONFIG_TRACEPOINTS? Steven? Thanks, J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 5/5] x86-64: Add user_64bit_mode paravirt op
On 07/26/2011 08:20 PM, Andy Lutomirski wrote: Three places in the kernel assume that the only long mode CPL 3 selector is __USER_CS. This is not true on Xen -- Xen's sysretq changes cs to the magic value 0xe033. Two of the places are corner cases, but as of x86-64: Improve vsyscall emulation CS and RIP handling (c9712944b2a12373cb6ff8059afcfb7e826a6c54), vsyscalls will segfault if called with Xen's extra CS selector. This causes a panic when older init builds die. It seems impossible to make Xen use __USER_CS reliably without taking a performance hit on every system call, so this fixes the tests instead with a new paravirt op. It's a little ugly because ptrace.h can't include paravirt.h. Signed-off-by: Andy Lutomirski l...@mit.edu Reported-by: Konrad Rzeszutek Wilk konrad.w...@oracle.com
---
 arch/x86/include/asm/desc.h           |  4 ++--
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/ptrace.h         | 19 +++
 arch/x86/kernel/paravirt.c            |  4
 arch/x86/kernel/step.c                |  2 +-
 arch/x86/kernel/vsyscall_64.c         |  6 +-
 arch/x86/mm/fault.c                   |  2 +-
 arch/x86/xen/enlighten.c              |  1 +
 8 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 7b439d9..41935fa 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -27,8 +27,8 @@ static inline void fill_ldt(struct desc_struct *desc, const struct user_desc *in
 	desc->base2 = (info->base_addr & 0xff000000) >> 24;
 	/*
-	 * Don't allow setting of the lm bit. It is useless anyway
-	 * because 64bit system calls require __USER_CS:
+	 * Don't allow setting of the lm bit. It would confuse
+	 * user_64bit_mode and would get overridden by sysret anyway.
 	 */
 	desc->l = 0;
 }

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 2c76521..8e8b9a4 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -41,6 +41,7 @@
 #include <asm/desc_defs.h>
 #include <asm/kmap_types.h>
+#include <asm/pgtable_types.h>
 struct page;
 struct thread_struct;
@@ -63,6 +64,11 @@ struct paravirt_callee_save {
 struct pv_info {
 	unsigned int kernel_rpl;
 	int shared_kernel_pmd;
+
+#ifdef CONFIG_X86_64
+	u16 extra_user_64bit_cs;  /* __USER_CS if none */
+#endif
+
 	int paravirt_enabled;
 	const char *name;
 };

diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 94e7618..3566454 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -131,6 +131,9 @@ struct pt_regs {
 #ifdef __KERNEL__
 #include <linux/init.h>
+#ifdef CONFIG_PARAVIRT
+#include <asm/paravirt_types.h>
+#endif
 struct cpuinfo_x86;
 struct task_struct;
@@ -187,6 +190,22 @@ static inline int v8086_mode(struct pt_regs *regs)
 #endif
 }
+#ifdef CONFIG_X86_64
+static inline bool user_64bit_mode(struct pt_regs *regs)
+{
+#ifndef CONFIG_PARAVIRT
+	/*
+	 * On non-paravirt systems, this is the only long mode CPL 3
+	 * selector. We do not allow long mode selectors in the LDT.
+	 */
+	return regs->cs == __USER_CS;
+#else
+	/* Headers are too twisted for this to go in paravirt.h. */
+	return regs->cs == __USER_CS || regs->cs == pv_info.extra_user_64bit_cs;

Is this necessary because usermode may sometimes be on __USER_CS or sometimes on Xen's? Could we just commit to one or the other and make it a simple comparison? What if __USER_CS were a variable? J

+#endif
+}
+#endif
+
 /*
  * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
  * when it traps.
The previous stack will be directly underneath the saved

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 613a793..d90272e 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -307,6 +307,10 @@ struct pv_info pv_info = {
 	.paravirt_enabled = 0,
 	.kernel_rpl = 0,
 	.shared_kernel_pmd = 1,	/* Only used when CONFIG_X86_PAE is set */
+
+#ifdef CONFIG_X86_64
+	.extra_user_64bit_cs = __USER_CS,
+#endif
 };
 struct pv_init_ops pv_init_ops = {

diff --git a/arch/x86/kernel/step.c b/arch/x86/kernel/step.c
index 7977f0c..c346d11 100644
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -74,7 +74,7 @@ static int is_setting_trap_flag(struct task_struct *child, struct pt_regs *regs)
 #ifdef CONFIG_X86_64
 	case 0x40 ... 0x4f:
-		if (regs->cs != __USER_CS)
+		if (!user_64bit_mode(regs))
 			/* 32-bit mode: register increment */
 			return 0;
 		/* 64-bit mode: REX prefix */
diff --git
Re: [PATCH 09/11] xen/xenbus: use printk_ratelimited() instead of printk_ratelimit()
On 06/16/2011 05:14 AM, Manuel Zerpies wrote: Since printk_ratelimit() shouldn't be used anymore (see comment in include/linux/printk.h), replace it with printk_ratelimited(). Looks OK to me, but please fix the indentation of the rest of the statement to match. Thanks, J Signed-off-by: Manuel Zerpies manuel.f.zerp...@ww.stud.uni-erlangen.de
---
 drivers/xen/xenbus/xenbus_xs.c | 4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/xen/xenbus/xenbus_xs.c b/drivers/xen/xenbus/xenbus_xs.c
index 5534690..4055858 100644
--- a/drivers/xen/xenbus/xenbus_xs.c
+++ b/drivers/xen/xenbus/xenbus_xs.c
@@ -44,6 +44,7 @@
 #include <linux/rwsem.h>
 #include <linux/module.h>
 #include <linux/mutex.h>
+#include <linux/ratelimit.h>
 #include <xen/xenbus.h>
 #include "xenbus_comms.h"
@@ -270,8 +271,7 @@ static void *xs_talkv(struct xenbus_transaction t,
 	}
 	if (msg.type != type) {
-		if (printk_ratelimit())
-			printk(KERN_WARNING
+		printk_ratelimited(KERN_WARNING
 			       "XENBUS unexpected type [%d], expected [%d]\n",
 			       msg.type, type);
 		kfree(ret);

___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [patch] xen: off by one errors in multicalls.c
On 06/02/2011 09:45 PM, Dan Carpenter wrote: b->args[] has MC_ARGS elements, so the comparisons here should be >= instead of >. Otherwise we read one element past the end of the array. Yeah, looks like a correct fix. Fortunately I don't think anything currently hits that path in practice, though there are some pending patches which will exercise it more. Thanks, J Signed-off-by: Dan Carpenter erro...@gmail.com
---
This is a static checker patch and I haven't tested it. Please review carefully.

diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
index 8bff7e7..1b2b73f 100644
--- a/arch/x86/xen/multicalls.c
+++ b/arch/x86/xen/multicalls.c
@@ -189,10 +189,10 @@ struct multicall_space __xen_mc_entry(size_t args)
 	unsigned argidx = roundup(b->argidx, sizeof(u64));
 	BUG_ON(preemptible());
-	BUG_ON(b->argidx > MC_ARGS);
+	BUG_ON(b->argidx >= MC_ARGS);
 	if (b->mcidx == MC_BATCH ||
-	    (argidx + args) > MC_ARGS) {
+	    (argidx + args) >= MC_ARGS) {
 		mc_stats_flush(b->mcidx == MC_BATCH ? FL_SLOTS : FL_ARGS);
 		xen_mc_flush();
 		argidx = roundup(b->argidx, sizeof(u64));
@@ -206,7 +206,7 @@ struct multicall_space __xen_mc_entry(size_t args)
 	ret.args = &b->args[argidx];
 	b->argidx = argidx + args;
-	BUG_ON(b->argidx > MC_ARGS);
+	BUG_ON(b->argidx >= MC_ARGS);
 	return ret;
 }
@@ -216,7 +216,7 @@ struct multicall_space xen_mc_extend_args(unsigned long op, size_t size)
 	struct multicall_space ret = { NULL, NULL };
 	BUG_ON(preemptible());
-	BUG_ON(b->argidx > MC_ARGS);
+	BUG_ON(b->argidx >= MC_ARGS);
 	if (b->mcidx == 0)
 		return ret;
 	if (b->entries[b->mcidx - 1].op != op)
 		return ret;
-	if ((b->argidx + size) > MC_ARGS)
+	if ((b->argidx + size) >= MC_ARGS)
 		return ret;
 	ret.mc = &b->entries[b->mcidx - 1];
 	ret.args = &b->args[b->argidx];
 	b->argidx += size;
-	BUG_ON(b->argidx > MC_ARGS);
+	BUG_ON(b->argidx >= MC_ARGS);
 	return ret;
 }

___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: [PATCH] xen: drop anti-dependency on X86_VISWS
On 04/09/2011 03:34 AM, Ian Campbell wrote: Actually it does - see the #ifndef CONFIG_X86_CMPXCHG section in asm/cmpxchg_32.h. Hm, OK. Still, I'm happiest with that dependency in case someone knobbles the cpu to exclude cmpxchg and breaks things. Dropping the TSC patch is sensible though? You mean dropping the TSC dependency? Yes, I think so. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] xen: drop anti-dependency on X86_VISWS
On 04/07/2011 11:38 PM, Ian Campbell wrote: Not really. The TSC register is a requirement, but that's going to be present on any CPU which can boot Xen. We don't need any of the kernel's TSC machinery though. So why the Kconfig dependency then? In principle a kernel compiled for a non-TSC processor (which meets the other requirements for Xen, such as PAE support) will run just fine under Xen on a newer piece of hardware. Not sure where it came from. It was probably never needed, or just added for some secondary effect we wanted. Is there any downside to this patch (is X86_CMPXCHG in the same sort of boat?) Only if we don't use cmpxchg in shared memory with other domains or the hypervisor. (I don't think it will dynamically switch between real and emulated cmpxchg depending on availability.) J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: [PATCH] xen: drop anti-dependency on X86_VISWS
On 04/08/2011 08:42 AM, Jan Beulich wrote: On 08.04.11 at 17:25, Jeremy Fitzhardinge jer...@goop.org wrote: On 04/07/2011 11:38 PM, Ian Campbell wrote: Is there any downside to this patch (is X86_CMPXCHG in the same sort of boat?) Only if we don't use cmpxchg in shared memory with other domains or the hypervisor. (I don't think it will dynamically switch between real and emulated cmpxchg depending on availability.) Actually it does - see the #ifndef CONFIG_X86_CMPXCHG section in asm/cmpxchg_32.h. Hm, OK. Still, I'm happiest with that dependency in case someone knobbles the cpu to exclude cmpxchg and breaks things. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] xen: drop anti-dependency on X86_VISWS
On 04/06/2011 11:58 PM, Ian Campbell wrote: On Wed, 2011-04-06 at 22:45 +0100, David Miller wrote: From: Ian Campbell ian.campb...@eu.citrix.com Date: Mon, 4 Apr 2011 10:55:55 +0100 You mean the !X86_VISWS I presume? It doesn't make sense to me either. No, I think 32-bit x86 allmodconfig elides XEN because of its X86_TSC dependency. TSC is a real dependency of the Xen interfaces. Not really. The TSC register is a requirement, but that's going to be present on any CPU which can boot Xen. We don't need any of the kernel's TSC machinery though. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On 01/22/2011 06:53 AM, Rik van Riel wrote: The main question that remains is whether the PV ticketlocks are a large enough improvement to also merge those. I expect they will be, and we'll see so in the benchmark numbers. The pathological worst-case of ticket locks in a virtual environment isn't very hard to hit, so I expect they'll make a huge difference there. On lightly loaded systems (little or no CPU overcommit) then they should be close to a no-op. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 01/19/2011 08:23 AM, Srivatsa Vaddagiri wrote: On Mon, Jan 17, 2011 at 08:52:22PM +0530, Srivatsa Vaddagiri wrote: I think this is still racy ..

Unlocker			Locker

test slowpath
    - false
				set slowpath flag
				test for lock pickup
				    - fail
				block
unlock

unlock needs to happen first before testing slowpath? I have made that change for my KVM guest and it seems to be working well with that change .. Will cleanup and post my patches shortly Patch below fixes the race described above. You can fold this to your patch 13/14 if you agree this is in the right direction. Signed-off-by: Srivatsa Vaddagiri va...@linux.vnet.ibm.com
---
 arch/x86/include/asm/spinlock.h      |  7 +++----
 arch/x86/kernel/paravirt-spinlocks.c | 22 +++++-----------------
 2 files changed, 8 insertions(+), 21 deletions(-)

Index: linux-2.6.37/arch/x86/include/asm/spinlock.h
===================================================================
--- linux-2.6.37.orig/arch/x86/include/asm/spinlock.h
+++ linux-2.6.37/arch/x86/include/asm/spinlock.h
@@ -55,7 +55,7 @@ static __always_inline void __ticket_unl
 /* Only defined when CONFIG_PARAVIRT_SPINLOCKS defined, but may as
  * well leave the prototype always visible. */
-extern void __ticket_unlock_release_slowpath(struct arch_spinlock *lock);
+extern void __ticket_unlock_slowpath(struct arch_spinlock *lock);
 #ifdef CONFIG_PARAVIRT_SPINLOCKS
@@ -166,10 +166,9 @@ static __always_inline int arch_spin_try
 static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
 {
 	barrier();		/* prevent reordering out of locked region */
+	__ticket_unlock_release(lock);
 	if (unlikely(__ticket_in_slowpath(lock)))
-		__ticket_unlock_release_slowpath(lock);
-	else
-		__ticket_unlock_release(lock);
+		__ticket_unlock_slowpath(lock);
 	barrier();		/* prevent reordering into locked region */
 }

Index: linux-2.6.37/arch/x86/kernel/paravirt-spinlocks.c
===================================================================
--- linux-2.6.37.orig/arch/x86/kernel/paravirt-spinlocks.c
+++ linux-2.6.37/arch/x86/kernel/paravirt-spinlocks.c
@@ -22,33 +22,21 @@ EXPORT_SYMBOL(pv_lock_ops);
  * bits.
 * However, we need to be careful about this because someone
 * may just be entering as we leave, and enter the slowpath.
 */
-void __ticket_unlock_release_slowpath(struct arch_spinlock *lock)
+void __ticket_unlock_slowpath(struct arch_spinlock *lock)
 {
 	struct arch_spinlock old, new;
 	BUILD_BUG_ON(((__ticket_t)NR_CPUS) != NR_CPUS);
 	old = ACCESS_ONCE(*lock);
-	new = old;
-	new.tickets.head += TICKET_LOCK_INC;
 	/* Clear the slowpath flag */
 	new.tickets.tail &= ~TICKET_SLOWPATH_FLAG;
+	if (new.tickets.head == new.tickets.tail)
+		cmpxchg(&lock->head_tail, old.head_tail, new.head_tail);
-	/*
-	 * If there's currently people waiting or someone snuck in
-	 * since we read the lock above, then do a normal unlock and
-	 * kick. If we managed to unlock with no queued waiters, then
-	 * we can clear the slowpath flag.
-	 */
-	if (new.tickets.head != new.tickets.tail ||
-	    cmpxchg(&lock->head_tail,
-		    old.head_tail, new.head_tail) != old.head_tail) {
-		/* still people waiting */
-		__ticket_unlock_release(lock);
-	}
-
+	/* Wake up an appropriate waiter */
 	__ticket_unlock_kick(lock, new.tickets.head);

Does the __ticket_unlock_kick need to be unconditional? If the cmpxchg() fails, then it means someone else has come in and changed the lock under our feet, and presumably they'll do the kick as needed? Or is there the possibility that the kick will get lost? There's certainly no harm in doing a redundant kick, and I'd expect that case to be pretty rare... J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On 01/20/2011 03:42 AM, Srivatsa Vaddagiri wrote: On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote: The reason for wanting this should be clear I guess, it allows PI. Well, if we can expand the spinlock to include an owner, then all this becomes moot... How so? Having an owner will not eliminate the need for pv-ticketlocks afaict. We still need a mechanism to reduce latency in scheduling the next thread-in-waiting, which is achieved by your patches. No, sorry, I should have been clearer. I meant that the effort of not increasing the lock size to record the "in slowpath" state becomes moot. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On 01/20/2011 03:59 AM, Srivatsa Vaddagiri wrote: At least in the Xen code, a current owner isn't very useful, because we need the current owner to kick the *next* owner to life at release time, which we can't do without some structure recording which ticket belongs to which cpu. If we had a yield-to [1] sort of interface _and_ information on which vcpu owns a lock, then lock-spinners can yield-to the owning vcpu, while the unlocking vcpu can yield-to the next-vcpu-in-waiting. Perhaps, but the core problem is how to find the next-vcpu-in-waiting efficiently. Once you have that info, there's a number of things you can usefully do with it. The key here is not to sleep when waiting for locks (as implemented by the current patch-series, which can put other VMs at an advantage by giving them more time than they are entitled to) Why? If a VCPU can't make progress because it's waiting for some resource, then why not schedule something else instead? Presumably when the VCPU does become runnable, the scheduler will credit its previous blocked state and let it run in preference to something else. and also to ensure that lock-owner as well as the next-in-line lock-owner are not unduly made to wait for cpu. Is there a way we can dynamically expand the size of lock only upon contention to include additional information like owning vcpu? Have the lock point to a per-cpu area upon contention where additional details can be stored, perhaps? As soon as you add a pointer to the lock, you're increasing its size. If we had a pointer in there already, then all of this would be moot. If auxiliary per-lock data is uncommon, then using a hash keyed on lock address would be one way to do it. J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock
On 01/19/2011 09:21 AM, Peter Zijlstra wrote: On Wed, 2011-01-19 at 22:42 +0530, Srivatsa Vaddagiri wrote: Add two hypercalls to KVM hypervisor to support pv-ticketlocks. KVM_HC_WAIT_FOR_KICK blocks the calling vcpu until another vcpu kicks it or it is woken up because of an event like interrupt. KVM_HC_KICK_CPU allows the calling vcpu to kick another vcpu. The presence of these hypercalls is indicated to the guest via KVM_FEATURE_WAIT_FOR_KICK/KVM_CAP_WAIT_FOR_KICK. Qemu needs a corresponding patch to pass up the presence of this feature to the guest via cpuid. Patch to qemu will be sent separately. I didn't really read the patch, and I totally forgot everything from when I looked at the Xen series, but does the Xen/KVM hypercall interface for this include the vcpu to await the kick from? My guess is not, since the ticket locks used don't know who the owner is, which is of course, sad. There are FIFO spinlock implementations that can do this though.. although I think they all have a bigger memory footprint. At least in the Xen code, a current owner isn't very useful, because we need the current owner to kick the *next* owner to life at release time, which we can't do without some structure recording which ticket belongs to which cpu. (A reminder: the big problem with ticket locks is not the current owner getting preempted, but making sure the next VCPU gets scheduled quickly when the current one releases; without that all the waiting VCPUs burn their timeslices until the VCPU scheduler gets around to scheduling the actual next in line.) At present, the code needs to scan an array of percpu "I am waiting on lock X with ticket Y" structures to work out who's next. The search is somewhat optimised by keeping a cpuset of which CPUs are actually blocked on spinlocks, but it's still going to scale badly with lots of CPUs. 
I haven't thought of a good way to improve on this; an obvious approach is to just add a pointer to the spinlock and hang an explicit linked list off it, but that's incompatible with wanting to avoid expanding the lock. You could have a table of auxiliary per-lock data hashed on the lock address, but it's not clear to me that it's an improvement on the array approach, especially given the synchronization issues of keeping that structure up to date (do we have a generic lockless hashtable implementation?). But perhaps it's one of those things that makes sense at larger scales. The reason for wanting this should be clear I guess, it allows PI. Well, if we can expand the spinlock to include an owner, then all this becomes moot... J
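For reference, the array approach being compared against amounts to a linear scan of per-cpu "I am waiting on lock X with ticket Y" records at unlock time. A toy model (all names invented; the real Xen code also keeps a cpumask of blocked CPUs to shorten the scan, which this sketch omits):

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 8

/* Hypothetical per-cpu "waiting on lock X with ticket Y" record. */
struct spinning {
    void *lock;
    int   ticket;
};
static struct spinning spinning_on[NR_CPUS];

static int demo_lock;   /* stand-in for a real spinlock */

/* Scan every cpu for the one holding the next ticket on this lock.
 * O(NR_CPUS) work per contended release -- the scaling concern
 * raised in the message above. */
static int find_next_owner(void *lock, int next_ticket)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        if (spinning_on[cpu].lock == lock &&
            spinning_on[cpu].ticket == next_ticket)
            return cpu;
    }
    return -1; /* nobody blocked on that ticket */
}
```

A linked list hung off the lock would make this O(1), but only by growing the lock — exactly the trade-off under discussion.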
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 01/19/2011 10:39 AM, Srivatsa Vaddagiri wrote: I have tested quite extensively with booting a 16-vcpu guest (on a 16-pcpu host) and running kernel compile (with 32 threads). Without this patch, I had difficulty booting/shutting-down successfully (it would hang mid-way). Sounds good. But I like to test with make -j 100-200 to really give things a workout. J
Re: [PATCH] x86/pvclock-xen: zero last_value on resume
On 11/26/2010 02:40 AM, Ingo Molnar wrote: * Jeremy Fitzhardinge jer...@goop.org wrote: On 10/26/2010 09:59 AM, Jeremy Fitzhardinge wrote: If the guest domain has been suspend/resumed or migrated, then the system clock backing the pvclock clocksource may revert to a smaller value (ie, can be non-monotonic across the migration/save-restore). Make sure we zero last_value in that case so that the domain continues to see clock updates. Ping? Looks like this fell through the gaps. Does not apply cleanly here - mind resending the latest version? It rebased cleanly to 2.6.37-rc3. You can pull it from: git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git pvclock-resume Jeremy Fitzhardinge (1): x86/pvclock: zero last_value on resume arch/x86/include/asm/pvclock.h |1 + arch/x86/kernel/pvclock.c |5 + arch/x86/xen/time.c|2 ++ 3 files changed, 8 insertions(+), 0 deletions(-) Thanks, J Subject: [PATCH] x86/pvclock: zero last_value on resume If the guest domain has been suspend/resumed or migrated, then the system clock backing the pvclock clocksource may revert to a smaller value (ie, can be non-monotonic across the migration/save-restore). Make sure we zero last_value in that case so that the domain continues to see clock updates. 
Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 7f7e577..31d84ac 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -11,6 +11,7 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src); void pvclock_read_wallclock(struct pvclock_wall_clock *wall, struct pvclock_vcpu_time_info *vcpu, struct timespec *ts); +void pvclock_resume(void); /* * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction, diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 008b91e..42eb330 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -83,6 +83,11 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) static atomic64_t last_value = ATOMIC64_INIT(0); +void pvclock_resume(void) +{ + atomic64_set(last_value, 0); +} + cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { struct pvclock_shadow_time shadow; diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index b2bb5aa..5da5e53 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -426,6 +426,8 @@ void xen_timer_resume(void) { int cpu; + pvclock_resume(); + if (xen_clockevent != xen_vcpuop_clockevent) return; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
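To see why zeroing last_value matters, here is a toy single-threaded model of pvclock's monotonicity clamp. The real pvclock_clocksource_read() uses an atomic64 cmpxchg loop on last_value; this sketch only illustrates the clamping behaviour and why a migration to a host with a smaller system time freezes the clock until last_value is reset.

```c
#include <assert.h>
#include <stdint.h>

/* last_value only ever grows, so readers never observe time going
 * backwards even if per-vcpu system time is slightly skewed. */
static uint64_t last_value;

static uint64_t pvclock_read(uint64_t system_time)
{
    if (system_time < last_value)
        system_time = last_value;   /* clamp: stay monotonic */
    else
        last_value = system_time;
    return system_time;
}

/* After save/restore or migration the backing clock may restart from a
 * smaller value; without this, every read is clamped to the stale
 * pre-migration last_value and the guest clock appears to stop. */
static void pvclock_resume(void)
{
    last_value = 0;
}
```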
Re: [PATCH] x86/pvclock-xen: zero last_value on resume
On 10/26/2010 09:59 AM, Jeremy Fitzhardinge wrote: If the guest domain has been suspend/resumed or migrated, then the system clock backing the pvclock clocksource may revert to a smaller value (ie, can be non-monotonic across the migration/save-restore). Make sure we zero last_value in that case so that the domain continues to see clock updates. Ping? Looks like this fell through the gaps. J [ I don't know if kvm needs an analogous fix or not. ] Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Cc: Stable Kernel sta...@kernel.org Reported-by: Eelco Dolstra e.dols...@tudelft.nl Reported-by: Olivier Hanesse olivier.hane...@gmail.com Bisected-by: Cédric Schieli cschi...@gmail.com Tested-by: Cédric Schieli cschi...@gmail.com diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index cd02f32..6226870 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -11,5 +11,6 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src); void pvclock_read_wallclock(struct pvclock_wall_clock *wall, struct pvclock_vcpu_time_info *vcpu, struct timespec *ts); +void pvclock_resume(void); #endif /* _ASM_X86_PVCLOCK_H */ diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 239427c..a4f07c1 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -120,6 +120,11 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src) static atomic64_t last_value = ATOMIC64_INIT(0); +void pvclock_resume(void) +{ + atomic64_set(last_value, 0); +} + cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src) { struct pvclock_shadow_time shadow; diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c index b2bb5aa..5da5e53 100644 --- a/arch/x86/xen/time.c +++ b/arch/x86/xen/time.c @@ -426,6 +426,8 @@ void xen_timer_resume(void) { int cpu; + pvclock_resume(); + if (xen_clockevent != xen_vcpuop_clockevent) return; ___ Virtualization mailing list 
Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks
On 11/17/2010 12:11 AM, Jan Beulich wrote: On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote: +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) { -struct xen_spinlock *xl = (struct xen_spinlock *)lock; -struct xen_spinlock *prev; int irq = __get_cpu_var(lock_kicker_irq); -int ret; +struct xen_lock_waiting *w = &__get_cpu_var(lock_waiting); +int cpu = smp_processor_id(); u64 start; /* If kicker interrupts not initialized yet, just spin */ if (irq == -1) -return 0; +return; start = spin_time_start(); -/* announce we're spinning */ -prev = spinning_lock(xl); +w->want = want; +w->lock = lock; + +/* This uses set_bit, which is atomic and therefore a barrier */ +cpumask_set_cpu(cpu, &waiting_cpus); Since you don't allow nesting, don't you need to disable interrupts before you touch per-CPU state? Yes, I think you're right - interrupts need to be disabled for the bulk of this function. J
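The fix being agreed on here has a simple shape: keep interrupts disabled while the per-cpu (lock, want) slot is live, so a spinlock taken from an interrupt handler can't clobber it. A toy model with the interrupt flag simulated as a boolean (local_irq_save()/restore() and the Xen poll are stand-ins, not the actual code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

static bool irqs_enabled = true;   /* simulated CPU interrupt flag */

struct lock_waiting { void *lock; unsigned want; };
static struct lock_waiting lock_waiting;   /* per-cpu in the real code */

static bool slot_was_protected;

static void lock_spinning(void *lock, unsigned want)
{
    bool flags = irqs_enabled;
    irqs_enabled = false;              /* local_irq_save() */

    lock_waiting.lock = lock;          /* per-CPU state now live */
    lock_waiting.want = want;
    /* ... set waiting_cpus bit, last-chance pickup check, poll irq ... */
    slot_was_protected = !irqs_enabled;

    lock_waiting.lock = NULL;          /* slot dead again */
    irqs_enabled = flags;              /* local_irq_restore() */
}
```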
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 11/17/2010 12:31 AM, Jan Beulich wrote: On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote: +static inline void __ticket_enter_slowpath(struct arch_spinlock *lock) +{ +if (sizeof(lock->tickets.tail) == sizeof(u8)) +asm (LOCK_PREFIX "orb %1, %0" + : "+m" (lock->tickets.tail) + : "i" (TICKET_SLOWPATH_FLAG) : "memory"); +else +asm (LOCK_PREFIX "orw %1, %0" + : "+m" (lock->tickets.tail) + : "i" (TICKET_SLOWPATH_FLAG) : "memory"); +} Came only now to mind: Here and elsewhere, did you try using %z0 to have gcc produce the opcode suffix character, rather than having these somewhat ugly if()-s? Actually in this case I'm pretty sure there's already a set-bit function which will do the job. set_bit(), I guess, though it takes a bit number rather than a mask... But, yes, %z0 sounds interesting. Is it documented anywhere? I think I've tried to use it in the past and run into gcc bugs. Thanks, J
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote: But, yes, %z0 sounds interesting. Is it documented anywhere? I think I've tried to use it in the past and run into gcc bugs. This one: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39590 Should be OK in this case because there are no 64-bit values to be seen... J
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 11/17/2010 12:58 AM, Avi Kivity wrote: Actually in this case I'm pretty sure there's already a set bit function which will do the job. set_bit(), I guess, though it takes a bit number rather than a mask... set_bit() operates on a long, while the intel manuals recommend against operating on operands of different size, especially with locked operations. I think newer processors have more relaxed requirements, though. Despite its prototype, set_bit() is pretty specifically using orb for the constant case, or bts otherwise (I don't know what size memory operation bts is considered to generate). J
Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 11/17/2010 12:56 AM, Jeremy Fitzhardinge wrote: On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote: But, yes, %z0 sounds interesting. Is it documented anywhere? I think I've tried to use it in the past and run into gcc bugs. This one: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39590 Should be OK in this case because there's no 64-bit values to be seen... Hm, it fails when __ticket_t is 16 bits: /home/jeremy/git/linux/arch/x86/include/asm/spinlock.h: Assembler messages: /home/jeremy/git/linux/arch/x86/include/asm/spinlock.h:73: Error: suffix or operands invalid for `or' lock; ors $1, 2(%rbx) #, So I don't think that's going to work out... J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [Xen-devel] Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks
On 11/17/2010 12:52 AM, Jeremy Fitzhardinge wrote: On 11/17/2010 12:11 AM, Jan Beulich wrote: On 16.11.10 at 22:08, Jeremy Fitzhardinge jer...@goop.org wrote: +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) { - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - struct xen_spinlock *prev; int irq = __get_cpu_var(lock_kicker_irq); - int ret; + struct xen_lock_waiting *w = &__get_cpu_var(lock_waiting); + int cpu = smp_processor_id(); u64 start; /* If kicker interrupts not initialized yet, just spin */ if (irq == -1) - return 0; + return; start = spin_time_start(); - /* announce we're spinning */ - prev = spinning_lock(xl); + w->want = want; + w->lock = lock; + + /* This uses set_bit, which is atomic and therefore a barrier */ + cpumask_set_cpu(cpu, &waiting_cpus); Since you don't allow nesting, don't you need to disable interrupts before you touch per-CPU state? Yes, I think you're right - interrupts need to be disabled for the bulk of this function. Actually, on second thoughts, maybe it doesn't matter so much. The main issue is making sure that the interrupt will make the VCPU drop out of xen_poll_irq() - if it happens before xen_poll_irq(), it should leave the event pending, which will cause the poll to return immediately. I hope. Certainly disabling interrupts for some of the function will make it easier to analyze with respect to interrupt nesting. Another issue may be making sure the writes and reads of w->want and w->lock are ordered properly to make sure that xen_unlock_kick() never sees an inconsistent view of the (lock,want) tuple. The risk being that xen_unlock_kick() sees a random, spurious (lock,want) pairing and sends the kick event to the wrong VCPU, leaving the deserving one hung. J
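One conventional way to keep the (lock, want) pair consistent for the kicker, sketched here as an illustration rather than the actual series code: treat w->lock as the "slot valid" flag, written last when publishing and cleared first when retracting, with a write barrier between the two stores. smp_wmb() is modeled as a compiler barrier, which is sufficient on x86's ordered stores but is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

#define smp_wmb() __asm__ __volatile__("" ::: "memory")

struct lock_waiting { void *volatile lock; unsigned want; };

static void publish(struct lock_waiting *w, void *lock, unsigned want)
{
    w->want = want;    /* payload first */
    smp_wmb();
    w->lock = lock;    /* pointer last: slot becomes visible */
}

static void retract(struct lock_waiting *w)
{
    w->lock = NULL;    /* pointer first: slot becomes invisible */
    smp_wmb();
    w->want = 0;
}

/* Kicker side: only trust want when the lock pointer matched, so a
 * half-updated slot can never direct a kick at the wrong VCPU. */
static int matches(struct lock_waiting *w, void *lock, unsigned next)
{
    return w->lock == lock && w->want == next;
}
```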
Re: [Xen-devel] Re: [PATCH 13/14] x86/ticketlock: add slowpath logic
On 11/17/2010 04:21 AM, Peter Zijlstra wrote: On Tue, 2010-11-16 at 13:08 -0800, Jeremy Fitzhardinge wrote: Maintain a flag in both LSBs of the ticket lock which indicates whether anyone is in the lock slowpath and may need kicking when the current holder unlocks. The flags are set when the first locker enters the slowpath, and cleared when unlocking to an empty queue. So here you say you set both LSBs in order to keep head == tail working, but the code seems to suggest you only use the tail LSB. I think I see why using only one LSB is sufficient, but some consistency would be nice :-) I tried that initially, but it turned out more messy. The problem is that the flag can change while you're spinning on the lock, so you need to mask it out every time you read tail before you can compare it to head; if head has the flag set too, you just need to mask it out of there as well: ticket = xadd(lock, 2 << TICKET_SHIFT); ticket.tail &= ~TICKET_SLOW_FLAGS; while (ticket.head != ticket.tail) { relax(); ticket.head = lock->head /* & ~TICKET_SLOW_FLAGS */; } IOW setting both doesn't help anything, and just requires an extra mask in the spin loop (and anywhere else that uses 'head'). And hey, extra bit. Bound to be useful for something. J
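The tail-only encoding described above can be shown concretely. This is a sketch of just the layout arithmetic (constants as named in the series; the comparison helper is invented for illustration): tickets advance by 2, the tail's LSB carries the slowpath flag, and head never carries it, so only tail needs masking.

```c
#include <assert.h>
#include <stdint.h>

#define TICKET_LOCK_INC      2   /* tickets step by 2, freeing the LSB */
#define TICKET_SLOWPATH_FLAG 1   /* stored in the tail's LSB only */

typedef uint8_t __ticket_t;
struct tickets { __ticket_t head, tail; };

/* The lock is free when head has caught up with tail, ignoring the
 * slowpath flag; head never carries the flag, so it needs no mask. */
static int lock_is_free(struct tickets t)
{
    return t.head == (__ticket_t)(t.tail & ~TICKET_SLOWPATH_FLAG);
}
```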
Re: [Xen-devel] Re: [PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks
On 11/17/2010 02:34 AM, Jan Beulich wrote: Actually, on second thoughts, maybe it doesn't matter so much. The main issue is making sure that the interrupt will make the VCPU drop out of xen_poll_irq() - if it happens before xen_poll_irq(), it should leave the event pending, which will cause the poll to return immediately. I hope. Certainly disabling interrupts for some of the function will make it easier to analyze with respect to interrupt nesting. That's not my main concern. Instead, what if you get interrupted anywhere here, the interrupt handler tries to acquire another spinlock and also has to go into the slow path? It'll overwrite part or all of the outer context's state. That doesn't matter if the outer context doesn't end up blocking. If it has already blocked then it will unblock as a result of the interrupt; if it hasn't yet blocked, then the inner context will leave the event pending and cause it to not block. Either way, it no longer uses or needs that per-cpu state: it will return to the spin loop and (maybe) get re-entered, setting it all up again. I think there is a problem with the code as posted because it sets up the percpu data before clearing the pending event, so it can end up blocking with bad percpu data. Another issue may be making sure the writes and reads of w->want and w->lock are ordered properly to make sure that xen_unlock_kick() never sees an inconsistent view of the (lock,want) tuple. The risk being that xen_unlock_kick() sees a random, spurious (lock,want) pairing and sends the kick event to the wrong VCPU, leaving the deserving one hung. Yes, proper operation sequence (and barriers) is certainly required here. If you allowed nesting, this may even become simpler (as you'd have a single write making visible the new head pointer, after having written all relevant fields of the new head structure).
Yes, simple nesting should be quite straightforward (ie allowing an interrupt handler to take some other lock than the one the outer context is waiting on). J ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 03/14] x86/ticketlock: Use C for __ticket_spin_unlock
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com If we don't need to use a locked inc for unlock, then implement it in C. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 32 +--- 1 files changed, 17 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index f48a6e3..0170ba9 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -33,9 +33,21 @@ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock * (PPro errata 66, 92) */ -# define UNLOCK_LOCK_PREFIX LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ + if (sizeof(lock-tickets.head) == sizeof(u8)) + asm (LOCK_PREFIX incb %0 +: +m (lock-tickets.head) : : memory); + else + asm (LOCK_PREFIX incw %0 +: +m (lock-tickets.head) : : memory); + +} #else -# define UNLOCK_LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ + lock-tickets.head++; +} #endif /* @@ -93,14 +105,6 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) return tmp; } - -static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) -{ - asm volatile(UNLOCK_LOCK_PREFIX incb %0 -: +m (lock-slock) -: -: memory, cc); -} #else static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { @@ -144,15 +148,13 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) return tmp; } +#endif static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) { - asm volatile(UNLOCK_LOCK_PREFIX incw %0 -: +m (lock-slock) -: -: memory, cc); + __ticket_unlock_release(lock); + barrier(); /* prevent reordering into locked region */ } -#endif static inline int __ticket_spin_is_locked(arch_spinlock_t *lock) { -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 02/14] x86/ticketlock: convert spin loop to C
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com The inner loop of __ticket_spin_lock isn't doing anything very special, so reimplement it in C. For the 8 bit ticket lock variant, we use a register union to get direct access to the lower and upper bytes in the tickets, but unfortunately gcc won't generate a direct comparison between the two halves of the register, so the generated asm isn't quite as pretty as the hand-coded version. However benchmarking shows that this is actually a small improvement in runtime performance on some benchmarks, and never a slowdown. We also need to make sure there's a barrier at the end of the lock loop to make sure that the compiler doesn't move any instructions from within the locked region into the region where we don't yet own the lock. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 58 +++--- 1 files changed, 29 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index d6d5784..f48a6e3 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -58,21 +58,21 @@ #if (NR_CPUS 256) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - unsigned short inc = 1 TICKET_SHIFT; - - asm volatile ( - LOCK_PREFIX xaddw %w0, %1\n - 1:\t - cmpb %h0, %b0\n\t - je 2f\n\t - rep ; nop\n\t - movb %1, %b0\n\t - /* don't need lfence here, because loads are in-order */ - jmp 1b\n - 2: - : +Q (inc), +m (lock-slock) - : - : memory, cc); + register union { + struct __raw_tickets tickets; + unsigned short slock; + } inc = { .slock = 1 TICKET_SHIFT }; + + asm volatile (LOCK_PREFIX xaddw %w0, %1\n + : +Q (inc), +m (lock-slock) : : memory, cc); + + for (;;) { + if (inc.tickets.head == inc.tickets.tail) + goto out; + cpu_relax(); + inc.tickets.head = ACCESS_ONCE(lock-tickets.head); + } +out: barrier(); /* make sure nothing creeps before the lock is taken */ } static __always_inline int 
__ticket_spin_trylock(arch_spinlock_t *lock) @@ -105,22 +105,22 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { unsigned inc = 1 << TICKET_SHIFT; - unsigned tmp; + __ticket_t tmp; - asm volatile(LOCK_PREFIX xaddl %0, %1\n -movzwl %w0, %2\n\t -shrl $16, %0\n\t -1:\t -cmpl %0, %2\n\t -je 2f\n\t -rep ; nop\n\t -movzwl %1, %2\n\t -/* don't need lfence here, because loads are in-order */ -jmp 1b\n -2: -: +r (inc), +m (lock->slock), =&r (tmp) -: -: memory, cc); + asm volatile(LOCK_PREFIX xaddl %0, %1\n\t +: +r (inc), +m (lock->slock) +: : memory, cc); + + tmp = inc; + inc >>= TICKET_SHIFT; + + for (;;) { + if ((__ticket_t)inc == tmp) + goto out; + cpu_relax(); + tmp = ACCESS_ONCE(lock->tickets.head); + } +out: barrier(); /* make sure nothing creeps before the lock is taken */ } static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) -- 1.7.2.3
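The converted fastpath can be modeled in plain C. This is a single-threaded toy: the locked xadd and cpu_relax() are simulated, there are no atomics or barriers, and it only demonstrates the ticket arithmetic of the patch above, not a usable lock.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t __ticket_t;
struct raw_tickets { __ticket_t head, tail; };

/* Take a ticket (models the xadd returning the old value), then spin
 * until head catches up with our ticket. */
static void ticket_lock(struct raw_tickets *lock)
{
    struct raw_tickets inc = *lock;   /* old value, as xadd returns */
    lock->tail += 1;                  /* xadd increments tail */

    while (inc.head != inc.tail) {
        /* cpu_relax() here in the real code */
        inc.head = lock->head;        /* re-read only head, not tail */
    }
    /* barrier() in the real code: nothing may creep above this point */
}

static void ticket_unlock(struct raw_tickets *lock)
{
    lock->head += 1;                  /* pass the lock to the next ticket */
}
```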
[PATCH 04/14] x86/ticketlock: make large and small ticket versions of spin_lock the same
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Make the bulk of __ticket_spin_lock look identical for large and small number of cpus. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 23 --- 1 files changed, 8 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 0170ba9..1b81809 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -70,19 +70,16 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) #if (NR_CPUS 256) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - register union { - struct __raw_tickets tickets; - unsigned short slock; - } inc = { .slock = 1 TICKET_SHIFT }; + register struct __raw_tickets inc = { .tail = 1 }; asm volatile (LOCK_PREFIX xaddw %w0, %1\n - : +Q (inc), +m (lock-slock) : : memory, cc); + : +r (inc), +m (lock-tickets) : : memory, cc); for (;;) { - if (inc.tickets.head == inc.tickets.tail) + if (inc.head == inc.tail) goto out; cpu_relax(); - inc.tickets.head = ACCESS_ONCE(lock-tickets.head); + inc.head = ACCESS_ONCE(lock-tickets.head); } out: barrier(); /* make sure nothing creeps before the lock is taken */ } @@ -108,21 +105,17 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) #else static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - unsigned inc = 1 TICKET_SHIFT; - __ticket_t tmp; + register struct __raw_tickets inc = { .tail = 1 }; asm volatile(LOCK_PREFIX xaddl %0, %1\n\t -: +r (inc), +m (lock-slock) +: +r (inc), +m (lock-tickets) : : memory, cc); - tmp = inc; - inc = TICKET_SHIFT; - for (;;) { - if ((__ticket_t)inc == tmp) + if (inc.head == inc.tail) goto out; cpu_relax(); - tmp = ACCESS_ONCE(lock-tickets.head); + inc.head = ACCESS_ONCE(lock-tickets.head); } out: barrier(); /* make sure nothing creeps before the lock is taken */ } -- 1.7.2.3 ___ Virtualization mailing list 
Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 00/14] PV ticket locks without expanding spinlock
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Hi all, This is a revised version of the pvticket lock series. The early part of the series is mostly unchanged: it converts the bulk of the ticket lock code into C and makes the small and large ticket code common. The only changes are the incorporation of various review comments. The latter part of the series converts from pv spinlocks to pv ticket locks (ie, using the ticket lock fastpath as-is, but adding pv ops for the ticketlock slowpaths). The significant difference here is that rather than adding a new ticket_t-sized element to arch_spinlock_t - effectively doubling the size - I steal the LSB of the tickets themselves to store a bit. This allows the structure to remain the same size, but at the cost of halving the max number of CPUs (127 for an 8-bit ticket, and a hard max of 32767 overall). The extra bit (well, two, but one is unused) indicates whether the lock has gone into slowpath state, which means one of its lockers has entered its slowpath and has blocked in the hypervisor. This means the current lock-holder needs to make sure it gets kicked out of the hypervisor on unlock. The spinlock remains in slowpath state until the last unlock happens (ie there are no more queued lockers). This code survives for a while with moderate testing (make -j 100 on 8 VCPUs on a 4 PCPU system), but locks up after about 20 iterations, so there's still some race/deadlock in there (probably something misordered), but I think the basic approach is sound.
Thanks, J Jeremy Fitzhardinge (14): x86/ticketlock: clean up types and accessors x86/ticketlock: convert spin loop to C x86/ticketlock: Use C for __ticket_spin_unlock x86/ticketlock: make large and small ticket versions of spin_lock the same x86/ticketlock: make __ticket_spin_lock common x86/ticketlock: make __ticket_spin_trylock common x86/spinlocks: replace pv spinlocks with pv ticketlocks x86/ticketlock: collapse a layer of functions xen/pvticketlock: Xen implementation for PV ticket locks x86/pvticketlock: use callee-save for lock_spinning x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks x86/ticketlocks: when paravirtualizing ticket locks, increment by 2 x86/ticketlock: add slowpath logic x86/ticketlocks: tidy up __ticket_unlock_kick() arch/x86/Kconfig |3 + arch/x86/include/asm/paravirt.h | 30 +--- arch/x86/include/asm/paravirt_types.h |8 +- arch/x86/include/asm/spinlock.h | 236 +--- arch/x86/include/asm/spinlock_types.h | 32 - arch/x86/kernel/paravirt-spinlocks.c | 52 +-- arch/x86/xen/spinlock.c | 281 + kernel/Kconfig.locks |2 +- 8 files changed, 231 insertions(+), 413 deletions(-) -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
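The "halving the max number of CPUs" arithmetic from the cover letter works out as follows (a trivial sketch; the helper name is invented): stepping tickets by 2 leaves half the ticket values usable, and one of those is held by the current owner.

```c
#include <assert.h>

#define TICKET_LOCK_INC 2   /* tickets step by 2 to free the LSB */

/* Distinct tickets = 2^bits / inc; subtract one for the holder to get
 * the number of cpus that can queue behind it. */
static int max_queued_cpus(int bits, int inc)
{
    return (1 << bits) / inc - 1;
}
```

This reproduces the numbers quoted above: 127 waiters for an 8-bit ticket, and 32767 for the 16-bit case.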
[PATCH 08/14] x86/ticketlock: collapse a layer of functions
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Now that the paravirtualization layer doesn't exist at the spinlock level any more, we can collapse the __ticket_ functions into the arch_ functions. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 35 +-- 1 files changed, 5 insertions(+), 30 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 8e379d3..cfa80b5 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -108,7 +108,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, __t ticket_unlock_kick(lock, next); } -static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) +static __always_inline void arch_spin_lock(struct arch_spinlock *lock) { register struct __raw_tickets inc; @@ -128,7 +128,7 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) out: barrier(); /* make sure nothing creeps before the lock is taken */ } -static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) +static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) { arch_spinlock_t old, new; @@ -142,7 +142,7 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) return cmpxchg(lock-head_tail, old.head_tail, new.head_tail) == old.head_tail; } -static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) +static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { __ticket_t next = lock-tickets.head + 1; __ticket_unlock_release(lock); @@ -150,46 +150,21 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) barrier(); /* prevent reordering into locked region */ } -static inline int __ticket_spin_is_locked(arch_spinlock_t *lock) +static inline int arch_spin_is_locked(arch_spinlock_t *lock) { struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); return !!(tmp.tail ^ tmp.head); } -static inline int 
__ticket_spin_is_contended(arch_spinlock_t *lock) +static inline int arch_spin_is_contended(arch_spinlock_t *lock) { struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); return ((tmp.tail - tmp.head) TICKET_MASK) 1; } - -static inline int arch_spin_is_locked(arch_spinlock_t *lock) -{ - return __ticket_spin_is_locked(lock); -} - -static inline int arch_spin_is_contended(arch_spinlock_t *lock) -{ - return __ticket_spin_is_contended(lock); -} #define arch_spin_is_contended arch_spin_is_contended -static __always_inline void arch_spin_lock(arch_spinlock_t *lock) -{ - __ticket_spin_lock(lock); -} - -static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) -{ - return __ticket_spin_trylock(lock); -} - -static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) -{ - __ticket_spin_unlock(lock); -} - static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags) { -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 07/14] x86/spinlocks: replace pv spinlocks with pv ticketlocks
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Rather than outright replacing the entire spinlock implementation in order to paravirtualize it, keep the ticket lock implementation but add a couple of pvops hooks on the slow path (long spin on lock, unlocking a contended lock). Ticket locks have a number of nice properties, but they also have some surprising behaviours in virtual environments. They enforce a strict FIFO ordering on cpus trying to take a lock; however, if the hypervisor scheduler does not schedule the cpus in the correct order, the system can waste a huge amount of time spinning until the next cpu can take the lock. (See Thomas Friebel's talk Prevent Guests from Spinning Around http://www.xen.org/files/xensummitboston08/LHP.pdf for more details.) To address this, we add two hooks: - __ticket_spin_lock which is called after the cpu has been spinning on the lock for a significant number of iterations but has failed to take the lock (presumably because the cpu holding the lock has been descheduled). The lock_spinning pvop is expected to block the cpu until it has been kicked by the current lock holder. - __ticket_spin_unlock, which, on releasing a contended lock (there are more cpus with tail tickets), looks to see if the next cpu is blocked and wakes it if so. When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub functions causes all the extra code to go away.
Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/paravirt.h | 30 +++--- arch/x86/include/asm/paravirt_types.h |8 + arch/x86/include/asm/spinlock.h | 44 +++-- arch/x86/kernel/paravirt-spinlocks.c | 15 +- arch/x86/xen/spinlock.c |7 - 5 files changed, 50 insertions(+), 54 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 18e3b8a..c864775 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -717,36 +717,14 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) -static inline int arch_spin_is_locked(struct arch_spinlock *lock) +static inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) { - return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock); + PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket); } -static inline int arch_spin_is_contended(struct arch_spinlock *lock) +static inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) { - return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock); -} -#define arch_spin_is_contended arch_spin_is_contended - -static __always_inline void arch_spin_lock(struct arch_spinlock *lock) -{ - PVOP_VCALL1(pv_lock_ops.spin_lock, lock); -} - -static __always_inline void arch_spin_lock_flags(struct arch_spinlock *lock, - unsigned long flags) -{ - PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags); -} - -static __always_inline int arch_spin_trylock(struct arch_spinlock *lock) -{ - return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock); -} - -static __always_inline void arch_spin_unlock(struct arch_spinlock *lock) -{ - PVOP_VCALL1(pv_lock_ops.spin_unlock, lock); + PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } #endif diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index b82bac9..1078474 100644 --- a/arch/x86/include/asm/paravirt_types.h 
+++ b/arch/x86/include/asm/paravirt_types.h @@ -315,12 +315,8 @@ struct pv_mmu_ops { struct arch_spinlock; struct pv_lock_ops { - int (*spin_is_locked)(struct arch_spinlock *lock); - int (*spin_is_contended)(struct arch_spinlock *lock); - void (*spin_lock)(struct arch_spinlock *lock); - void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long flags); - int (*spin_trylock)(struct arch_spinlock *lock); - void (*spin_unlock)(struct arch_spinlock *lock); + void (*lock_spinning)(struct arch_spinlock *lock, unsigned ticket); + void (*unlock_kick)(struct arch_spinlock *lock, unsigned ticket); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 3afb1a7..8e379d3 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -50,6 +50,21 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) } #endif +/* How long a lock should spin before we consider blocking */ +#define SPIN_THRESHOLD (1 11) + +#ifndef CONFIG_PARAVIRT_SPINLOCKS + +static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) +{ +} + +static __always_inline void ticket_unlock_kick
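The two remaining hooks are enough because the generic ticket code handles the common case itself and only drops into pv code after spinning for a while. A rough userspace model of that fast/slow split (illustrative: only SPIN_THRESHOLD matches the series, the rest is a stand-in, and the lock word is not re-read as it would be on a real CPU):

```c
#include <assert.h>

/* How long a lock should spin before we consider blocking */
#define SPIN_THRESHOLD (1 << 11)

static int slowpath_calls;                    /* counts pvop invocations */
static void lock_spinning_hook(void) { slowpath_calls++; }

/* Spin until our ticket comes up or the threshold expires; returns 1 if
 * we had to fall back to the blocking pvop, 0 if the fast path won. */
static int acquire(unsigned head, unsigned want)
{
    for (int count = SPIN_THRESHOLD; count > 0; count--) {
        if (head == want)
            return 0;      /* fast path: no pv involvement at all */
        /* cpu_relax(); head would be re-read from the lock here */
    }
    lock_spinning_hook();  /* stands in for __ticket_lock_spinning() */
    return 1;
}
```

On bare metal the hook is a nop, so the only cost left on the fast path is the threshold-bounded spin itself.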
[PATCH 13/14] x86/ticketlock: add slowpath logic
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Maintain a flag in both LSBs of the ticket lock which indicates whether anyone is in the lock slowpath and may need kicking when the current holder unlocks. The flags are set when the first locker enters the slowpath, and cleared when unlocking to an empty queue. In the specific implementation of lock_spinning(), make sure to set the slowpath flags on the lock just before blocking. We must do this before the last-chance pickup test to prevent a deadlock with the unlocker: UnlockerLocker test for lock pickup - fail test slowpath + unlock - false set slowpath flags block Whereas this works in any ordering: UnlockerLocker set slowpath flags test for lock pickup - fail block test slowpath + unlock - true, kick Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 53 - arch/x86/include/asm/spinlock_types.h |2 + arch/x86/kernel/paravirt-spinlocks.c | 37 +++ arch/x86/xen/spinlock.c |4 ++ 4 files changed, 88 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 9e1c7ce..8d1cb42 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -53,7 +53,38 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) /* How long a lock should spin before we consider blocking */ #define SPIN_THRESHOLD (1 11) -#ifndef CONFIG_PARAVIRT_SPINLOCKS +/* Only defined when CONFIG_PARAVIRT_SPINLOCKS defined, but may as + * well leave the prototype always visible. */ +extern void __ticket_unlock_release_slowpath(struct arch_spinlock *lock); + +#ifdef CONFIG_PARAVIRT_SPINLOCKS + +/* + * Return true if someone is in the slowpath on this lock. This + * should only be used by the current lock-holder. 
+ */ +static inline bool __ticket_in_slowpath(struct arch_spinlock *lock) +{ + return !!(lock-tickets.tail TICKET_SLOWPATH_FLAG); +} + +static inline void __ticket_enter_slowpath(struct arch_spinlock *lock) +{ + if (sizeof(lock-tickets.tail) == sizeof(u8)) + asm (LOCK_PREFIX orb %1, %0 +: +m (lock-tickets.tail) +: i (TICKET_SLOWPATH_FLAG) : memory); + else + asm (LOCK_PREFIX orw %1, %0 +: +m (lock-tickets.tail) +: i (TICKET_SLOWPATH_FLAG) : memory); +} + +#else /* !CONFIG_PARAVIRT_SPINLOCKS */ +static inline bool __ticket_in_slowpath(struct arch_spinlock *lock) +{ + return false; +} static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) { @@ -84,18 +115,22 @@ static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock, u */ static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spinlock *lock) { - register struct __raw_tickets tickets = { .tail = TICKET_LOCK_INC }; + register struct arch_spinlock tickets = { + { .tickets.tail = TICKET_LOCK_INC } + }; if (sizeof(lock-tickets.head) == sizeof(u8)) asm volatile (LOCK_PREFIX xaddw %w0, %1\n - : +r (tickets), +m (lock-tickets) + : +r (tickets.head_tail), +m (lock-tickets) : : memory, cc); else asm volatile (LOCK_PREFIX xaddl %0, %1\n -: +r (tickets), +m (lock-tickets) +: +r (tickets.head_tail), +m (lock-tickets) : : memory, cc); - return tickets; + tickets.tickets.tail = ~TICKET_SLOWPATH_FLAG; + + return tickets.tickets; } /* @@ -144,9 +179,11 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { - __ticket_t next = lock-tickets.head + TICKET_LOCK_INC; - __ticket_unlock_release(lock); - __ticket_unlock_kick(lock, next); + barrier(); /* prevent reordering out of locked region */ + if (unlikely(__ticket_in_slowpath(lock))) + __ticket_unlock_release_slowpath(lock); + else + __ticket_unlock_release(lock); barrier(); /* prevent reordering into locked region */ } 
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 0553c0b..7b383e2 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h
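The unlock-side decision described above — one plain add when nobody is blocked, a kick only when the slowpath flag is set — can be modeled in userspace like this (illustrative only: widths and the flag-clearing policy are simplified relative to the kernel, which clears the flag when unlocking to an empty queue):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the slowpath-flagged ticket lock. */
#define TICKET_LOCK_INC      2   /* tickets step by 2 under paravirt */
#define TICKET_SLOWPATH_FLAG 1   /* LSB of tail: "someone is blocked" */

typedef union {
    uint32_t head_tail;
    struct { uint16_t head, tail; } tickets;
} model_lock_t;

static int in_slowpath(model_lock_t *l)
{
    return l->tickets.tail & TICKET_SLOWPATH_FLAG;
}

/* *kicked reports whether the slow unlock path ran. */
static void model_unlock(model_lock_t *l, int *kicked)
{
    if (in_slowpath(l)) {
        /* slow path: release, then wake the appropriate waiter */
        l->tickets.head += TICKET_LOCK_INC;
        l->tickets.tail &= (uint16_t)~TICKET_SLOWPATH_FLAG; /* simplified */
        *kicked = 1;
    } else {
        /* fast path: basically one add, as on bare metal */
        l->tickets.head += TICKET_LOCK_INC;
        *kicked = 0;
    }
}
```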
[PATCH 05/14] x86/ticketlock: make __ticket_spin_lock common
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Aside from the particular form of the xadd instruction, they're identical. So factor out the xadd and use common code for the rest. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 42 ++ 1 files changed, 20 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 1b81809..f722f96 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -67,13 +67,27 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) * save some instructions and make the code more elegant. There really isn't * much between them in performance though, especially as locks are out of line. */ -#if (NR_CPUS 256) -static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) +static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spinlock *lock) { - register struct __raw_tickets inc = { .tail = 1 }; + register struct __raw_tickets tickets = { .tail = 1 }; + + if (sizeof(lock-tickets.head) == sizeof(u8)) + asm volatile (LOCK_PREFIX xaddw %w0, %1\n + : +r (tickets), +m (lock-tickets) + : : memory, cc); + else + asm volatile (LOCK_PREFIX xaddl %0, %1\n +: +r (tickets), +m (lock-tickets) +: : memory, cc); - asm volatile (LOCK_PREFIX xaddw %w0, %1\n - : +r (inc), +m (lock-tickets) : : memory, cc); + return tickets; +} + +static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) +{ + register struct __raw_tickets inc; + + inc = __ticket_spin_claim(lock); for (;;) { if (inc.head == inc.tail) @@ -84,6 +98,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) out: barrier(); /* make sure nothing creeps before the lock is taken */ } +#if (NR_CPUS 256) static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { unsigned int tmp, new; @@ -103,23 +118,6 @@ static __always_inline int 
__ticket_spin_trylock(arch_spinlock_t *lock) return tmp; } #else -static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) -{ - register struct __raw_tickets inc = { .tail = 1 }; - - asm volatile(LOCK_PREFIX xaddl %0, %1\n\t -: +r (inc), +m (lock-tickets) -: : memory, cc); - - for (;;) { - if (inc.head == inc.tail) - goto out; - cpu_relax(); - inc.head = ACCESS_ONCE(lock-tickets.head); - } -out: barrier(); /* make sure nothing creeps before the lock is taken */ -} - static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { unsigned tmp; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
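The xadd-based claim that this patch factors out can be sketched in portable C, with `atomic_fetch_add` standing in for the `LOCK XADD` instruction (a userspace model, not the kernel code; the 16/16-bit split is one of the two possible layouts):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* low 16 bits: head (now serving), high 16 bits: tail (next ticket) */
typedef struct { _Atomic uint32_t head_tail; } tlock_t;

/* grab the current tail as our ticket and bump it, in one atomic op */
static uint16_t tlock_claim(tlock_t *l)
{
    uint32_t old = atomic_fetch_add(&l->head_tail, 1u << 16);
    return (uint16_t)(old >> 16);
}

static void tlock_lock(tlock_t *l)
{
    uint16_t me = tlock_claim(l);
    while ((uint16_t)atomic_load(&l->head_tail) != me)
        ; /* cpu_relax() in the kernel */
}

static void tlock_unlock(tlock_t *l)
{
    atomic_fetch_add(&l->head_tail, 1);  /* head++ */
}
```

The point of the patch is visible here: once the claim is isolated, the spin loop is byte-size-independent and only the xadd width varies.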
[PATCH 01/14] x86/ticketlock: clean up types and accessors
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com A few cleanups to the way spinlocks are defined and accessed: - define __ticket_t which is the size of a spinlock ticket (ie, enough bits to hold all the cpus) - Define struct arch_spinlock as a union containing plain slock and the head and tail tickets - Use head and tail to implement some of the spinlock predicates. - Make all ticket variables unsigned. - Use TICKET_SHIFT to form constants Most of this will be used in later patches. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 24 ++-- arch/x86/include/asm/spinlock_types.h | 20 ++-- 2 files changed, 28 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 3089f70..d6d5784 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -56,11 +56,9 @@ * much between them in performance though, especially as locks are out of line. */ #if (NR_CPUS 256) -#define TICKET_SHIFT 8 - static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - short inc = 0x0100; + unsigned short inc = 1 TICKET_SHIFT; asm volatile ( LOCK_PREFIX xaddw %w0, %1\n @@ -79,7 +77,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { - int tmp, new; + unsigned int tmp, new; asm volatile(movzwl %2, %0\n\t cmpb %h0,%b0\n\t @@ -104,12 +102,10 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) : memory, cc); } #else -#define TICKET_SHIFT 16 - static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - int inc = 0x0001; - int tmp; + unsigned inc = 1 TICKET_SHIFT; + unsigned tmp; asm volatile(LOCK_PREFIX xaddl %0, %1\n movzwl %w0, %2\n\t @@ -129,8 +125,8 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { - int tmp; - 
int new; + unsigned tmp; + unsigned new; asm volatile(movl %2,%0\n\t movl %0,%1\n\t @@ -160,16 +156,16 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) static inline int __ticket_spin_is_locked(arch_spinlock_t *lock) { - int tmp = ACCESS_ONCE(lock-slock); + struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); - return !!(((tmp TICKET_SHIFT) ^ tmp) ((1 TICKET_SHIFT) - 1)); + return !!(tmp.tail ^ tmp.head); } static inline int __ticket_spin_is_contended(arch_spinlock_t *lock) { - int tmp = ACCESS_ONCE(lock-slock); + struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); - return (((tmp TICKET_SHIFT) - tmp) ((1 TICKET_SHIFT) - 1)) 1; + return ((tmp.tail - tmp.head) TICKET_MASK) 1; } #ifndef CONFIG_PARAVIRT_SPINLOCKS diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index dcb48b2..e3ad1e3 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -5,11 +5,27 @@ # error please don't include this file directly #endif +#include linux/types.h + +#if (CONFIG_NR_CPUS 256) +typedef u8 __ticket_t; +#else +typedef u16 __ticket_t; +#endif + +#define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#define TICKET_MASK((__ticket_t)((1 TICKET_SHIFT) - 1)) + typedef struct arch_spinlock { - unsigned int slock; + union { + unsigned int slock; + struct __raw_tickets { + __ticket_t head, tail; + } tickets; + }; } arch_spinlock_t; -#define __ARCH_SPIN_LOCK_UNLOCKED { 0 } +#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } } typedef struct { unsigned int lock; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
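The union layout introduced by this patch is worth seeing in isolation: the same word is viewable either as the whole `slock` or as separate head/tail tickets, which is what lets the predicates drop their shift-and-mask arithmetic. A userspace sketch (little-endian assumed, as on x86; names are the patch's, widths are the NR_CPUS < 256 case):

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t ticket8_t;   /* __ticket_t when NR_CPUS < 256 */

typedef union {
    uint16_t slock;                            /* whole lock word */
    struct { ticket8_t head, tail; } tickets;  /* per-field view */
} lock_model_t;

/* the patch's __ticket_spin_is_locked(), minus the old bit-twiddling */
static int is_locked(lock_model_t *l)
{
    return l->tickets.tail != l->tickets.head;
}
```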
[PATCH 11/14] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com The code size expands somewhat, and it's probably better to just call a function rather than inline it. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/Kconfig |3 +++ kernel/Kconfig.locks |2 +- 2 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index e832768..a615c9c 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -531,6 +531,9 @@ config PARAVIRT_SPINLOCKS If you are unsure how to answer this question, answer N. +config ARCH_NOINLINE_SPIN_UNLOCK + def_bool PARAVIRT_SPINLOCKS + config PARAVIRT_CLOCK bool diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 88c92fb..3216c22 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_LOCK_IRQSAVE config INLINE_SPIN_UNLOCK - def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) + def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) && !ARCH_NOINLINE_SPIN_UNLOCK config INLINE_SPIN_UNLOCK_BH def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_UNLOCK_BH -- 1.7.2.3
[PATCH 06/14] x86/ticketlock: make __ticket_spin_trylock common
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Make trylock code common regardless of ticket size. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 49 +++-- arch/x86/include/asm/spinlock_types.h |6 +++- 2 files changed, 14 insertions(+), 41 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index f722f96..3afb1a7 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -98,48 +98,19 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) out: barrier(); /* make sure nothing creeps before the lock is taken */ } -#if (NR_CPUS 256) static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { - unsigned int tmp, new; - - asm volatile(movzwl %2, %0\n\t -cmpb %h0,%b0\n\t -leal 0x100(% REG_PTR_MODE 0), %1\n\t -jne 1f\n\t -LOCK_PREFIX cmpxchgw %w1,%2\n\t -1: -sete %b1\n\t -movzbl %b1,%0\n\t -: =a (tmp), =q (new), +m (lock-slock) -: -: memory, cc); - - return tmp; -} -#else -static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) -{ - unsigned tmp; - unsigned new; - - asm volatile(movl %2,%0\n\t -movl %0,%1\n\t -roll $16, %0\n\t -cmpl %0,%1\n\t -leal 0x0001(% REG_PTR_MODE 0), %1\n\t -jne 1f\n\t -LOCK_PREFIX cmpxchgl %1,%2\n\t -1: -sete %b1\n\t -movzbl %b1,%0\n\t -: =a (tmp), =q (new), +m (lock-slock) -: -: memory, cc); - - return tmp; + arch_spinlock_t old, new; + + old.tickets = ACCESS_ONCE(lock-tickets); + if (old.tickets.head != old.tickets.tail) + return 0; + + new.head_tail = old.head_tail + (1 TICKET_SHIFT); + + /* cmpxchg is a full barrier, so nothing can move before it */ + return cmpxchg(lock-head_tail, old.head_tail, new.head_tail) == old.head_tail; } -#endif static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index e3ad1e3..72e154e 100644 --- 
a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -9,8 +9,10 @@ #if (CONFIG_NR_CPUS 256) typedef u8 __ticket_t; +typedef u16 __ticketpair_t; #else typedef u16 __ticket_t; +typedef u32 __ticketpair_t; #endif #define TICKET_SHIFT (sizeof(__ticket_t) * 8) @@ -18,14 +20,14 @@ typedef u16 __ticket_t; typedef struct arch_spinlock { union { - unsigned int slock; + __ticketpair_t head_tail; struct __raw_tickets { __ticket_t head, tail; } tickets; }; } arch_spinlock_t; -#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } } +#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } typedef struct { unsigned int lock; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
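The common trylock above reads the whole word, bails if the lock looks held, and otherwise tries to take the next ticket with one cmpxchg. A userspace model using C11 atomics (illustrative; `atomic_compare_exchange_strong` plays the role of the kernel's `cmpxchg()`):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TICKET_SHIFT 16   /* modeling the u16-ticket layout */

typedef struct { _Atomic uint32_t head_tail; } tl_t;

static int tl_trylock(tl_t *l)
{
    uint32_t old = atomic_load(&l->head_tail);
    uint16_t head = (uint16_t)old;
    uint16_t tail = (uint16_t)(old >> TICKET_SHIFT);

    if (head != tail)
        return 0;          /* already held: don't even attempt the cmpxchg */

    uint32_t next = old + (1u << TICKET_SHIFT);  /* take the next ticket */
    /* cmpxchg is a full barrier, so nothing can move before it */
    return atomic_compare_exchange_strong(&l->head_tail, &old, next);
}
```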
[PATCH 14/14] x86/ticketlocks: tidy up __ticket_unlock_kick()
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com __ticket_unlock_kick() is now only called from known slowpaths, so there's no need for it to do any checking of its own. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/paravirt.h |2 +- arch/x86/include/asm/spinlock.h | 14 -- 2 files changed, 1 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 6f275ca..7755b16 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -722,7 +722,7 @@ static inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned t PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket); } -static inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) +static inline void __ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) { PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 8d1cb42..70675bc 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -90,10 +90,6 @@ static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, u { } -static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) -{ -} - #endif /* CONFIG_PARAVIRT_SPINLOCKS */ /* @@ -133,16 +129,6 @@ static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spin return tickets.tickets; } -/* - * If a spinlock has someone waiting on it, then kick the appropriate - * waiting cpu. 
- */ -static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, __ticket_t next) -{ - if (unlikely(lock-tickets.tail != next)) - ticket_unlock_kick(lock, next); -} - static __always_inline void arch_spin_lock(struct arch_spinlock *lock) { register struct __raw_tickets inc; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 10/14] x86/pvticketlock: use callee-save for lock_spinning
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Although the lock_spinning calls in the spinlock code are on the uncommon path, their presence can cause the compiler to generate many more register save/restores in the function pre/postamble, which is in the fast path. To avoid this, convert it to using the pvops callee-save calling convention, which defers all the save/restores until the actual function is called, keeping the fastpath clean. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/paravirt.h |2 +- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/paravirt-spinlocks.c |2 +- arch/x86/xen/spinlock.c |3 ++- 4 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index c864775..6f275ca 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -719,7 +719,7 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, static inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) { - PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket); + PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket); } static inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 1078474..53f249a 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -315,7 +315,7 @@ struct pv_mmu_ops { struct arch_spinlock; struct pv_lock_ops { - void (*lock_spinning)(struct arch_spinlock *lock, unsigned ticket); + struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, unsigned ticket); }; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index c2e010e..4251c1d 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -9,7 +9,7 @@ struct 
pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP - .lock_spinning = paravirt_nop, + .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), .unlock_kick = paravirt_nop, #endif }; diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 5feb897..c31c5a3 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -137,6 +137,7 @@ out: w-lock = NULL; spin_time_accum_blocked(start); } +PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning); static void xen_unlock_kick(struct arch_spinlock *lock, unsigned next) { @@ -189,7 +190,7 @@ void xen_uninit_lock_cpu(int cpu) void __init xen_init_spinlocks(void) { - pv_lock_ops.lock_spinning = xen_lock_spinning; + pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; } -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 09/14] xen/pvticketlock: Xen implementation for PV ticket locks
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Replace the old Xen implementation of PV spinlocks with and implementation of xen_lock_spinning and xen_unlock_kick. xen_lock_spinning simply registers the cpu in its entry in lock_waiting, adds itself to the waiting_cpus set, and blocks on an event channel until the channel becomes pending. xen_unlock_kick searches the cpus in waiting_cpus looking for the one which next wants this lock with the next ticket, if any. If found, it kicks it by making its event channel pending, which wakes it up. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/xen/spinlock.c | 281 ++- 1 files changed, 36 insertions(+), 245 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 3d9da72..5feb897 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -19,32 +19,21 @@ #ifdef CONFIG_XEN_DEBUG_FS static struct xen_spinlock_stats { - u64 taken; u32 taken_slow; - u32 taken_slow_nested; u32 taken_slow_pickup; u32 taken_slow_spurious; - u32 taken_slow_irqenable; - u64 released; u32 released_slow; u32 released_slow_kicked; #define HISTO_BUCKETS 30 - u32 histo_spin_total[HISTO_BUCKETS+1]; - u32 histo_spin_spinning[HISTO_BUCKETS+1]; u32 histo_spin_blocked[HISTO_BUCKETS+1]; - u64 time_total; - u64 time_spinning; u64 time_blocked; } spinlock_stats; static u8 zero_stats; -static unsigned lock_timeout = 1 10; -#define TIMEOUT lock_timeout - static inline void check_zero(void) { if (unlikely(zero_stats)) { @@ -73,22 +62,6 @@ static void __spin_time_accum(u64 delta, u32 *array) array[HISTO_BUCKETS]++; } -static inline void spin_time_accum_spinning(u64 start) -{ - u32 delta = xen_clocksource_read() - start; - - __spin_time_accum(delta, spinlock_stats.histo_spin_spinning); - spinlock_stats.time_spinning += delta; -} - -static inline void spin_time_accum_total(u64 start) -{ - u32 delta = xen_clocksource_read() - start; - - __spin_time_accum(delta, 
spinlock_stats.histo_spin_total); - spinlock_stats.time_total += delta; -} - static inline void spin_time_accum_blocked(u64 start) { u32 delta = xen_clocksource_read() - start; @@ -105,214 +78,76 @@ static inline u64 spin_time_start(void) return 0; } -static inline void spin_time_accum_total(u64 start) -{ -} -static inline void spin_time_accum_spinning(u64 start) -{ -} static inline void spin_time_accum_blocked(u64 start) { } #endif /* CONFIG_XEN_DEBUG_FS */ -struct xen_spinlock { - unsigned char lock; /* 0 - free; 1 - locked */ - unsigned short spinners;/* count of waiting cpus */ +struct xen_lock_waiting { + struct arch_spinlock *lock; + __ticket_t want; }; static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); +static cpumask_t waiting_cpus; -#if 0 -static int xen_spin_is_locked(struct arch_spinlock *lock) -{ - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - - return xl-lock != 0; -} - -static int xen_spin_is_contended(struct arch_spinlock *lock) -{ - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - - /* Not strictly true; this is only the count of contended - lock-takers entering the slow path. */ - return xl-spinners != 0; -} - -static int xen_spin_trylock(struct arch_spinlock *lock) -{ - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - u8 old = 1; - - asm(xchgb %b0,%1 - : +q (old), +m (xl-lock) : : memory); - - return old == 0; -} - -static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners); - -/* - * Mark a cpu as interested in a lock. Returns the CPU's previous - * lock of interest, in case we got preempted by an interrupt. 
- */ -static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl) -{ - struct xen_spinlock *prev; - - prev = __get_cpu_var(lock_spinners); - __get_cpu_var(lock_spinners) = xl; - - wmb(); /* set lock of interest before count */ - - asm(LOCK_PREFIX incw %0 - : +m (xl-spinners) : : memory); - - return prev; -} - -/* - * Mark a cpu as no longer interested in a lock. Restores previous - * lock of interest (NULL for none). - */ -static inline void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev) -{ - asm(LOCK_PREFIX decw %0 - : +m (xl-spinners) : : memory); - wmb(); /* decrement count before restoring lock */ - __get_cpu_var(lock_spinners) = prev; -} - -static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable) +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) { - struct xen_spinlock *xl
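The registration/kick handshake the commit message describes — each waiting CPU records which lock and which ticket it wants, and the unlocker scans for the CPU wanting the next ticket — can be modeled without Xen at all (illustrative: the event channel is replaced by a plain `kick_pending` flag, and the per-cpu variables by arrays):

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS 4

struct waiting { void *lock; unsigned want; int active; };
static struct waiting lock_waiting[NCPUS];   /* per-cpu in the real code */
static int kick_pending[NCPUS];              /* stands in for the event channel */

/* xen_lock_spinning(): record what we're waiting for, then block */
static void model_lock_spinning(int cpu, void *lock, unsigned want)
{
    lock_waiting[cpu] = (struct waiting){ lock, want, 1 };
    /* real code: clear the event channel, re-check the lock
     * (last-chance pickup), then block until kicked */
}

/* xen_unlock_kick(): find the cpu waiting for ticket `next`, if any */
static int model_unlock_kick(void *lock, unsigned next)
{
    for (int cpu = 0; cpu < NCPUS; cpu++) {
        struct waiting *w = &lock_waiting[cpu];
        if (w->active && w->lock == lock && w->want == next) {
            kick_pending[cpu] = 1;   /* make its event channel pending */
            return cpu;
        }
    }
    return -1;                       /* nobody blocked on this ticket */
}
```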
[PATCH 12/14] x86/ticketlocks: when paravirtualizing ticket locks, increment by 2
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Increment ticket head/tails by 2 rather than 1 to leave the LSB free to store a is in slowpath state bit. This halves the number of possible CPUs for a given ticket size, but this shouldn't matter in practice - kernels built for 32k+ CPU systems are probably specially built for the hardware rather than a generic distro kernel. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 18 +- arch/x86/include/asm/spinlock_types.h | 10 +- 2 files changed, 18 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index cfa80b5..9e1c7ce 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -36,17 +36,17 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) { if (sizeof(lock-tickets.head) == sizeof(u8)) - asm (LOCK_PREFIX incb %0 -: +m (lock-tickets.head) : : memory); + asm (LOCK_PREFIX addb %1, %0 +: +m (lock-tickets.head) : i (TICKET_LOCK_INC) : memory); else - asm (LOCK_PREFIX incw %0 -: +m (lock-tickets.head) : : memory); + asm (LOCK_PREFIX addw %1, %0 +: +m (lock-tickets.head) : i (TICKET_LOCK_INC) : memory); } #else static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) { - lock-tickets.head++; + lock-tickets.head += TICKET_LOCK_INC; } #endif @@ -84,7 +84,7 @@ static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock, u */ static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spinlock *lock) { - register struct __raw_tickets tickets = { .tail = 1 }; + register struct __raw_tickets tickets = { .tail = TICKET_LOCK_INC }; if (sizeof(lock-tickets.head) == sizeof(u8)) asm volatile (LOCK_PREFIX xaddw %w0, %1\n @@ -136,7 +136,7 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) if (old.tickets.head != old.tickets.tail) return 0; - new.head_tail = old.head_tail + (1 
TICKET_SHIFT); + new.head_tail = old.head_tail + (TICKET_LOCK_INC TICKET_SHIFT); /* cmpxchg is a full barrier, so nothing can move before it */ return cmpxchg(lock-head_tail, old.head_tail, new.head_tail) == old.head_tail; @@ -144,7 +144,7 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { - __ticket_t next = lock-tickets.head + 1; + __ticket_t next = lock-tickets.head + TICKET_LOCK_INC; __ticket_unlock_release(lock); __ticket_unlock_kick(lock, next); barrier(); /* prevent reordering into locked region */ @@ -161,7 +161,7 @@ static inline int arch_spin_is_contended(arch_spinlock_t *lock) { struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); - return ((tmp.tail - tmp.head) TICKET_MASK) 1; + return ((tmp.tail - tmp.head) TICKET_MASK) TICKET_LOCK_INC; } #define arch_spin_is_contended arch_spin_is_contended diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 72e154e..0553c0b 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -7,7 +7,13 @@ #include linux/types.h -#if (CONFIG_NR_CPUS 256) +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define __TICKET_LOCK_INC 2 +#else +#define __TICKET_LOCK_INC 1 +#endif + +#if (CONFIG_NR_CPUS (256 / __TICKET_LOCK_INC)) typedef u8 __ticket_t; typedef u16 __ticketpair_t; #else @@ -15,6 +21,8 @@ typedef u16 __ticket_t; typedef u32 __ticketpair_t; #endif +#define TICKET_LOCK_INC((__ticket_t)__TICKET_LOCK_INC) + #define TICKET_SHIFT (sizeof(__ticket_t) * 8) #define TICKET_MASK((__ticket_t)((1 TICKET_SHIFT) - 1)) -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
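The arithmetic behind the increment-by-2 scheme is simple enough to check directly: ticket values stay even, so bit 0 never carries ticket information and can hold the slowpath flag without ambiguity (constants as in the patch; helper names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define TICKET_LOCK_INC      2
#define TICKET_SLOWPATH_FLAG 1

static uint8_t next_ticket(uint8_t tail)
{
    return tail + TICKET_LOCK_INC;          /* always lands on an even value */
}

static uint8_t ticket_of(uint8_t tail)
{
    return tail & (uint8_t)~TICKET_SLOWPATH_FLAG;  /* flag strips off cleanly */
}
```

This is also why the u8/u16 cutover moves from 256 CPUs to 256 / `__TICKET_LOCK_INC`: half the ticket space is spent on the flag bit.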
Re: [PATCH -next] xen: fix header export to userspace
On 11/13/2010 08:44 AM, Randy Dunlap wrote: From: Randy Dunlap randy.dun...@oracle.com scripts/headers_install.pl prevents __user from being exported to userspace headers, so just use compiler.h to make sure that __user is defined and avoid the error. unifdef: linux-next-20101112/xx64/usr/include/xen/privcmd.h.tmp: 79: Premature EOF (#if line 33 depth 1) Ah, OK, thanks. I was wondering what the proper fix for this was. I'll stick this in my tree. Thanks, J Signed-off-by: Randy Dunlap randy.dun...@oracle.com Cc: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com Cc: xen-de...@lists.xensource.com (moderated for non-subscribers) Cc: virtualizat...@lists.osdl.org Cc: Tony Finch d...@dotat.at --- include/xen/privcmd.h | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) --- linux-next-20101112.orig/include/xen/privcmd.h +++ linux-next-20101112/include/xen/privcmd.h @@ -34,13 +34,10 @@ #define __LINUX_PUBLIC_PRIVCMD_H__ #include <linux/types.h> +#include <linux/compiler.h> typedef unsigned long xen_pfn_t; -#ifndef __user -#define __user -#endif - struct privcmd_hypercall { __u64 op; __u64 arg[5];
Re: [PATCH 01/20] x86/ticketlock: clean up types and accessors
On 11/13/2010 01:57 AM, Américo Wang wrote: On Wed, Nov 03, 2010 at 10:59:42AM -0400, Jeremy Fitzhardinge wrote: ... static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { -int inc = 0x0001; -int tmp; +unsigned inc = 1 << TICKET_SHIFT; +unsigned tmp; Please don't use old implicit-int. I don't mind much either way, but I don't think I've seen anyone worry about this before. -return (((tmp >> TICKET_SHIFT) - tmp) & ((1 << TICKET_SHIFT) - 1)) > 1; +return ((tmp.tail - tmp.head) & TICKET_MASK) > 1; There is a type promotion here. ... } #ifndef CONFIG_PARAVIRT_SPINLOCKS diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index dcb48b2..4582640 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -5,11 +5,27 @@ # error "please don't include this file directly" #endif +#include <linux/types.h> + +#if (CONFIG_NR_CPUS < 256) +typedef u8 __ticket_t; +#else +typedef u16 __ticket_t; +#endif + +#define TICKET_SHIFT (sizeof(__ticket_t) * 8) +#define TICKET_MASK ((1 << TICKET_SHIFT) - 1) So here you may need to cast the result to __ticket_t. OK. J
Re: [PATCH 03/20] x86/ticketlock: Use C for __ticket_spin_unlock
On 11/13/2010 02:05 AM, Américo Wang wrote: On Wed, Nov 03, 2010 at 10:59:44AM -0400, Jeremy Fitzhardinge wrote: * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock * (PPro errata 66, 92) */ -# define UNLOCK_LOCK_PREFIX LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ +if (sizeof(lock->tickets.head) == sizeof(u8)) +asm (LOCK_PREFIX "incb %0" + : "+m" (lock->tickets.head) : : "memory"); +else +asm (LOCK_PREFIX "incw %0" + : "+m" (lock->tickets.head) : : "memory"); This 'if/else' really should be done with #ifdef, even though the compiler may be smart enough to remove it. No, we depend on if/else with constant arguments doing the right thing all over the kernel. It is always preferable to use it instead of #ifdef where possible, so that the two branches of code are always subjected to compiler checking, even if they're not being used. + +} #else -# define UNLOCK_LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ +barrier(); +lock->tickets.head++; +barrier(); The second barrier() is not needed. Agreed. It gets removed in a later patch. J
Re: [PATCH 06/20] x86/ticketlock: make __ticket_spin_trylock common
On 11/13/2010 02:48 AM, Eric Dumazet wrote: On Saturday, November 13, 2010 at 18:17 +0800, Américo Wang wrote: On Wed, Nov 03, 2010 at 10:59:47AM -0400, Jeremy Fitzhardinge wrote: + union { + struct __raw_tickets tickets; + __ticketpair_t slock; + } tmp, new; + int ret; + + tmp.tickets = ACCESS_ONCE(lock->tickets); + if (tmp.tickets.head != tmp.tickets.tail) + return 0; + + new.slock = tmp.slock + (1 << TICKET_SHIFT); + + ret = cmpxchg(&lock->ticketpair, tmp.slock, new.slock) == tmp.slock; + barrier(); /* just make nothing creeps before lock is claimed */ This one should be smp_wmb(), right? Not protected by CONFIG_X86_OOSTORE. cmpxchg() is a full memory barrier, no need for smp_wmb() or barrier(). Agreed. J
Re: [PATCH 00/20] x86: ticket lock rewrite and paravirtualization
On 11/12/2010 02:20 PM, H. Peter Anvin wrote: On 11/12/2010 02:17 PM, Jeremy Fitzhardinge wrote: On 11/12/2010 02:12 PM, H. Peter Anvin wrote: On 11/03/2010 07:59 AM, Jeremy Fitzhardinge wrote: - with an unmodified struct spinlock, it can check to see if head == tail after unlock; if not, then there's someone else trying to lock, and we can do a kick. Unfortunately this generates a very high level of redundant kicks, because the waiting CPU might not have blocked yet (which is the common case) How high is "very high" here -- most of the time (so that any mitigation on the slow path is useless)? I'll need to remeasure, but I think around 90% of the slowpath entries were spurious without this. In other words, when spinlocks do contend, most of the time it isn't very serious and the other cpu doesn't spend much time spinning. 90% of the slowpath entries is one thing, my real question is the fraction of fastpath entries that get diverted to the slowpath. It affects where mitigation needs to happen. There are two questions: how many unlock events *must* go into the slowpath for correctness reasons (ie, because the corresponding lock also went slowpath and got blocked there), and how many end up going into the slowpath due to imperfect heuristics? The number of lock events which go slowpath is very dependent on the workload of the kernel in question and of the machine overall. On a system with no CPU overcommit it should be zero (assuming that in the native case no Linux spinlock remains contended for so long that it will trigger the slowpath). On a very overcommitted system, it comes down to what the likelihood is that a VCPU will get preempted while running in a critical region: Tcrit * Phz, where Tcrit is the critical section time in seconds and Phz is the preemption rate of the VCPU scheduler in Hz.
So, for example, a lock with a 10µs critical section and a 100Hz preemption rate will have a 0.1% chance of getting preempted and possibly causing the other lockers to enter the slow path. On the unlock side, it needs to test whether the lock has any waiters in a slowpath state. A conservative test is whether there are any outstanding tickets, but in my measurements 90% of CPUs which spun on a lock ended up getting it without having to take the slowpath. This led me to investigate more precise tests, which is currently a count of slowpath-entering CPUs waiting on the lock. Another approach I discussed with PeterZ and Mathieu is to steal the LSB of the ticket counters (halving the max CPU count) to use as a "there is someone in slowpath waiting on this lock" flag. But I haven't spent the time to work out an algorithm to maintain that flag (or flags, since there are bits available) in a correct and efficient way. J
Re: [PATCH 00/20] x86: ticket lock rewrite and paravirtualization
On 11/15/2010 12:14 PM, Peter Zijlstra wrote: On Mon, 2010-11-15 at 12:03 -0800, H. Peter Anvin wrote: On 11/15/2010 12:00 PM, Jeremy Fitzhardinge wrote: Another approach I discussed with PeterZ and Mathieu is to steal the LSB of the ticket counters (halving the max CPU count) to use as a "there is someone in slowpath waiting on this lock" flag. But I haven't spent the time to work out an algorithm to maintain that flag (or flags, since there are bits available) in a correct and efficient way. Definitely worth pondering. Right, so the idea was to make the ticket increment 2, which would leave the LSB of both the head and tail available. I think that if one were to set both (using a cmpxchg), the ticket fast-path wouldn't need any changes since head == tail is still the correct condition for acquisition. Then the unlock needs an added conditional: if (tail & 1) unlock_slowpath() The tricky part is knowing how to clear the bit(s) on the last person dropping out of the slow path, and making that race-free with respect to new lockers entering the slow path. I guess you could leave it in slowpath state until you're the last unlocker (ie, you're unlocking into uncontended state), whereupon you also clear the bits; I guess that would probably need a cmpxchg to make it safe WRT new lockers entering slowpath. As a heuristic, it shouldn't be too bad performance-wise, since (handwaving) if ticketholder N has entered the slowpath, then it's likely that N+1 will as well. J
Re: [PATCH 04/20] x86/ticketlock: make large and small ticket versions of spin_lock the same
On 11/12/2010 04:19 AM, Srivatsa Vaddagiri wrote: On Wed, Nov 03, 2010 at 10:59:45AM -0400, Jeremy Fitzhardinge wrote: Make the bulk of __ticket_spin_lock look identical for large and small number of cpus. [snip] #if (NR_CPUS < 256) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { -register union { -struct __raw_tickets tickets; -unsigned short slock; -} inc = { .slock = 1 << TICKET_SHIFT }; +register struct __raw_tickets inc = { .tail = 1 }; [snip] #else static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { -unsigned inc = 1 << TICKET_SHIFT; -__ticket_t tmp; +register struct __raw_tickets inc = { .tickets.tail = 1 }; s/.tickets//? Otherwise I get a compile error for NR_CPUS > 256, with just 4 patches applied. Yeah, likely. That's precisely why I wanted to make them the same ;). J
Re: [PATCH 00/20] x86: ticket lock rewrite and paravirtualization
On 11/12/2010 02:12 PM, H. Peter Anvin wrote: On 11/03/2010 07:59 AM, Jeremy Fitzhardinge wrote: - with an unmodified struct spinlock, it can check to see if head == tail after unlock; if not, then there's someone else trying to lock, and we can do a kick. Unfortunately this generates a very high level of redundant kicks, because the waiting CPU might not have blocked yet (which is the common case) How high is "very high" here -- most of the time (so that any mitigation on the slow path is useless)? I'll need to remeasure, but I think around 90% of the slowpath entries were spurious without this. In other words, when spinlocks do contend, most of the time it isn't very serious and the other cpu doesn't spend much time spinning. J
Re: [PATCH 3/3] VIDEO: xen-fb, switch to for_each_console
On 11/03/2010 11:28 AM, Jiri Slaby wrote: Use newly added for_each_console for iterating consoles. Signed-off-by: Jiri Slaby jsl...@suse.cz Cc: Jeremy Fitzhardinge jer...@xensource.com Sure, if that's what all the kids are doing these days. Acked-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com J Cc: Chris Wright chr...@sous-sol.org Cc: virtualizat...@lists.osdl.org Cc: xen-de...@lists.xensource.com Cc: linux-fb...@vger.kernel.org --- drivers/video/xen-fbfront.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/video/xen-fbfront.c b/drivers/video/xen-fbfront.c index 428d273..4abb0b9 100644 --- a/drivers/video/xen-fbfront.c +++ b/drivers/video/xen-fbfront.c @@ -492,7 +492,7 @@ xenfb_make_preferred_console(void) return; acquire_console_sem(); - for (c = console_drivers; c; c = c->next) { + for_each_console(c) { if (!strcmp(c->name, "tty") && c->index == 0) break; }
[PATCH 04/20] x86/ticketlock: make large and small ticket versions of spin_lock the same
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Make the bulk of __ticket_spin_lock look identical for large and small number of cpus. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 23 --- 1 files changed, 8 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 082990a..7586d7a 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -72,19 +72,16 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) #if (NR_CPUS 256) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - register union { - struct __raw_tickets tickets; - unsigned short slock; - } inc = { .slock = 1 TICKET_SHIFT }; + register struct __raw_tickets inc = { .tail = 1 }; asm volatile (LOCK_PREFIX xaddw %w0, %1\n - : +Q (inc), +m (lock-slock) : : memory, cc); + : +r (inc), +m (lock-tickets) : : memory, cc); for (;;) { - if (inc.tickets.head == inc.tickets.tail) + if (inc.head == inc.tail) return; cpu_relax(); - inc.tickets.head = ACCESS_ONCE(lock-tickets.head); + inc.head = ACCESS_ONCE(lock-tickets.head); } barrier(); /* make sure nothing creeps before the lock is taken */ } @@ -110,21 +107,17 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) #else static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - unsigned inc = 1 TICKET_SHIFT; - __ticket_t tmp; + register struct __raw_tickets inc = { .tickets.tail = 1 }; asm volatile(LOCK_PREFIX xaddl %0, %1\n\t -: +r (inc), +m (lock-slock) +: +r (inc), +m (lock-tickets) : : memory, cc); - tmp = inc; - inc = TICKET_SHIFT; - for (;;) { - if ((__ticket_t)inc == tmp) + if (inc.head == inc.tail) return; cpu_relax(); - tmp = ACCESS_ONCE(lock-tickets.head); + inc.head = ACCESS_ONCE(lock-tickets.head); } barrier(); /* make sure nothing creeps before the lock is taken */ } -- 1.7.2.3 ___ Virtualization mailing list 
Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 03/20] x86/ticketlock: Use C for __ticket_spin_unlock
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com If we don't need to use a locked inc for unlock, then implement it in C. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 33 ++--- 1 files changed, 18 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 6711d36..082990a 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -33,9 +33,23 @@ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock * (PPro errata 66, 92) */ -# define UNLOCK_LOCK_PREFIX LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ + if (sizeof(lock-tickets.head) == sizeof(u8)) + asm (LOCK_PREFIX incb %0 +: +m (lock-tickets.head) : : memory); + else + asm (LOCK_PREFIX incw %0 +: +m (lock-tickets.head) : : memory); + +} #else -# define UNLOCK_LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ + barrier(); + lock-tickets.head++; + barrier(); +} #endif /* @@ -93,14 +107,6 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) return tmp; } - -static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) -{ - asm volatile(UNLOCK_LOCK_PREFIX incb %0 -: +m (lock-slock) -: -: memory, cc); -} #else static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { @@ -144,15 +150,12 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) return tmp; } +#endif static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) { - asm volatile(UNLOCK_LOCK_PREFIX incw %0 -: +m (lock-slock) -: -: memory, cc); + __ticket_unlock_release(lock); } -#endif static inline int __ticket_spin_is_locked(arch_spinlock_t *lock) { -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 08/20] x86/ticketlock: collapse a layer of functions
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Now that the paravirtualization layer doesn't exist at the spinlock level any more, we can collapse the __ticket_ functions into the arch_ functions. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 35 +-- 1 files changed, 5 insertions(+), 30 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 06815f3..a79dfee 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -110,7 +110,7 @@ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, __t ticket_unlock_kick(lock, next); } -static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) +static __always_inline void arch_spin_lock(struct arch_spinlock *lock) { register struct __raw_tickets inc; @@ -130,7 +130,7 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) barrier(); /* make sure nothing creeps before the lock is taken */ } -static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) +static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) { union { struct __raw_tickets tickets; @@ -150,53 +150,28 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) return ret; } -static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) +static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { __ticket_t next = lock-tickets.head + 1; __ticket_unlock_release(lock); __ticket_unlock_kick(lock, next); } -static inline int __ticket_spin_is_locked(arch_spinlock_t *lock) +static inline int arch_spin_is_locked(arch_spinlock_t *lock) { struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); return !!(tmp.tail ^ tmp.head); } -static inline int __ticket_spin_is_contended(arch_spinlock_t *lock) +static inline int arch_spin_is_contended(arch_spinlock_t *lock) { struct __raw_tickets tmp = ACCESS_ONCE(lock-tickets); return 
((tmp.tail - tmp.head) TICKET_MASK) 1; } - -static inline int arch_spin_is_locked(arch_spinlock_t *lock) -{ - return __ticket_spin_is_locked(lock); -} - -static inline int arch_spin_is_contended(arch_spinlock_t *lock) -{ - return __ticket_spin_is_contended(lock); -} #define arch_spin_is_contended arch_spin_is_contended -static __always_inline void arch_spin_lock(arch_spinlock_t *lock) -{ - __ticket_spin_lock(lock); -} - -static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) -{ - return __ticket_spin_trylock(lock); -} - -static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) -{ - __ticket_spin_unlock(lock); -} - static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags) { -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 06/20] x86/ticketlock: make __ticket_spin_trylock common
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Make trylock code common regardless of ticket size. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 55 +--- arch/x86/include/asm/spinlock_types.h |3 ++ 2 files changed, 19 insertions(+), 39 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 4f9fa24..507f83c 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -100,48 +100,25 @@ static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) barrier(); /* make sure nothing creeps before the lock is taken */ } -#if (NR_CPUS 256) static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { - unsigned int tmp, new; - - asm volatile(movzwl %2, %0\n\t -cmpb %h0,%b0\n\t -leal 0x100(% REG_PTR_MODE 0), %1\n\t -jne 1f\n\t -LOCK_PREFIX cmpxchgw %w1,%2\n\t -1: -sete %b1\n\t -movzbl %b1,%0\n\t -: =a (tmp), =q (new), +m (lock-slock) -: -: memory, cc); - - return tmp; -} -#else -static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) -{ - unsigned tmp; - unsigned new; - - asm volatile(movl %2,%0\n\t -movl %0,%1\n\t -roll $16, %0\n\t -cmpl %0,%1\n\t -leal 0x0001(% REG_PTR_MODE 0), %1\n\t -jne 1f\n\t -LOCK_PREFIX cmpxchgl %1,%2\n\t -1: -sete %b1\n\t -movzbl %b1,%0\n\t -: =a (tmp), =q (new), +m (lock-slock) -: -: memory, cc); - - return tmp; + union { + struct __raw_tickets tickets; + __ticketpair_t slock; + } tmp, new; + int ret; + + tmp.tickets = ACCESS_ONCE(lock-tickets); + if (tmp.tickets.head != tmp.tickets.tail) + return 0; + + new.slock = tmp.slock + (1 TICKET_SHIFT); + + ret = cmpxchg(lock-ticketpair, tmp.slock, new.slock) == tmp.slock; + barrier(); /* just make nothing creeps before lock is claimed */ + + return ret; } -#endif static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) { diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 
4582640..48dafc3 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -9,8 +9,10 @@ #if (CONFIG_NR_CPUS 256) typedef u8 __ticket_t; +typedef u16 __ticketpair_t; #else typedef u16 __ticket_t; +typedef u32 __ticketpair_t; #endif #define TICKET_SHIFT (sizeof(__ticket_t) * 8) @@ -19,6 +21,7 @@ typedef u16 __ticket_t; typedef struct arch_spinlock { union { unsigned int slock; + __ticketpair_t ticketpair; struct __raw_tickets { __ticket_t head, tail; } tickets; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 11/20] x86/pvticketlock: use callee-save for lock_spinning
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Although the lock_spinning calls in the spinlock code are on the uncommon path, their presence can cause the compiler to generate many more register save/restores in the function pre/postamble, which is in the fast path. To avoid this, convert it to using the pvops callee-save calling convention, which defers all the save/restores until the actual function is called, keeping the fastpath clean. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/paravirt.h |2 +- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/paravirt-spinlocks.c |2 +- arch/x86/xen/spinlock.c |3 ++- 4 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 79c77e5..dbd5914 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -724,7 +724,7 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, static inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) { - PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket); + PVOP_VCALLEE2(pv_lock_ops.lock_spinning, lock, ticket); } static inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 968ee31..9ad6d088 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -316,7 +316,7 @@ struct pv_mmu_ops { struct arch_spinlock; struct pv_lock_ops { - void (*lock_spinning)(struct arch_spinlock *lock, unsigned ticket); + struct paravirt_callee_save lock_spinning; void (*unlock_kick)(struct arch_spinlock *lock, unsigned ticket); }; diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index c2e010e..4251c1d 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -9,7 +9,7 @@ 
struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP - .lock_spinning = paravirt_nop, + .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), .unlock_kick = paravirt_nop, #endif }; diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index b8f09d5..2bd51d8 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -141,6 +141,7 @@ out: w-lock = NULL; spin_time_accum_blocked(start); } +PV_CALLEE_SAVE_REGS_THUNK(xen_lock_spinning); static void xen_unlock_kick(struct arch_spinlock *lock, unsigned next) { @@ -193,7 +194,7 @@ void xen_uninit_lock_cpu(int cpu) void __init xen_init_spinlocks(void) { - pv_lock_ops.lock_spinning = xen_lock_spinning; + pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); pv_lock_ops.unlock_kick = xen_unlock_kick; } -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 00/20] x86: ticket lock rewrite and paravirtualization
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Hi all, This series does two major things: 1. It converts the bulk of the implementation to C, and makes the small ticket and large ticket code common. Only the actual size-dependent asm instructions are specific to the ticket size. The resulting generated asm is very similar to the current hand-written code. This results in a very large reduction in lines of code. 2. Get rid of pv spinlocks, and replace them with pv ticket locks. Currently we have the notion of pv spinlocks where a pv-ops backend can completely replace the spinlock implementation with a new one. This has two disadvantages: - it's a completely separate spinlock implementation, and - there's an extra layer of indirection in front of every spinlock operation. To replace this, this series introduces the notion of pv ticket locks. In this design, the ticket lock fastpath is the standard ticketlock algorithm. However, after an iteration threshold it falls into a slow path which invokes a pv-op to block the spinning CPU. Conversely, on unlock it does the normal unlock, and then checks to see if it needs to do a special kick to wake the next CPU. The net result is that the pv-op calls are restricted to the slow paths, and the normal fast-paths are largely unaffected. There are still some overheads, however: - When locking, there are some extra tests to count the spin iterations. There are no extra instructions in the uncontended case though. - When unlocking, there are two ways to detect when it is necessary to kick a blocked CPU: - with an unmodified struct spinlock, it can check to see if head == tail after unlock; if not, then there's someone else trying to lock, and we can do a kick. Unfortunately this generates a very high level of redundant kicks, because the waiting CPU might not have blocked yet (which is the common case) - With a struct spinlock modified to include a waiters field, to keep count of blocked CPUs, which is a much tighter test.
But it does increase the size of each spinlock by 50% (doubled with padding). The series is very fine-grained, and I've left a lightly cleaned up history of the various options I evaluated, since they're not all obvious. Jeremy Fitzhardinge (20): x86/ticketlock: clean up types and accessors x86/ticketlock: convert spin loop to C x86/ticketlock: Use C for __ticket_spin_unlock x86/ticketlock: make large and small ticket versions of spin_lock the same x86/ticketlock: make __ticket_spin_lock common x86/ticketlock: make __ticket_spin_trylock common x86/spinlocks: replace pv spinlocks with pv ticketlocks x86/ticketlock: collapse a layer of functions xen/pvticketlock: Xen implementation for PV ticket locks x86/pvticketlock: keep count of blocked cpus x86/pvticketlock: use callee-save for lock_spinning x86/pvticketlock: use callee-save for unlock_kick as well x86/pvticketlock: make sure unlock is seen by everyone before checking waiters x86/ticketlock: loosen ordering restraints on unlock x86/ticketlock: prevent compiler reordering into locked region x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks x86/ticketlock: clarify barrier in arch_spin_lock x86/ticketlock: remove .slock x86/ticketlocks: use overlapping read to eliminate mb() x86/ticketlock: rename ticketpair to head_tail arch/x86/Kconfig |3 + arch/x86/include/asm/paravirt.h | 30 +--- arch/x86/include/asm/paravirt_types.h |8 +- arch/x86/include/asm/spinlock.h | 250 +++-- arch/x86/include/asm/spinlock_types.h | 41 +- arch/x86/kernel/paravirt-spinlocks.c | 15 +-- arch/x86/xen/spinlock.c | 282 + kernel/Kconfig.locks |2 +- 8 files changed, 221 insertions(+), 410 deletions(-) -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 12/20] x86/pvticketlock: use callee-save for unlock_kick as well
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com The unlock code is typically inlined throughout the kernel, so its useful to make sure there's minimal register pressure overhead from the presence of the unlock_tick pvop call. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/paravirt.h |2 +- arch/x86/include/asm/paravirt_types.h |2 +- arch/x86/kernel/paravirt-spinlocks.c |2 +- arch/x86/xen/spinlock.c |3 ++- 4 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index dbd5914..b6ff89a 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -729,7 +729,7 @@ static inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned t static inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) { - PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); + PVOP_VCALLEE2(pv_lock_ops.unlock_kick, lock, ticket); } #endif diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 9ad6d088..bb02831 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -317,7 +317,7 @@ struct pv_mmu_ops { struct arch_spinlock; struct pv_lock_ops { struct paravirt_callee_save lock_spinning; - void (*unlock_kick)(struct arch_spinlock *lock, unsigned ticket); + struct paravirt_callee_save unlock_kick; }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 4251c1d..efad219 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -10,7 +10,7 @@ struct pv_lock_ops pv_lock_ops = { #ifdef CONFIG_SMP .lock_spinning = __PV_IS_CALLEE_SAVE(paravirt_nop), - .unlock_kick = paravirt_nop, + .unlock_kick = __PV_IS_CALLEE_SAVE(paravirt_nop), #endif }; EXPORT_SYMBOL(pv_lock_ops); diff --git 
a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 2bd51d8..bf2f409 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -159,6 +159,7 @@ static void xen_unlock_kick(struct arch_spinlock *lock, unsigned next) } } } +PV_CALLEE_SAVE_REGS_THUNK(xen_unlock_kick); static irqreturn_t dummy_handler(int irq, void *dev_id) { @@ -195,7 +196,7 @@ void xen_uninit_lock_cpu(int cpu) void __init xen_init_spinlocks(void) { pv_lock_ops.lock_spinning = PV_CALLEE_SAVE(xen_lock_spinning); - pv_lock_ops.unlock_kick = xen_unlock_kick; + pv_lock_ops.unlock_kick = PV_CALLEE_SAVE(xen_unlock_kick); } #ifdef CONFIG_XEN_DEBUG_FS -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 17/20] x86/ticketlock: clarify barrier in arch_spin_lock
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Make sure the barrier in arch_spin_lock is definitely in the code path. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 158b330..c7455e1 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -151,13 +151,13 @@ static __always_inline void arch_spin_lock(struct arch_spinlock *lock) do { if (inc.head == inc.tail) - return; + goto out; cpu_relax(); inc.head = ACCESS_ONCE(lock->tickets.head); } while (--count); __ticket_lock_spinning(lock, inc.tail); } - barrier(); /* make sure nothing creeps before the lock is taken */ +out: barrier(); /* make sure nothing creeps before the lock is taken */ } static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) -- 1.7.2.3
[PATCH 16/20] x86/ticketlock: don't inline _spin_unlock when using paravirt spinlocks
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com The code size expands somewhat, and it's probably better to just call a function rather than inline it. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/Kconfig | 3 +++ kernel/Kconfig.locks | 2 +- 2 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index cea0cd9..e67bf49 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -577,6 +577,9 @@ config PARAVIRT_SPINLOCKS If you are unsure how to answer this question, answer N. +config ARCH_NOINLINE_SPIN_UNLOCK + def_bool PARAVIRT_SPINLOCKS + config PARAVIRT_CLOCK bool diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks index 88c92fb..3216c22 100644 --- a/kernel/Kconfig.locks +++ b/kernel/Kconfig.locks @@ -125,7 +125,7 @@ config INLINE_SPIN_LOCK_IRQSAVE ARCH_INLINE_SPIN_LOCK_IRQSAVE config INLINE_SPIN_UNLOCK - def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) + def_bool !DEBUG_SPINLOCK && (!PREEMPT || ARCH_INLINE_SPIN_UNLOCK) && !ARCH_NOINLINE_SPIN_UNLOCK config INLINE_SPIN_UNLOCK_BH def_bool !DEBUG_SPINLOCK && ARCH_INLINE_SPIN_UNLOCK_BH -- 1.7.2.3
[PATCH 15/20] x86/ticketlock: prevent compiler reordering into locked region
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Add a barrier() at the end of __raw_spin_unlock() to prevent instructions from after the locked region from being reordered into it. In theory doing so shouldn't cause any problems, but in practice, the system locks up under load... Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index d6de5e7..158b330 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -189,6 +189,8 @@ static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) next = lock->tickets.head + 1; __ticket_unlock_release(lock); __ticket_unlock_kick(lock, next); + + barrier(); /* prevent reordering into locked region */ } static inline int arch_spin_is_locked(arch_spinlock_t *lock) -- 1.7.2.3
[PATCH 07/20] x86/spinlocks: replace pv spinlocks with pv ticketlocks
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Rather than outright replacing the entire spinlock implementation in order to paravirtualize it, keep the ticket lock implementation but add a couple of pvops hooks on the slow path (long spin on lock, unlocking a contended lock). Ticket locks have a number of nice properties, but they also have some surprising behaviours in virtual environments. They enforce a strict FIFO ordering on cpus trying to take a lock; however, if the hypervisor scheduler does not schedule the cpus in the correct order, the system can waste a huge amount of time spinning until the next cpu can take the lock. (See Thomas Friebel's talk "Prevent Guests from Spinning Around", http://www.xen.org/files/xensummitboston08/LHP.pdf, for more details.) To address this, we add two hooks: - __ticket_spin_lock, which is called after the cpu has been spinning on the lock for a significant number of iterations but has failed to take the lock (presumably because the cpu holding the lock has been descheduled). The lock_spinning pvop is expected to block the cpu until it has been kicked by the current lock holder. - __ticket_spin_unlock, which, on releasing a contended lock (there are more cpus waiting with tail tickets), looks to see if the next cpu is blocked and wakes it if so. When compiled with CONFIG_PARAVIRT_SPINLOCKS disabled, a set of stub functions causes all the extra code to go away.
Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/paravirt.h | 30 +++--- arch/x86/include/asm/paravirt_types.h |8 + arch/x86/include/asm/spinlock.h | 44 +++-- arch/x86/kernel/paravirt-spinlocks.c | 15 +- arch/x86/xen/spinlock.c |7 - 5 files changed, 50 insertions(+), 54 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 5653f43..79c77e5 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -722,36 +722,14 @@ static inline void __set_fixmap(unsigned /* enum fixed_addresses */ idx, #if defined(CONFIG_SMP) defined(CONFIG_PARAVIRT_SPINLOCKS) -static inline int arch_spin_is_locked(struct arch_spinlock *lock) +static inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) { - return PVOP_CALL1(int, pv_lock_ops.spin_is_locked, lock); + PVOP_VCALL2(pv_lock_ops.lock_spinning, lock, ticket); } -static inline int arch_spin_is_contended(struct arch_spinlock *lock) +static inline void ticket_unlock_kick(struct arch_spinlock *lock, unsigned ticket) { - return PVOP_CALL1(int, pv_lock_ops.spin_is_contended, lock); -} -#define arch_spin_is_contended arch_spin_is_contended - -static __always_inline void arch_spin_lock(struct arch_spinlock *lock) -{ - PVOP_VCALL1(pv_lock_ops.spin_lock, lock); -} - -static __always_inline void arch_spin_lock_flags(struct arch_spinlock *lock, - unsigned long flags) -{ - PVOP_VCALL2(pv_lock_ops.spin_lock_flags, lock, flags); -} - -static __always_inline int arch_spin_trylock(struct arch_spinlock *lock) -{ - return PVOP_CALL1(int, pv_lock_ops.spin_trylock, lock); -} - -static __always_inline void arch_spin_unlock(struct arch_spinlock *lock) -{ - PVOP_VCALL1(pv_lock_ops.spin_unlock, lock); + PVOP_VCALL2(pv_lock_ops.unlock_kick, lock, ticket); } #endif diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index db9ef55..968ee31 100644 --- a/arch/x86/include/asm/paravirt_types.h 
+++ b/arch/x86/include/asm/paravirt_types.h @@ -316,12 +316,8 @@ struct pv_mmu_ops { struct arch_spinlock; struct pv_lock_ops { - int (*spin_is_locked)(struct arch_spinlock *lock); - int (*spin_is_contended)(struct arch_spinlock *lock); - void (*spin_lock)(struct arch_spinlock *lock); - void (*spin_lock_flags)(struct arch_spinlock *lock, unsigned long flags); - int (*spin_trylock)(struct arch_spinlock *lock); - void (*spin_unlock)(struct arch_spinlock *lock); + void (*lock_spinning)(struct arch_spinlock *lock, unsigned ticket); + void (*unlock_kick)(struct arch_spinlock *lock, unsigned ticket); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 507f83c..06815f3 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -52,6 +52,21 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) } #endif +/* How long a lock should spin before we consider blocking */ +#define SPIN_THRESHOLD (1 11) + +#ifndef CONFIG_PARAVIRT_SPINLOCKS + +static __always_inline void __ticket_lock_spinning(struct arch_spinlock *lock, unsigned ticket) +{ +} + +static __always_inline void ticket_unlock_kick
[PATCH 02/20] x86/ticketlock: convert spin loop to C
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com The inner loop of __ticket_spin_lock isn't doing anything very special, so reimplement it in C. For the 8 bit ticket lock variant, we use a register union to get direct access to the lower and upper bytes in the tickets, but unfortunately gcc won't generate a direct comparison between the two halves of the register, so the generated asm isn't quite as pretty as the hand-coded version. However benchmarking shows that this is actually a small improvement in runtime performance on some benchmarks, and never a slowdown. We also need to make sure there's a barrier at the end of the lock loop to make sure that the compiler doesn't move any instructions from within the locked region into the region where we don't yet own the lock. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 58 +++--- 1 files changed, 29 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index d6d5784..6711d36 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -58,21 +58,21 @@ #if (NR_CPUS 256) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { - unsigned short inc = 1 TICKET_SHIFT; - - asm volatile ( - LOCK_PREFIX xaddw %w0, %1\n - 1:\t - cmpb %h0, %b0\n\t - je 2f\n\t - rep ; nop\n\t - movb %1, %b0\n\t - /* don't need lfence here, because loads are in-order */ - jmp 1b\n - 2: - : +Q (inc), +m (lock-slock) - : - : memory, cc); + register union { + struct __raw_tickets tickets; + unsigned short slock; + } inc = { .slock = 1 TICKET_SHIFT }; + + asm volatile (LOCK_PREFIX xaddw %w0, %1\n + : +Q (inc), +m (lock-slock) : : memory, cc); + + for (;;) { + if (inc.tickets.head == inc.tickets.tail) + return; + cpu_relax(); + inc.tickets.head = ACCESS_ONCE(lock-tickets.head); + } + barrier(); /* make sure nothing creeps before the lock is taken */ } static __always_inline int 
__ticket_spin_trylock(arch_spinlock_t *lock) @@ -105,22 +105,22 @@ static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { unsigned inc = 1 TICKET_SHIFT; - unsigned tmp; + __ticket_t tmp; - asm volatile(LOCK_PREFIX xaddl %0, %1\n -movzwl %w0, %2\n\t -shrl $16, %0\n\t -1:\t -cmpl %0, %2\n\t -je 2f\n\t -rep ; nop\n\t -movzwl %1, %2\n\t -/* don't need lfence here, because loads are in-order */ -jmp 1b\n -2: -: +r (inc), +m (lock-slock), =r (tmp) -: -: memory, cc); + asm volatile(LOCK_PREFIX xaddl %0, %1\n\t +: +r (inc), +m (lock-slock) +: : memory, cc); + + tmp = inc; + inc = TICKET_SHIFT; + + for (;;) { + if ((__ticket_t)inc == tmp) + return; + cpu_relax(); + tmp = ACCESS_ONCE(lock-tickets.head); + } + barrier(); /* make sure nothing creeps before the lock is taken */ } static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 14/20] x86/ticketlock: loosen ordering restraints on unlock
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com It's only necessary to prevent the compiler from reordering code out of the locked region past the unlock. Putting too many barriers in prevents the compiler from generating the best code when PV spinlocks are enabled. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index b9a1aae..d6de5e7 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -46,9 +46,7 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) #else static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) { - barrier(); lock-tickets.head++; - barrier(); } #endif @@ -184,7 +182,11 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) static __always_inline void arch_spin_unlock(arch_spinlock_t *lock) { - __ticket_t next = lock-tickets.head + 1; + __ticket_t next; + + barrier(); /* prevent reordering out of locked region */ + + next = lock-tickets.head + 1; __ticket_unlock_release(lock); __ticket_unlock_kick(lock, next); } -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 20/20] x86/ticketlock: rename ticketpair to head_tail
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Make it clearer what fields head_tail is actually overlapping with. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h |2 +- arch/x86/include/asm/spinlock_types.h |4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index a16b6e4..6850884 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -174,7 +174,7 @@ static __always_inline int arch_spin_trylock(arch_spinlock_t *lock) new.slock = tmp.slock + (1 TICKET_SHIFT); - ret = cmpxchg(lock-ticketpair, tmp.slock, new.slock) == tmp.slock; + ret = cmpxchg(lock-head_tail, tmp.slock, new.slock) == tmp.slock; barrier(); /* just make nothing creeps before lock is claimed */ return ret; diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 307ef0b..df031ec 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -24,7 +24,7 @@ typedef struct arch_spinlock { struct { __ticket_t waiting; union { - __ticketpair_t ticketpair; + __ticketpair_t head_tail; struct __raw_tickets { __ticket_t head, tail; } tickets; @@ -38,7 +38,7 @@ typedef struct arch_spinlock { #else typedef struct arch_spinlock { union { - __ticketpair_t ticketpair; + __ticketpair_t head_tail; struct __raw_tickets { __ticket_t head, tail; } tickets; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 10/20] x86/pvticketlock: keep count of blocked cpus
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com When a CPU blocks by calling into __ticket_lock_spinning, keep a count in the spinlock. This allows __ticket_lock_kick to more accurately tell whether it has any work to do (in many cases, a spinlock may be contended, but none of the waiters have gone into blocking). This adds two locked instructions to the spinlock slow path (once the lock has already spun for SPIN_THRESHOLD iterations), and adds another one or two bytes to struct arch_spinlock. We need to make sure we increment the waiting counter before doing the last-chance check of the lock to see if we picked it up in the meantime. If we don't then there's a potential deadlock: lock holder lock waiter clear event channel check lock for pickup (did not) release lock check waiting counter (=0, no kick) add waiting counter block (=deadlock) Moving the add waiting counter earler avoids the deadlock: lock holder lock waiter clear event channel add waiting counter check lock for pickup (did not) release lock check waiting counter (=1, kick) block (and immediately wake) Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 27 ++- arch/x86/include/asm/spinlock_types.h |3 +++ arch/x86/xen/spinlock.c |4 3 files changed, 33 insertions(+), 1 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index a79dfee..3deabca 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -65,6 +65,31 @@ static __always_inline void ticket_unlock_kick(struct arch_spinlock *lock, u { } +static __always_inline bool __ticket_lock_waiters(const struct arch_spinlock *lock) +{ + return false; +} +#else +static inline void __ticket_add_waiting(struct arch_spinlock *lock) +{ + if (sizeof(lock-waiting) == sizeof(u8)) + asm (LOCK_PREFIX addb $1, %0 : +m (lock-waiting) : : memory); + else + asm (LOCK_PREFIX addw $1, %0 : +m (lock-waiting) : : memory); +} + +static inline void 
__ticket_sub_waiting(struct arch_spinlock *lock) +{ + if (sizeof(lock-waiting) == sizeof(u8)) + asm (LOCK_PREFIX subb $1, %0 : +m (lock-waiting) : : memory); + else + asm (LOCK_PREFIX subw $1, %0 : +m (lock-waiting) : : memory); +} + +static __always_inline bool __ticket_lock_waiters(const struct arch_spinlock *lock) +{ + return ACCESS_ONCE(lock-waiting) != 0; +} #endif /* CONFIG_PARAVIRT_SPINLOCKS */ /* @@ -106,7 +131,7 @@ static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spin */ static __always_inline void __ticket_unlock_kick(struct arch_spinlock *lock, __ticket_t next) { - if (unlikely(lock-tickets.tail != next)) + if (unlikely(__ticket_lock_waiters(lock))) ticket_unlock_kick(lock, next); } diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index 48dafc3..b396ed5 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -26,6 +26,9 @@ typedef struct arch_spinlock { __ticket_t head, tail; } tickets; }; +#ifdef CONFIG_PARAVIRT_SPINLOCKS + __ticket_t waiting; +#endif } arch_spinlock_t; #define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } } diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 5feb897..b8f09d5 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -119,6 +119,8 @@ static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) /* Only check lock once pending cleared */ barrier(); + __ticket_add_waiting(lock); + /* check again make sure it didn't become free while we weren't looking */ if (ACCESS_ONCE(lock-tickets.head) == want) { @@ -133,6 +135,8 @@ static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) kstat_incr_irqs_this_cpu(irq, irq_to_desc(irq)); out: + __ticket_sub_waiting(lock); + cpumask_clear_cpu(cpu, waiting_cpus); w-lock = NULL; spin_time_accum_blocked(start); -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 19/20] x86/ticketlocks: use overlapping read to eliminate mb()
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com When reading the 'waiting' counter, use a longer-than-necessary read which also overlaps 'head'. This read is guaranteed to be in-order with respect to and unlock writes to 'head', thereby eliminating the need for an explicit mb() to enforce the read-after-write ordering. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 14 +++--- arch/x86/include/asm/spinlock_types.h | 24 2 files changed, 27 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index c7455e1..a16b6e4 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -86,14 +86,14 @@ static inline void __ticket_sub_waiting(struct arch_spinlock *lock) static __always_inline bool __ticket_lock_waiters(const struct arch_spinlock *lock) { - /* -* Make sure everyone sees the unlock write before we check the -* waiting count. The processor ordering doesn't guarantee this -* because they're different memory locations. + /* +* lock-waiting_head is a union element which aliases both +* 'waiting' and 'head'. Since the memory access overlaps +* 'head', the read is forced to be in-order with respect to +* unlock writes to 'head', eliminating the need for an +* explicit mb(). (Intel memory ordering rules.) 
*/ - mb(); - - return ACCESS_ONCE(lock-waiting) != 0; + return ((__ticket_t)ACCESS_ONCE(lock-waiting_head)) != 0; } #endif /* CONFIG_PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index def8010..307ef0b 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -18,6 +18,24 @@ typedef u32 __ticketpair_t; #define TICKET_SHIFT (sizeof(__ticket_t) * 8) #define TICKET_MASK((1 TICKET_SHIFT) - 1) +#ifdef CONFIG_PARAVIRT_SPINLOCKS +typedef struct arch_spinlock { + union { + struct { + __ticket_t waiting; + union { + __ticketpair_t ticketpair; + struct __raw_tickets { + __ticket_t head, tail; + } tickets; + }; + }; + __ticketpair_t waiting_head; /* aliases waiting and head */ + }; +} arch_spinlock_t; + +#define __ARCH_SPIN_LOCK_UNLOCKED { { { 0, { 0 } } } } +#else typedef struct arch_spinlock { union { __ticketpair_t ticketpair; @@ -25,12 +43,10 @@ typedef struct arch_spinlock { __ticket_t head, tail; } tickets; }; -#ifdef CONFIG_PARAVIRT_SPINLOCKS - __ticket_t waiting; -#endif } arch_spinlock_t; -#define __ARCH_SPIN_LOCK_UNLOCKED { { .tickets = {0, 0} } } +#define __ARCH_SPIN_LOCK_UNLOCKED { { 0 } } +#endif typedef struct { unsigned int lock; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
[PATCH 09/20] xen/pvticketlock: Xen implementation for PV ticket locks
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Replace the old Xen implementation of PV spinlocks with and implementation of xen_lock_spinning and xen_unlock_kick. xen_lock_spinning simply registers the cpu in its entry in lock_waiting, adds itself to the waiting_cpus set, and blocks on an event channel until the channel becomes pending. xen_unlock_kick searches the cpus in waiting_cpus looking for the one which next wants this lock with the next ticket, if any. If found, it kicks it by making its event channel pending, which wakes it up. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/xen/spinlock.c | 281 ++- 1 files changed, 36 insertions(+), 245 deletions(-) diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 7e9eb52..5feb897 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -19,32 +19,21 @@ #ifdef CONFIG_XEN_DEBUG_FS static struct xen_spinlock_stats { - u64 taken; u32 taken_slow; - u32 taken_slow_nested; u32 taken_slow_pickup; u32 taken_slow_spurious; - u32 taken_slow_irqenable; - u64 released; u32 released_slow; u32 released_slow_kicked; #define HISTO_BUCKETS 30 - u32 histo_spin_total[HISTO_BUCKETS+1]; - u32 histo_spin_spinning[HISTO_BUCKETS+1]; u32 histo_spin_blocked[HISTO_BUCKETS+1]; - u64 time_total; - u64 time_spinning; u64 time_blocked; } spinlock_stats; static u8 zero_stats; -static unsigned lock_timeout = 1 10; -#define TIMEOUT lock_timeout - static inline void check_zero(void) { if (unlikely(zero_stats)) { @@ -73,22 +62,6 @@ static void __spin_time_accum(u64 delta, u32 *array) array[HISTO_BUCKETS]++; } -static inline void spin_time_accum_spinning(u64 start) -{ - u32 delta = xen_clocksource_read() - start; - - __spin_time_accum(delta, spinlock_stats.histo_spin_spinning); - spinlock_stats.time_spinning += delta; -} - -static inline void spin_time_accum_total(u64 start) -{ - u32 delta = xen_clocksource_read() - start; - - __spin_time_accum(delta, 
spinlock_stats.histo_spin_total); - spinlock_stats.time_total += delta; -} - static inline void spin_time_accum_blocked(u64 start) { u32 delta = xen_clocksource_read() - start; @@ -105,214 +78,76 @@ static inline u64 spin_time_start(void) return 0; } -static inline void spin_time_accum_total(u64 start) -{ -} -static inline void spin_time_accum_spinning(u64 start) -{ -} static inline void spin_time_accum_blocked(u64 start) { } #endif /* CONFIG_XEN_DEBUG_FS */ -struct xen_spinlock { - unsigned char lock; /* 0 - free; 1 - locked */ - unsigned short spinners;/* count of waiting cpus */ +struct xen_lock_waiting { + struct arch_spinlock *lock; + __ticket_t want; }; static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; +static DEFINE_PER_CPU(struct xen_lock_waiting, lock_waiting); +static cpumask_t waiting_cpus; -#if 0 -static int xen_spin_is_locked(struct arch_spinlock *lock) -{ - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - - return xl-lock != 0; -} - -static int xen_spin_is_contended(struct arch_spinlock *lock) -{ - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - - /* Not strictly true; this is only the count of contended - lock-takers entering the slow path. */ - return xl-spinners != 0; -} - -static int xen_spin_trylock(struct arch_spinlock *lock) -{ - struct xen_spinlock *xl = (struct xen_spinlock *)lock; - u8 old = 1; - - asm(xchgb %b0,%1 - : +q (old), +m (xl-lock) : : memory); - - return old == 0; -} - -static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners); - -/* - * Mark a cpu as interested in a lock. Returns the CPU's previous - * lock of interest, in case we got preempted by an interrupt. 
- */ -static inline struct xen_spinlock *spinning_lock(struct xen_spinlock *xl) -{ - struct xen_spinlock *prev; - - prev = __get_cpu_var(lock_spinners); - __get_cpu_var(lock_spinners) = xl; - - wmb(); /* set lock of interest before count */ - - asm(LOCK_PREFIX incw %0 - : +m (xl-spinners) : : memory); - - return prev; -} - -/* - * Mark a cpu as no longer interested in a lock. Restores previous - * lock of interest (NULL for none). - */ -static inline void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev) -{ - asm(LOCK_PREFIX decw %0 - : +m (xl-spinners) : : memory); - wmb(); /* decrement count before restoring lock */ - __get_cpu_var(lock_spinners) = prev; -} - -static noinline int xen_spin_lock_slow(struct arch_spinlock *lock, bool irq_enable) +static void xen_lock_spinning(struct arch_spinlock *lock, unsigned want) { - struct xen_spinlock *xl
[PATCH 05/20] x86/ticketlock: make __ticket_spin_lock common
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Aside from the particular form of the xadd instruction, they're identical. So factor out the xadd and use common code for the rest. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 42 ++ 1 files changed, 20 insertions(+), 22 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 7586d7a..4f9fa24 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -69,13 +69,27 @@ static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) * save some instructions and make the code more elegant. There really isn't * much between them in performance though, especially as locks are out of line. */ -#if (NR_CPUS 256) -static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) +static __always_inline struct __raw_tickets __ticket_spin_claim(struct arch_spinlock *lock) { - register struct __raw_tickets inc = { .tail = 1 }; + register struct __raw_tickets tickets = { .tail = 1 }; + + if (sizeof(lock-tickets.head) == sizeof(u8)) + asm volatile (LOCK_PREFIX xaddw %w0, %1\n + : +r (tickets), +m (lock-tickets) + : : memory, cc); + else + asm volatile (LOCK_PREFIX xaddl %0, %1\n +: +r (tickets), +m (lock-tickets) +: : memory, cc); - asm volatile (LOCK_PREFIX xaddw %w0, %1\n - : +r (inc), +m (lock-tickets) : : memory, cc); + return tickets; +} + +static __always_inline void __ticket_spin_lock(struct arch_spinlock *lock) +{ + register struct __raw_tickets inc; + + inc = __ticket_spin_claim(lock); for (;;) { if (inc.head == inc.tail) @@ -86,6 +100,7 @@ static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) barrier(); /* make sure nothing creeps before the lock is taken */ } +#if (NR_CPUS 256) static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { unsigned int tmp, new; @@ -105,23 +120,6 @@ static __always_inline int 
__ticket_spin_trylock(arch_spinlock_t *lock) return tmp; } #else -static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) -{ - register struct __raw_tickets inc = { .tickets.tail = 1 }; - - asm volatile(LOCK_PREFIX xaddl %0, %1\n\t -: +r (inc), +m (lock-tickets) -: : memory, cc); - - for (;;) { - if (inc.head == inc.tail) - return; - cpu_relax(); - inc.head = ACCESS_ONCE(lock-tickets.head); - } - barrier(); /* make sure nothing creeps before the lock is taken */ -} - static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock) { unsigned tmp; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 02/20] x86/ticketlock: convert spin loop to C
On 11/03/2010 11:11 AM, Eric Dumazet wrote: Le mercredi 03 novembre 2010 à 10:59 -0400, Jeremy Fitzhardinge a écrit : From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com The inner loop of __ticket_spin_lock isn't doing anything very special, so reimplement it in C. For the 8 bit ticket lock variant, we use a register union to get direct access to the lower and upper bytes in the tickets, but unfortunately gcc won't generate a direct comparison between the two halves of the register, so the generated asm isn't quite as pretty as the hand-coded version. However benchmarking shows that this is actually a small improvement in runtime performance on some benchmarks, and never a slowdown. We also need to make sure there's a barrier at the end of the lock loop to make sure that the compiler doesn't move any instructions from within the locked region into the region where we don't yet own the lock. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 58 +++--- 1 files changed, 29 insertions(+), 29 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index d6d5784..6711d36 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -58,21 +58,21 @@ #if (NR_CPUS 256) static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock) { -unsigned short inc = 1 TICKET_SHIFT; - -asm volatile ( -LOCK_PREFIX xaddw %w0, %1\n -1:\t -cmpb %h0, %b0\n\t -je 2f\n\t -rep ; nop\n\t -movb %1, %b0\n\t -/* don't need lfence here, because loads are in-order */ -jmp 1b\n -2: -: +Q (inc), +m (lock-slock) -: -: memory, cc); +register union { +struct __raw_tickets tickets; +unsigned short slock; +} inc = { .slock = 1 TICKET_SHIFT }; + +asm volatile (LOCK_PREFIX xaddw %w0, %1\n + : +Q (inc), +m (lock-slock) : : memory, cc); + +for (;;) { +if (inc.tickets.head == inc.tickets.tail) +return; +cpu_relax(); +inc.tickets.head = ACCESS_ONCE(lock-tickets.head); +} +barrier(); /* make 
sure nothing creeps before the lock is taken */ Isn't this barrier() never reached? Sorry, a later patch makes this clearer. I should have folded it in. J
[PATCH 18/20] x86/ticketlock: remove .slock
From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock_types.h |3 +-- 1 files changed, 1 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h index b396ed5..def8010 100644 --- a/arch/x86/include/asm/spinlock_types.h +++ b/arch/x86/include/asm/spinlock_types.h @@ -20,7 +20,6 @@ typedef u32 __ticketpair_t; typedef struct arch_spinlock { union { - unsigned int slock; __ticketpair_t ticketpair; struct __raw_tickets { __ticket_t head, tail; @@ -31,7 +30,7 @@ typedef struct arch_spinlock { #endif } arch_spinlock_t; -#define __ARCH_SPIN_LOCK_UNLOCKED { { .slock = 0 } } +#define __ARCH_SPIN_LOCK_UNLOCKED { { .tickets = {0, 0} } } typedef struct { unsigned int lock; -- 1.7.2.3 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH 03/20] x86/ticketlock: Use C for __ticket_spin_unlock
On 11/03/2010 11:13 AM, Eric Dumazet wrote: On Wednesday 03 November 2010 at 10:59 -0400, Jeremy Fitzhardinge wrote: From: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com If we don't need to use a locked inc for unlock, then implement it in C. Signed-off-by: Jeremy Fitzhardinge jeremy.fitzhardi...@citrix.com --- arch/x86/include/asm/spinlock.h | 33 ++--- 1 files changed, 18 insertions(+), 15 deletions(-) diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h index 6711d36..082990a 100644 --- a/arch/x86/include/asm/spinlock.h +++ b/arch/x86/include/asm/spinlock.h @@ -33,9 +33,23 @@ * On PPro SMP or if we are using OOSTORE, we use a locked operation to unlock * (PPro errata 66, 92) */ -# define UNLOCK_LOCK_PREFIX LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ +if (sizeof(lock->tickets.head) == sizeof(u8)) +asm (LOCK_PREFIX "incb %0" + : "+m" (lock->tickets.head) : : "memory"); +else +asm (LOCK_PREFIX "incw %0" + : "+m" (lock->tickets.head) : : "memory"); + +} #else -# define UNLOCK_LOCK_PREFIX +static __always_inline void __ticket_unlock_release(struct arch_spinlock *lock) +{ +barrier(); technically speaking, it should be : smp_wmb() Perhaps. In practice it won't make a difference because it is defined as barrier() unless OOSTORE is defined, in which case we need to do a locked increment anyway. Thanks, J
Re: [PATCH] xen: xenfs: privcmd: check put_user() return code
On 10/28/2010 04:39 AM, Vasiliy Kulikov wrote: put_user() may fail. In this case propagate error code from privcmd_ioctl_mmap_batch(). Thanks for looking at this. I'm in two minds about this; the existing logic is such that these put_users can only fail if something else has already failed and its returning an error. I guess it would be useful to get an EFAULT if you've got a problem writing back the results. IanC, any opinion? Thanks, J Signed-off-by: Vasiliy Kulikov sego...@gmail.com --- Compile tested. drivers/xen/xenfs/privcmd.c |8 ++-- 1 files changed, 2 insertions(+), 6 deletions(-) diff --git a/drivers/xen/xenfs/privcmd.c b/drivers/xen/xenfs/privcmd.c index f80be7f..2eb04c8 100644 --- a/drivers/xen/xenfs/privcmd.c +++ b/drivers/xen/xenfs/privcmd.c @@ -266,9 +266,7 @@ static int mmap_return_errors(void *data, void *state) xen_pfn_t *mfnp = data; struct mmap_batch_state *st = state; - put_user(*mfnp, st-user++); - - return 0; + return put_user(*mfnp, st-user++); } static struct vm_operations_struct privcmd_vm_ops; @@ -323,10 +321,8 @@ static long privcmd_ioctl_mmap_batch(void __user *udata) up_write(mm-mmap_sem); if (state.err 0) { - ret = 0; - state.user = m.arr; - traverse_pages(m.num, sizeof(xen_pfn_t), + ret = traverse_pages(m.num, sizeof(xen_pfn_t), pagelist, mmap_return_errors, state); } ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/virtualization
Re: [PATCH] xen: xenfs: privcmd: check put_user() return code
On 10/29/2010 10:44 AM, Ian Campbell wrote: On Fri, 2010-10-29 at 18:18 +0100, Jeremy Fitzhardinge wrote: On 10/28/2010 04:39 AM, Vasiliy Kulikov wrote: put_user() may fail. In this case propagate error code from privcmd_ioctl_mmap_batch().

Thanks for looking at this. I'm in two minds about this; the existing logic is such that these put_users can only fail if something else has already failed and it's returning an error. I guess it would be useful to get an EFAULT if you've got a problem writing back the results. IanC, any opinion?

Not a strong one. Perhaps what we really want in this case is for traverse_pages to return the total number of callback failures it encountered rather than aborting after the first failure? On the other hand, you are correct that gather_array() has already touched all the pages which we are going to be touching here, so how likely is a new failure at this point anyway?

I could think of two cases: the array is mapped RO, so only the writeback fails, or someone changes the mapping under our feet from another thread.

J

Ian.

Thanks, J Signed-off-by: Vasiliy Kulikov <sego...@gmail.com> --- Compile tested.
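The two error-handling policies Ian contrasts — abort on the first callback failure (what the patch makes traverse_pages do) versus counting all failures — are easy to sketch. This is a hypothetical userspace model, not the Xen traverse_pages() implementation:

```c
#include <assert.h>
#include <stddef.h>

/* Abort-on-first-error: the first nonzero callback return wins. */
static int traverse_abort(const int *items, size_t n, int (*fn)(int))
{
	for (size_t i = 0; i < n; i++) {
		int ret = fn(items[i]);
		if (ret)
			return ret;	/* propagate first failure */
	}
	return 0;
}

/* Count-all-failures: visit everything, report how many failed. */
static int traverse_count(const int *items, size_t n, int (*fn)(int))
{
	int failures = 0;

	for (size_t i = 0; i < n; i++)
		if (fn(items[i]))
			failures++;
	return failures;
}

/* Stand-in callback: negative items "fail" with -EFAULT (-14). */
static int fail_on_negative(int v)
{
	return v < 0 ? -14 : 0;
}
```

With the abort style, a single -EFAULT from put_user() surfaces directly as the ioctl's return value; with the count style, the caller would have to decide how to map a failure count onto an errno.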
Re: [Xen-devel] Re: [PATCH] x86/pvclock-xen: zero last_value on resume
On 10/26/2010 10:48 AM, Glauber Costa wrote: On Tue, 2010-10-26 at 09:59 -0700, Jeremy Fitzhardinge wrote: If the guest domain has been suspended/resumed or migrated, then the system clock backing the pvclock clocksource may revert to a smaller value (ie, can be non-monotonic across the migration/save-restore). Make sure we zero last_value in that case so that the domain continues to see clock updates. [ I don't know if kvm needs an analogous fix or not. ]

After migration, save/restore, etc, we issue an ioctl where we tell the host the last clock value. That (in theory) guarantees monotonicity. I am not opposed to this patch in any way, however.

Thanks. HPA, do you want to take this, or shall I send it on?

Thanks, J
[PATCH] x86/pvclock-xen: zero last_value on resume
If the guest domain has been suspended/resumed or migrated, then the system clock backing the pvclock clocksource may revert to a smaller value (ie, can be non-monotonic across the migration/save-restore). Make sure we zero last_value in that case so that the domain continues to see clock updates.

[ I don't know if kvm needs an analogous fix or not. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardi...@citrix.com>
Cc: Stable Kernel <sta...@kernel.org>
Reported-by: Eelco Dolstra <e.dols...@tudelft.nl>
Reported-by: Olivier Hanesse <olivier.hane...@gmail.com>
Bisected-by: Cédric Schieli <cschi...@gmail.com>
Tested-by: Cédric Schieli <cschi...@gmail.com>

diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h
index cd02f32..6226870 100644
--- a/arch/x86/include/asm/pvclock.h
+++ b/arch/x86/include/asm/pvclock.h
@@ -11,5 +11,6 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src);
 void pvclock_read_wallclock(struct pvclock_wall_clock *wall,
 			    struct pvclock_vcpu_time_info *vcpu,
 			    struct timespec *ts);
+void pvclock_resume(void);
 
 #endif /* _ASM_X86_PVCLOCK_H */
diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 239427c..a4f07c1 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -120,6 +120,11 @@ unsigned long pvclock_tsc_khz(struct pvclock_vcpu_time_info *src)
 
 static atomic64_t last_value = ATOMIC64_INIT(0);
 
+void pvclock_resume(void)
+{
+	atomic64_set(&last_value, 0);
+}
+
 cycle_t pvclock_clocksource_read(struct pvclock_vcpu_time_info *src)
 {
 	struct pvclock_shadow_time shadow;
diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
index b2bb5aa..5da5e53 100644
--- a/arch/x86/xen/time.c
+++ b/arch/x86/xen/time.c
@@ -426,6 +426,8 @@ void xen_timer_resume(void)
 {
 	int cpu;
 
+	pvclock_resume();
+
 	if (xen_clockevent != xen_vcpuop_clockevent)
 		return;
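The reason last_value must be reset is the monotonicity clamp in pvclock_clocksource_read(): reads are clamped to never fall below the largest value handed out so far, so if the underlying clock restarts lower after a restore, time would appear frozen until it catches up. A hedged userspace sketch of that clamp (function names prefixed `demo_` are illustrative; the real code uses an atomic64 cmpxchg loop to be SMP-safe):

```c
#include <assert.h>
#include <stdint.h>

/* Largest value returned so far; the kernel keeps this in an atomic64_t. */
static uint64_t demo_last_value;

/* Clamp 'now' so callers never observe time moving backwards. */
static uint64_t demo_pvclock_read(uint64_t now)
{
	if (now < demo_last_value)
		return demo_last_value;	/* clock went backwards: hold */
	demo_last_value = now;
	return now;
}

/* After migration/save-restore the backing clock may legitimately
 * restart at a lower value; zeroing the high-water mark lets the
 * clamp accept the new epoch instead of freezing. */
static void demo_pvclock_resume(void)
{
	demo_last_value = 0;
}
```

Without the resume hook, a domain restored onto a host whose system time maps to a smaller pvclock value would see demo_pvclock_read() return the stale pre-migration maximum indefinitely.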
Re: xen PV on HVM and initial domain merge in linux-next
Yes all ok. J

Stefano Stabellini <stefano.stabell...@eu.citrix.com> wrote: On Wed, 20 Oct 2010, Stephen Rothwell wrote: Hi Stefano, [just casting the net a bit wider ...]

On Tue, 19 Oct 2010 18:51:47 +0100 Stefano Stabellini <stefano.stabell...@eu.citrix.com> wrote: I forgot to CC the LKML and linux-next... On Tue, 19 Oct 2010, Stefano Stabellini wrote: Stephen, I have two patch series to merge in linux-next: PV on HVM: receive interrupts as xen events xen: initial domain support. They have all the acked-bys needed and have both been stable for several weeks; however, they depend on Konrad's xen-pcifront series and for this reason I waited until now to ask for a merge in linux-next. Could you please pull: git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git linux-next-initial-domain-v4 it contains both series rebased on Konrad's pcifront series merged on linux-next (warning: it still contains the merge commit of xen-pcifront-0.8.2 in linux-next). Let me know if you have any conflicts or if you need me to change the branch somehow.

Not following the Xen development at all, I would like to have a positive reply from the listed Xen contacts, please,

Sure. Jeremy?

I do have concerns that this is turning up so late, but I realise that that is mainly due to a misunderstanding on the part of some of the Xen community.

Thank you very much for understanding!

Also, the above tree is based on next-20101019 which means that I cannot use it as is. All the trees merged into linux-next must be based on some other stable tree (almost always Linus' tree). linux-next is rebuilt from scratch every day, so I cannot ever include a previous day's version. Merging in other stable trees is OK (as long as the other maintainer is aware of that and makes sure that their tree does not rebase). Basically what you send to me should be what you intend to send to Linus during the next merge window.

All right.
I merged Jeremy's and Konrad's branches (the ones you just merged on linux-next) on top of linux 2.6.36 rc8, then I rebased my series on top of the result. Please checkout this branch: git://xenbits.xen.org/people/sstabellini/linux-pvhvm.git 2.6.36-rc8-initial-domain-v5 and let me know if it is suitable, it shouldn't have any merge conflicts. Cheers, Stefano -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
Re: [Xen-devel] PROBLEM: [BISECTED] 2.6.35.5 xen domU panics just after the boot
On 09/23/2010 10:40 PM, Paweł Zuzelski wrote: Have you seen the message from James Dingwall? He had found the typo in fb412a178502dc498430723b082a932f797e4763 commit and sent the patch to lkml that fixes it. Please, see https://patchwork.kernel.org/patch/202282/ if you have not received James' mail. That patch resolves the problem for me and for James. (!) Oh, well. Yes, that would explain it. I'd noticed the missing 'x', but assumed I'd just done it with an editor fat-finger... Looks like upstream is OK, so it crept in with the stable cherry-pick/backport to both stable-2.6.32/master and stable-2.6.35/master. J
Re: [Xen-devel] PROBLEM: [BISECTED] 2.6.35.5 xen domU panics just after the boot
On 09/23/2010 07:13 PM, Dan Magenheimer wrote: Jeremy -- FYI, I think I've also seen this problem, or something similar, but ONLY on a Nehalem box (and only intermittently), not on my Core 2 Duo boxen. The Nehalem box is an SDP so I had assumed that it was something to do with that, but maybe not. Maybe some feature is leaking through to the guest. Hyperthreading? MSI? EPT? Anyway, if you have a newer box, you might try reproducing on that, rather than your usual development box (which IIRC was a Core 2 Duo laptop?) No, this bug turned out to be a simple typo in the patch that got merged into the upstream stable kernels. J -Original Message- From: Jeremy Fitzhardinge [mailto:jer...@goop.org] Sent: Thursday, September 23, 2010 6:02 PM To: Pawel Zuzelski Cc: virtualizat...@lists.osdl.org; Jeremy Fitzhardinge; xen- de...@lists.xensource.com; lkml Subject: Re: [Xen-devel] PROBLEM: [BISECTED] 2.6.35.5 xen domU panics just after the boot On 09/21/2010 12:05 PM, Paweł Zuzelski wrote: Hello, kernels 2.6.35.5 and 2.6.32.22 xen domU panics at the very begining of the boot process. I have bisected it to a single commit, and the first bad commit is: [fb412a178502dc498430723b082a932f797e4763] xen: use percpu interrupts for IPIs and VIRQs kernel v2.6.35.5 with reverted this commit works for me. Thanks very much for doing that. I'll have to work out what's going on (obviously it doesn't do it for me). J Here are the kernel configs I was using: http://carme.pld-linux.org/~pawelz/kernel-2.6.35.5-domU-config http://carme.pld-linux.org/~pawelz/kernel-2.6.32.22-domU-config As you can see they are stripped down configs, intended to run in domU only. I was testing it with the very simple domU configuration: kernel = '/srv/xen/bzImage' memory = '128' vcpus = 2 name= 'test' on_poweroff = 'destroy' on_reboot = 'restart' on_crash= 'restart' Here is the full output of kernel 2.6.35.5: Using config file /etc/xen/test. 
Re: [Xen-devel] PROBLEM: [BISECTED] 2.6.35.5 xen domU panics just after the boot
On 09/21/2010 12:05 PM, Paweł Zuzelski wrote: Hello, kernels 2.6.35.5 and 2.6.32.22 xen domU panics at the very begining of the boot process. I have bisected it to a single commit, and the first bad commit is: [fb412a178502dc498430723b082a932f797e4763] xen: use percpu interrupts for IPIs and VIRQs kernel v2.6.35.5 with reverted this commit works for me. Thanks very much for doing that. I'll have to work out what's going on (obviously it doesn't do it for me). J Here are the kernel configs I was using: http://carme.pld-linux.org/~pawelz/kernel-2.6.35.5-domU-config http://carme.pld-linux.org/~pawelz/kernel-2.6.32.22-domU-config As you can see they are stripped down configs, intended to run in domU only. I was testing it with the very simple domU configuration: kernel = '/srv/xen/bzImage' memory = '128' vcpus = 2 name= 'test' on_poweroff = 'destroy' on_reboot = 'restart' on_crash= 'restart' Here is the full output of kernel 2.6.35.5: Using config file /etc/xen/test. Started domain test [0.00] Policy zone: DMA32 [0.00] Kernel command line: [0.00] PID hash table entries: 512 (order: 0, 4096 bytes) [0.00] Subtract (33 early reservations) [0.00] #1 [0001976000 - 0001987000] XEN PAGETABLES [0.00] #2 [000100 - 00019125f8] TEXT DATA BSS [0.00] #3 [0001933000 - 0001976000] XEN START INFO [0.00] #4 [01 - 012000] TRAMPOLINE [0.00] #5 [012000 - 04] PGTABLE [0.00] #6 [0001912600 - 0001917600] NODE_DATA [0.00] #7 [0001917600 - 0001918600] BOOTMEM [0.00] #8 [0001918600 - 0001918618] BOOTMEM [0.00] #9 [0001919000 - 000191a000] BOOTMEM [0.00] #10 [000191a000 - 000191b000] BOOTMEM [0.00] #11 [000191b000 - 000191c000] BOOTMEM [0.00] #12 [000220 - 00023c]MEMMAP 0 [0.00] #13 [0001918640 - 00019187c0] BOOTMEM [0.00] #14 [000191c000 - 000191cc00] BOOTMEM [0.00] #15 [000191d000 - 000191e000] BOOTMEM [0.00] #16 [000191e000 - 000191f000] BOOTMEM [0.00] #17 [000191f000 - 000192] BOOTMEM [0.00] #18 [00019187c0 - 00019188a0] BOOTMEM [0.00] #19 [00019188c0 - 0001918928] BOOTMEM [0.00] #20 
[0001918940 - 00019189a8] BOOTMEM [0.00] #21 [00019189c0 - 0001918a28] BOOTMEM [0.00] #22 [0001918a40 - 0001918a41] BOOTMEM [0.00] #23 [0001918a80 - 0001918a81] BOOTMEM [0.00] #24 [0001987000 - 00019c1000] BOOTMEM [0.00] #25 [0001918ac0 - 0001918ac8] BOOTMEM [0.00] #26 [0001918b00 - 0001918b08] BOOTMEM [0.00] #27 [0001918b40 - 0001918b48] BOOTMEM [0.00] #28 [0001918b80 - 0001918b90] BOOTMEM [0.00] #29 [0001918bc0 - 0001918cc0] BOOTMEM [0.00] #30 [0001918cc0 - 0001918d08] BOOTMEM [0.00] #31 [0001918d40 - 0001918d88] BOOTMEM [0.00] #32 [000192 - 0001921000] BOOTMEM [0.00] Memory: 118724k/131072k available (3327k kernel code, 448k absent, 11900k reserved, 3931k data, 440k init) [0.00] SLUB: Genslabs=14, HWalign=64, Order=0-3, MinObjects=0, CPUs=2, Nodes=1 [0.00] Hierarchical RCU implementation. [0.00]RCU-based detection of stalled CPUs is disabled. [0.00]Verbose stalled-CPUs detection is disabled. [0.00] NR_IRQS:2304 [0.00] Console: colour dummy device 80x25 [0.00] console [tty0] enabled [0.00] console [hvc0] enabled [0.00] installing Xen timer for CPU 0 [0.00] BUG: unable to handle kernel NULL pointer dereference at (null) [0.00] IP: [(null)] (null) [0.00] PGD 0 [0.00] Oops: 0010 [#1] SMP [0.00] last sysfs file: [0.00] CPU 0 [0.00] Modules linked in: [0.00] [0.00] Pid: 0, comm: swapper Not tainted 2.6.35.5 #1 / [0.00] RIP: e030:[] [(null)] (null) [0.00] RSP: e02b:81601d70 EFLAGS: 00010082 [0.00] RAX: 818fdb50 RBX: RCX: [0.00] RDX: RSI: 818c7958 RDI: [0.00] RBP: 81601d88 R08: ea1b22d8 R09: 001a [0.00] R10: R11: 6477 R12: 81623280 [0.00] R13: R14: ffea R15: [0.00] FS: () GS:880001987000() knlGS: [0.00] CS: e033 DS: ES: CR0: 8005003b [0.00] CR2: CR3: 016b9000
Re: [PATCH 1/4] x86: remove cast from void*
On 09/13/2010 10:18 PM, matt mooney wrote: Unnecessary cast from void* in assignment.

Signed-off-by: matt mooney <m...@muteddisk.com>
---
 arch/x86/xen/mmu.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 42086ac..7436283 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -737,7 +737,7 @@ pgd_t *xen_get_user_pgd(pgd_t *pgd)
 	if (offset < pgd_index(USER_LIMIT)) {
 		struct page *page = virt_to_page(pgd_page);
 
-		user_ptr = (pgd_t *)page->private;
+		user_ptr = page->private;

Um, page->private is unsigned long anyway, so the cast is needed either way.

J
Re: [PATCH 1/4] x86: remove cast from void*
On 09/13/2010 10:18 PM, matt mooney wrote: Unnecessary cast from void* in assignment.

Not very keen on this. The cast may not be strictly required, but it does document what's going on there.

J
Re: [PATCH 1/4] x86: remove cast from void*
On 09/14/2010 11:20 AM, H. Peter Anvin wrote: On 09/14/2010 10:49 AM, Jeremy Fitzhardinge wrote: On 09/13/2010 10:18 PM, matt mooney wrote: Unnecessary cast from void* in assignment.

Not very keen on this. The cast may not be strictly required, but it does document what's going on there.

But unnecessary casts are problematic in that if the type changes, they can hide a real bug in the future.

If page->private changes in a way which invalidates this kind of usage, I would hope it would also change name to something more descriptive of its new role. If it changes type but is broadly compatible with this usage (say, void * -> unsigned long), then the cast won't matter, or will become a requirement. I'm more concerned about user_ptr changing type, in which case the cast will highlight the problem.

In general, the casts from these kinds of generic/polymorphic fields let you easily eyeball how they're being used, without having to work out what the type of the other side is.

J
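For reference, the field under discussion really does involve a type conversion: page->private is unsigned long, and assigning it to a pgd_t * without a cast is a constraint violation in C that compilers diagnose. A minimal standalone illustration, using hypothetical stand-in types (not the kernel's definitions):

```c
#include <assert.h>

/* Stand-ins for the kernel types (hypothetical, simplified). */
typedef struct { unsigned long pgd; } pgd_t;
struct page {
	unsigned long private;	/* generic storage, here stashing a pointer */
};

static pgd_t *get_user_pgd(struct page *page)
{
	/* The explicit cast satisfies the compiler and, as argued in
	 * the thread, documents that 'private' holds a pointer here. */
	return (pgd_t *)page->private;
}
```

Dropping the cast (`return page->private;`) would draw a "return makes pointer from integer without a cast" diagnostic from gcc, which is why the original cast-removal patch was rejected.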