Re: [PATCH 0/9] qspinlock stuff -v15
On 03/30/2015 12:29 PM, Peter Zijlstra wrote:
> On Mon, Mar 30, 2015 at 12:25:12PM -0400, Waiman Long wrote:
>> I did it differently in my PV portion of the qspinlock patch. Instead of
>> just waking up the CPU, the new lock holder will check if the new queue
>> head has been halted. If so, it will set the slowpath flag for the halted
>> queue head in the lock so as to wake it up at unlock time. This should
>> eliminate your concern of doing twice as many VMEXITs in an overcommitted
>> scenario.
>
> We can still do that on top of all this, right? As you might have realized,
> I'm a fan of gradual complexity :-)

Of course. I am just saying that the concern can be addressed with some
additional code change.

-Longman
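[For readers unfamiliar with the mechanism being discussed, here is a minimal
userspace sketch of the idea Waiman describes: the new lock holder notices
that the queue head has already halted and marks the lock word, so that the
eventual unlock kicks that waiter directly instead of forcing a second
halt/wake round trip. The value names (_Q_LOCKED_VAL, _Q_SLOW_VAL), the
pv_node structure and pv_kick() are illustrative stand-ins, not symbols from
the posted patches, and races between halting and checking are ignored.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define _Q_LOCKED_VAL	1U	/* lock held, fast-path unlock is fine   */
#define _Q_SLOW_VAL	3U	/* lock held, queue head halted: kick it */

struct pv_node {
	atomic_bool halted;	/* set when this waiter's vCPU goes to sleep */
};

static atomic_uint lock_val;	/* models the qspinlock locked byte */

/* New lock holder: the lock is taken, now look at the next queue head. */
static void pv_lock_acquired(struct pv_node *next_head)
{
	atomic_store(&lock_val, _Q_LOCKED_VAL);

	/*
	 * If the next queue head already halted, flag the lock so the
	 * unlock path knows it must wake that waiter.
	 */
	if (next_head && atomic_load(&next_head->halted))
		atomic_store(&lock_val, _Q_SLOW_VAL);
}

/* Stand-in for the hypercall that wakes a halted vCPU. */
static void pv_kick(struct pv_node *node)
{
	atomic_store(&node->halted, false);
	printf("kick queue head\n");
}

static void pv_unlock(struct pv_node *head)
{
	/* Release the lock and learn whether a waiter was halted. */
	unsigned int old = atomic_exchange(&lock_val, 0);

	if (old == _Q_SLOW_VAL)
		pv_kick(head);	/* exactly one wakeup, at unlock time */
}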
Re: [PATCH 0/9] qspinlock stuff -v15
On 03/27/2015 10:07 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote:
>> On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:
>>> Ah nice. That could be spun out as a separate patch to optimize the
>>> existing ticket locks, I presume.
>>
>> Yes, I suppose we can do something similar for the ticket and patch in
>> the right increment. We'd need to restructure the code a bit, but it's
>> not fundamentally impossible.
>>
>> We could equally apply the head hashing to the current ticket
>> implementation and avoid the current bitmap iteration.
>>
>>> Now with the old pv ticketlock code a vCPU would only go to sleep once
>>> and be woken up when it was its turn. With this new code it is woken up
>>> twice (and twice it goes to sleep). In an overcommit scenario this would
>>> imply that we will have at least twice as many VMEXITs as with the
>>> previous code.
>>
>> An astute observation, I had not considered that.
>
> Thank you.
>
>>> I presume when you did benchmarking this did not even register? Though I
>>> wonder if it would if you ran the benchmark for a week or so.
>>
>> You presume I benchmarked :-) I managed to boot something virt and run
>> hackbench in it. I wouldn't know a representative virt setup if I ran
>> into it.
>>
>> The thing is, we want this qspinlock for real hardware because it's
>> faster and I really want to avoid having to carry two spinlock
>> implementations -- although I suppose that if we really, really have to,
>> we could.
>
> In some way you already have that - for virtualized environments where
> you don't have a PV mechanism you just use the byte spinlock - which is
> good. And switching to a PV ticketlock implementation after boot.. ugh.
> I feel your pain.
>
> What if you used a PV bytelock implementation? The code you posted
> already 'sprays' all the vCPUs to wake up. And that is exactly what you
> need for PV bytelocks - well, you only need to wake up the vCPUs that
> have gone to sleep waiting on a specific 'struct spinlock' and just
> stash those in a per-cpu area. The old Xen spinlock code (before 3.11?)
> had this.
>
> Just an idea though.

The current code should have just woken up one sleeping vCPU. We shouldn't
want to wake up all of them and have almost all except one go back to
sleep. I think the PV bytelock you suggest is workable. It should also
simplify the implementation. It is just a matter of how much we value the
fairness attribute of the PV ticket or queue spinlock implementation that
we have.

-Longman
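[To make the "stash the lock in a per-cpu area and only kick the vCPUs
parked on it" idea concrete, here is a rough userspace model of that scheme,
roughly what the old Xen PV byte-lock code did. NCPUS, cpu_kick() and the
field names are illustrative assumptions, not code from any of the patches
under discussion.]

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

#define NCPUS 16

struct spinlock;	/* opaque here; only its address matters */

/* Per-cpu slot: the lock this vCPU is blocked on, or NULL. */
static _Atomic(struct spinlock *) waiting_on[NCPUS];

/* Stand-in for the hypercall waking a specific vCPU. */
static void cpu_kick(int cpu)
{
	printf("kick vCPU %d\n", cpu);
}

/* Called by a vCPU just before it halts while spinning on @lock. */
static void pv_block_on(int cpu, struct spinlock *lock)
{
	atomic_store(&waiting_on[cpu], lock);
	/* ... halt via hypercall until kicked ... */
	atomic_store(&waiting_on[cpu], (struct spinlock *)NULL);
}

/* Called by the unlocker: wake only the vCPUs parked on this lock. */
static void pv_unlock_kick(struct spinlock *lock)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		if (atomic_load(&waiting_on[cpu]) == lock)
			cpu_kick(cpu);
}

[Waiman's point above is that a queue-based lock already knows the single
queue head, so only that one vCPU needs the kick; the scan over all CPUs is
the price the byte-lock variant would pay instead.]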
Re: [PATCH 0/9] qspinlock stuff -v15
On Mon, Mar 30, 2015 at 12:25:12PM -0400, Waiman Long wrote:
> I did it differently in my PV portion of the qspinlock patch. Instead of
> just waking up the CPU, the new lock holder will check if the new queue
> head has been halted. If so, it will set the slowpath flag for the halted
> queue head in the lock so as to wake it up at unlock time. This should
> eliminate your concern of doing twice as many VMEXITs in an overcommitted
> scenario.

We can still do that on top of all this, right? As you might have realized,
I'm a fan of gradual complexity :-)
Re: [PATCH 0/9] qspinlock stuff -v15
On 03/25/2015 03:47 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote:
>> Hi Waiman,
>>
>> As promised; here is the paravirt stuff I did during the trip to BOS
>> last week.
>>
>> All the !paravirt patches are more or less the same as before (the only
>> real change is the copyright lines in the first patch).
>>
>> The paravirt stuff is 'simple' and KVM only -- the Xen code was a little
>> more convoluted and I've no real way to test that, but it should be
>> straightforward to make work.
>>
>> I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
>> overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually
>> has) and it both booted and survived a hackbench run (perf bench sched
>> messaging -g 20 -l 5000).
>>
>> So while the paravirt code isn't the most optimal code ever conceived,
>> it does work.
>>
>> Also, the paravirt patching includes replacing the call with movb $0,
>> %arg1 for the native case, which should greatly reduce the cost of
>> having CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.
>
> Ah nice. That could be spun out as a separate patch to optimize the
> existing ticket locks, I presume.

The goal is to replace the ticket spinlock by the queue spinlock. We may
not want to support two different spinlock implementations in the kernel.

> Now with the old pv ticketlock code a vCPU would only go to sleep once
> and be woken up when it was its turn. With this new code it is woken up
> twice (and twice it goes to sleep). In an overcommit scenario this would
> imply that we will have at least twice as many VMEXITs as with the
> previous code.

I did it differently in my PV portion of the qspinlock patch. Instead of
just waking up the CPU, the new lock holder will check if the new queue
head has been halted. If so, it will set the slowpath flag for the halted
queue head in the lock so as to wake it up at unlock time. This should
eliminate your concern of doing twice as many VMEXITs in an overcommitted
scenario.

BTW, I did some qspinlock vs. ticket spinlock benchmarks using the AIM7
high_systime workload on a 4-socket IvyBridge-EX system (60 cores, 120
threads), with some interesting results.

In terms of the performance benefit of this patch, I ran the high_systime
workload (which does a lot of fork() and exit()) at various load levels
(500, 1000, 1500 and 2000 users) on a 4-socket IvyBridge-EX bare-metal
system (60 cores, 120 threads) with the intel_pstate driver and the
performance scaling governor. The JPM (jobs/minute) and execution time
results were as follows:

  Kernel            JPM          Execution Time
  ------            ---          --------------
  At 500 users:
  3.19              118857.14    26.25s
  3.19-qspinlock    134889.75    23.13s
  % change          +13.5%       -11.9%

  At 1000 users:
  3.19              204255.32    30.55s
  3.19-qspinlock    239631.34    26.04s
  % change          +17.3%       -14.8%

  At 1500 users:
  3.19              177272.73    52.80s
  3.19-qspinlock    326132.40    28.70s
  % change          +84.0%       -45.6%

  At 2000 users:
  3.19              196690.31    63.45s
  3.19-qspinlock    341730.56    36.52s
  % change          +73.7%       -42.4%

It turns out that this workload was causing quite a lot of spinlock
contention in the vanilla 3.19 kernel. The performance advantage of this
patch increases with heavier loads.

With the powersave governor, the JPM data were as follows:

  Users     3.19          3.19-qspinlock    % Change
  -----     ----          --------------    --------
   500      112635.38     132596.69         +17.7%
  1000      171240.40     240369.80         +40.4%
  1500      130507.53     324436.74         +148.6%
  2000      175972.93     341637.01         +94.1%

With the qspinlock patch, there wasn't too much difference in performance
between the two scaling governors. Without this patch, the powersave
governor was much slower than the performance governor.
By disabling the intel_pstate driver and using acpi_cpufreq instead, the
benchmark performance (JPM) at the 1000-user level for the performance and
ondemand governors was:

  Governor       3.19          3.19-qspinlock    % Change
  --------       ----          --------------    --------
  performance    124949.94     219950.65         +76.0%
  ondemand         4838.90     206690.96         +4171%

The performance was just horrible when there was significant spinlock
contention with the ondemand governor. There was also significant
run-to-run variation; a second run of the same benchmark gave a result of
22115 JPMs. With the qspinlock patch, however, the performance was much
more stable under different cpufreq drivers and governors. That is not the
case with the default ticket spinlock implementation.

The %CPU times spent on spinlock contention (from perf) with the
performance governor and the intel_pstate driver were:

  Kernel Function       3.19 kernel    3.19-qspinlock kernel
  ---------------       -----------    ---------------------
Re: [PATCH 0/9] qspinlock stuff -v15
On 03/16/2015 06:46 PM, Peter Zijlstra wrote:
> Hi Waiman,
>
> As promised; here is the paravirt stuff I did during the trip to BOS last
> week.
>
> All the !paravirt patches are more or less the same as before (the only
> real change is the copyright lines in the first patch).
>
> The paravirt stuff is 'simple' and KVM only -- the Xen code was a little
> more convoluted and I've no real way to test that, but it should be
> straightforward to make work.
>
> I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
> overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually
> has) and it both booted and survived a hackbench run (perf bench sched
> messaging -g 20 -l 5000).
>
> So while the paravirt code isn't the most optimal code ever conceived, it
> does work.
>
> Also, the paravirt patching includes replacing the call with movb $0,
> %arg1 for the native case, which should greatly reduce the cost of having
> CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.
>
> I feel that if someone were to do a Xen patch we can go ahead and merge
> this stuff (finally!).
>
> These patches do not implement the paravirt spinlock debug stats
> currently implemented (separately) by KVM and Xen, but that should not be
> too hard to do on top and in the 'generic' code -- no reason to duplicate
> all that.
>
> Of course; once this lands people can look at improving the paravirt
> nonsense.

Last time I had reported some hangs in the kvm case, and I can confirm that
the current set of patches works fine. Feel free to add

Tested-by: Raghavendra K T raghavendra...@linux.vnet.ibm.com #kvm pv

As far as performance is concerned (with my 16-core + HT machine having
16-vcpu guests, even with and without the lfsr hash patchset), I do not see
any significant observations to report, though I understand that we could
see much more benefit with a large number of vcpus because of the possible
reduction in cache bouncing.
Re: [PATCH 0/9] qspinlock stuff -v15
On Thu, Mar 26, 2015 at 09:21:53PM +0100, Peter Zijlstra wrote:
> On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:
>> Ah nice. That could be spun out as a separate patch to optimize the
>> existing ticket locks, I presume.
>
> Yes, I suppose we can do something similar for the ticket and patch in
> the right increment. We'd need to restructure the code a bit, but it's
> not fundamentally impossible.
>
> We could equally apply the head hashing to the current ticket
> implementation and avoid the current bitmap iteration.
>
>> Now with the old pv ticketlock code a vCPU would only go to sleep once
>> and be woken up when it was its turn. With this new code it is woken up
>> twice (and twice it goes to sleep). In an overcommit scenario this would
>> imply that we will have at least twice as many VMEXITs as with the
>> previous code.
>
> An astute observation, I had not considered that.

Thank you.

>> I presume when you did benchmarking this did not even register? Though I
>> wonder if it would if you ran the benchmark for a week or so.
>
> You presume I benchmarked :-) I managed to boot something virt and run
> hackbench in it. I wouldn't know a representative virt setup if I ran
> into it.
>
> The thing is, we want this qspinlock for real hardware because it's
> faster and I really want to avoid having to carry two spinlock
> implementations -- although I suppose that if we really, really have to,
> we could.

In some way you already have that - for virtualized environments where you
don't have a PV mechanism you just use the byte spinlock - which is good.
And switching to a PV ticketlock implementation after boot.. ugh. I feel
your pain.

What if you used a PV bytelock implementation? The code you posted already
'sprays' all the vCPUs to wake up. And that is exactly what you need for PV
bytelocks - well, you only need to wake up the vCPUs that have gone to
sleep waiting on a specific 'struct spinlock' and just stash those in a
per-cpu area. The old Xen spinlock code (before 3.11?) had this.

Just an idea though.
Re: [PATCH 0/9] qspinlock stuff -v15
On Wed, Mar 25, 2015 at 03:47:39PM -0400, Konrad Rzeszutek Wilk wrote:
> Ah nice. That could be spun out as a separate patch to optimize the
> existing ticket locks, I presume.

Yes, I suppose we can do something similar for the ticket and patch in the
right increment. We'd need to restructure the code a bit, but it's not
fundamentally impossible.

We could equally apply the head hashing to the current ticket
implementation and avoid the current bitmap iteration.

> Now with the old pv ticketlock code a vCPU would only go to sleep once
> and be woken up when it was its turn. With this new code it is woken up
> twice (and twice it goes to sleep). In an overcommit scenario this would
> imply that we will have at least twice as many VMEXITs as with the
> previous code.

An astute observation, I had not considered that.

> I presume when you did benchmarking this did not even register? Though I
> wonder if it would if you ran the benchmark for a week or so.

You presume I benchmarked :-) I managed to boot something virt and run
hackbench in it. I wouldn't know a representative virt setup if I ran into
it.

The thing is, we want this qspinlock for real hardware because it's faster
and I really want to avoid having to carry two spinlock implementations --
although I suppose that if we really, really have to, we could.
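[The "head hashing" remark refers to looking the lock up in a small hash
table to find the one waiter to kick, rather than iterating a bitmap of
halted vCPUs at unlock time. Below is a hedged sketch of such a lookup
structure; the table size, the hash function and the pv_hash()/pv_unhash()
names are assumptions for illustration, it assumes the entry is always
found and the table never fills, and the store ordering is simplified
compared to what real kernel code would need.]

#include <stdatomic.h>
#include <stdint.h>

#define HASH_BITS 6
#define HASH_SIZE (1U << HASH_BITS)

struct pv_hash_entry {
	_Atomic(uintptr_t) lock;	/* lock address, 0 if the slot is free */
	int                cpu;		/* halted waiter to kick at unlock     */
};

static struct pv_hash_entry pv_hash_table[HASH_SIZE];

/* Cheap multiplicative hash of the lock address. */
static unsigned int hash_lock(uintptr_t lock)
{
	return (unsigned int)((lock * 0x9E3779B97F4A7C15ULL) >> (64 - HASH_BITS));
}

/* Waiter side: record "cpu is parked on lock" before halting. */
static void pv_hash(void *lockp, int cpu)
{
	uintptr_t lock = (uintptr_t)lockp;
	unsigned int i;

	for (i = hash_lock(lock); ; i = (i + 1) & (HASH_SIZE - 1)) {
		uintptr_t expected = 0;

		/* Claim the first free slot by open addressing. */
		if (atomic_compare_exchange_strong(&pv_hash_table[i].lock,
						   &expected, lock)) {
			pv_hash_table[i].cpu = cpu;	/* real code orders these stores */
			return;
		}
	}
}

/* Unlocker side: one lookup instead of a bitmap scan over all vCPUs. */
static int pv_unhash(void *lockp)
{
	uintptr_t lock = (uintptr_t)lockp;
	unsigned int i;

	for (i = hash_lock(lock); ; i = (i + 1) & (HASH_SIZE - 1)) {
		if (atomic_load(&pv_hash_table[i].lock) == lock) {
			int cpu = pv_hash_table[i].cpu;

			atomic_store(&pv_hash_table[i].lock, 0);
			return cpu;	/* caller kicks this vCPU */
		}
	}
}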
Re: [PATCH 0/9] qspinlock stuff -v15
On Mon, Mar 16, 2015 at 02:16:13PM +0100, Peter Zijlstra wrote:
> Hi Waiman,
>
> As promised; here is the paravirt stuff I did during the trip to BOS last
> week.
>
> All the !paravirt patches are more or less the same as before (the only
> real change is the copyright lines in the first patch).
>
> The paravirt stuff is 'simple' and KVM only -- the Xen code was a little
> more convoluted and I've no real way to test that, but it should be
> straightforward to make work.
>
> I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
> overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually
> has) and it both booted and survived a hackbench run (perf bench sched
> messaging -g 20 -l 5000).
>
> So while the paravirt code isn't the most optimal code ever conceived, it
> does work.
>
> Also, the paravirt patching includes replacing the call with movb $0,
> %arg1 for the native case, which should greatly reduce the cost of having
> CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.

Ah nice. That could be spun out as a separate patch to optimize the
existing ticket locks, I presume.

Now with the old pv ticketlock code a vCPU would only go to sleep once and
be woken up when it was its turn. With this new code it is woken up twice
(and twice it goes to sleep). In an overcommit scenario this would imply
that we will have at least twice as many VMEXITs as with the previous code.

I presume when you did benchmarking this did not even register? Though I
wonder if it would if you ran the benchmark for a week or so.

> I feel that if someone were to do a Xen patch we can go ahead and merge
> this stuff (finally!).
>
> These patches do not implement the paravirt spinlock debug stats
> currently implemented (separately) by KVM and Xen, but that should not be
> too hard to do on top and in the 'generic' code -- no reason to duplicate
> all that.
>
> Of course; once this lands people can look at improving the paravirt
> nonsense.
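[The "movb $0, %arg1" remark concerns the native, bare-metal case: the
paravirt unlock call site is patched at boot into a single byte store to
the lock word, with the lock pointer in the first-argument register (%rdi
on x86-64), so CONFIG_PARAVIRT_SPINLOCKS costs native hardware essentially
nothing. The snippet below is a hedged sketch of what that native unlock
amounts to; the struct layout and function name only approximate the posted
patches, and it assumes the locked byte is the first byte of the lock word
(little-endian).]

#include <stdint.h>

struct qspinlock {
	uint32_t val;	/* low byte holds the locked bit */
};

/*
 * Native unlock: a release store of 0 to the locked byte.  With the
 * paravirt patching described above, the call at the unlock site is
 * rewritten into the equivalent "movb $0, (%rdi)" instruction, so no
 * call overhead remains on bare metal.
 */
static inline void native_queued_spin_unlock(struct qspinlock *lock)
{
	__atomic_store_n((uint8_t *)&lock->val, 0, __ATOMIC_RELEASE);
}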
Re: [PATCH 0/9] qspinlock stuff -v15
On 03/16/2015 09:16 AM, Peter Zijlstra wrote:
> Hi Waiman,
>
> As promised; here is the paravirt stuff I did during the trip to BOS last
> week.
>
> All the !paravirt patches are more or less the same as before (the only
> real change is the copyright lines in the first patch).
>
> The paravirt stuff is 'simple' and KVM only -- the Xen code was a little
> more convoluted and I've no real way to test that, but it should be
> straightforward to make work.
>
> I ran this using the virtme tool (thanks Andy) on my laptop with a 4x
> overcommit on vcpus (16 vcpus as compared to the 4 my laptop actually
> has) and it both booted and survived a hackbench run (perf bench sched
> messaging -g 20 -l 5000).
>
> So while the paravirt code isn't the most optimal code ever conceived, it
> does work.
>
> Also, the paravirt patching includes replacing the call with movb $0,
> %arg1 for the native case, which should greatly reduce the cost of having
> CONFIG_PARAVIRT_SPINLOCKS enabled on actual hardware.
>
> I feel that if someone were to do a Xen patch we can go ahead and merge
> this stuff (finally!).
>
> These patches do not implement the paravirt spinlock debug stats
> currently implemented (separately) by KVM and Xen, but that should not be
> too hard to do on top and in the 'generic' code -- no reason to duplicate
> all that.
>
> Of course; once this lands people can look at improving the paravirt
> nonsense.

Thanks for sending this out. I have no problem with the !paravirt patch. I
do have some comments on the paravirt one, which I will reply to
individually.

Cheers,
Longman