Re: sysbench throughput degradation in 4.13+
On 10/10/2017 07:26 PM, Ingo Molnar wrote:
>
> * Peter Zijlstra wrote:
>
>> On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
>>> On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
>>>> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
>>>> regress but performance is restored for sockets.
>>>>
>>>> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
>>>> patch. So I've queued that up now and it should be done by tomorrow.
>>>
>>> Yeah, netperf results look fine for either your NO_WA_WEIGHT or
>>> WA_WEIGHT patch.
>>>
>>> Any ETA on when this is going to tip?
>>
>> Just hit a few hours ago :-)
>
> I admit that time machines are really handy!
>
> Thanks,

Are we going to schedule this for 4.13 stable as well?
Re: sysbench throughput degradation in 4.13+
* Peter Zijlstra wrote:

> On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
> > On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> > >
> > > It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> > > regress but performance is restored for sockets.
> > >
> > > Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> > > patch. So I've queued that up now and it should be done by tomorrow.
> >
> > Yeah, netperf results look fine for either your NO_WA_WEIGHT or
> > WA_WEIGHT patch.
> >
> > Any ETA on when this is going to tip?
>
> Just hit a few hours ago :-)

I admit that time machines are really handy!

Thanks,

	Ingo
Re: sysbench throughput degradation in 4.13+
On Tue, Oct 10, 2017 at 03:51:37PM +0100, Matt Fleming wrote:
> On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
> >
> > It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> > regress but performance is restored for sockets.
> >
> > Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> > patch. So I've queued that up now and it should be done by tomorrow.
>
> Yeah, netperf results look fine for either your NO_WA_WEIGHT or
> WA_WEIGHT patch.
>
> Any ETA on when this is going to tip?

Just hit a few hours ago :-)
Re: sysbench throughput degradation in 4.13+
On Fri, 06 Oct, at 11:36:23AM, Matt Fleming wrote:
>
> It's a similar story for hackbench-threads-{pipes,sockets}, i.e. pipes
> regress but performance is restored for sockets.
>
> Of course, like a dope, I forgot to re-run netperf with your WA_WEIGHT
> patch. So I've queued that up now and it should be done by tomorrow.

Yeah, netperf results look fine for either your NO_WA_WEIGHT or
WA_WEIGHT patch.

Any ETA on when this is going to tip?
Re: sysbench throughput degradation in 4.13+
On Wed, 04 Oct, at 06:18:50PM, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> > over the weekend.
> >
> > The trivial thing regresses a wee bit on the overloaded case, I've not
> > yet tried to fix it.
>
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to lift
> the overloaded case (including hackbench).

My results (2 nodes, 12 cores/node, 2 threads/core) show that you've
pretty much restored hackbench performance to v4.12. However, it's a
regression against v4.13 for hackbench-process-pipes (I'm guessing the
v4.13 improvement is due to Rik's original patches).

The last two result columns are your latest patch with NO_WA_WEIGHT and
then with WA_WEIGHT enabled.

(I hope you've all got wide terminals)

hackbench-process-pipes
                  4.12.0    4.13.0      4.13.0    4.13.0                      4.13.0                   4.13.0
                 vanilla   vanilla  peterz-fix   rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean    1     1.1600 (  0.00%)   1.6037 (-38.25%)   1.0727 (  7.53%)   1.0200 ( 12.07%)   1.0837 (  6.58%)   1.1110 (  4.22%)
Amean    4     2.4207 (  0.00%)   2.2300 (  7.88%)   2.0520 ( 15.23%)   1.9483 ( 19.51%)   2.0623 ( 14.80%)   2.2807 (  5.78%)
Amean    7     5.4140 (  0.00%)   3.2027 ( 40.84%)   3.6100 ( 33.32%)   3.5620 ( 34.21%)   3.5590 ( 34.26%)   4.6573 ( 13.98%)
Amean    12    9.7130 (  0.00%)   4.7170 ( 51.44%)   6.5280 ( 32.79%)   6.2063 ( 36.10%)   6.5670 ( 32.39%)  10.5440 ( -8.56%)
Amean    21   11.6687 (  0.00%)   8.8073 ( 24.52%)  14.4997 (-24.26%)  10.2700 ( 11.99%)  14.3187 (-22.71%)  11.5417 (  1.09%)
Amean    30   14.6410 (  0.00%)  11.7003 ( 20.09%)  23.7660 (-62.32%)  13.9847 (  4.48%)  21.8957 (-49.55%)  14.4847 (  1.07%)
Amean    48   19.8780 (  0.00%)  17.0317 ( 14.32%)  37.6397 (-89.35%)  19.7577 (  0.61%)  39.2110 (-97.26%)  20.3293 ( -2.27%)
Amean    79   46.4200 (  0.00%)  27.1180 ( 41.58%)  58.4037 (-25.82%)  35.5537 ( 23.41%)  60.8957 (-31.18%)  49.7470 ( -7.17%)
Amean    110  57.7550 (  0.00%)  42.7013 ( 26.06%)  73.0483 (-26.48%)  48.8880 ( 15.35%)  77.8597 (-34.81%)  61.9353 ( -7.24%)
Amean    141  61.0490 (  0.00%)  48.0857 ( 21.23%)  98.5567 (-61.44%)  63.2187 ( -3.55%)  90.4857 (-48.22%)  68.3100 (-11.89%)
Amean    172  70.5180 (  0.00%)  59.3620 ( 15.82%) 122.5423 (-73.77%)  76.0197 ( -7.80%) 127.4023 (-80.67%)  75.8233 ( -7.52%)
Amean    192  76.1643 (  0.00%)  65.1613 ( 14.45%) 142.1393 (-86.62%)  91.4923 (-20.12%) 145.0663 (-90.46%)  80.5867 ( -5.81%)

But things look pretty good for hackbench-process-sockets:

hackbench-process-sockets
                  4.12.0    4.13.0      4.13.0    4.13.0                      4.13.0                   4.13.0
                 vanilla   vanilla  peterz-fix   rik-fix  peterz-fix-v2-no-wa-weight  peterz-fix-v2-wa-weight
Amean    1     0.9657 (  0.00%)   1.0850 ( -12.36%)   1.3737 ( -42.25%)   1.3093 ( -35.59%)   1.3220 ( -36.90%)   1.3937 ( -44.32%)
Amean    4     2.3040 (  0.00%)   3.3840 ( -46.88%)   2.1807 (   5.35%)   2.3010 (   0.13%)   2.2070 (   4.21%)   2.1770 (   5.51%)
Amean    7     4.5467 (  0.00%)   4.0787 (  10.29%)   5.0530 ( -11.14%)   3.7427 (  17.68%)   4.5517 (  -0.11%)   3.8560 (  15.19%)
Amean    12    5.7707 (  0.00%)   5.4440 (   5.66%)  10.5680 ( -83.13%)   7.7240 ( -33.85%)  10.5990 ( -83.67%)   5.9123 (  -2.45%)
Amean    21    8.9387 (  0.00%)   9.5850 (  -7.23%)  18.3103 (-104.84%)  10.9253 ( -22.23%)  18.1540 (-103.10%)   9.2627 (  -3.62%)
Amean    30   13.1243 (  0.00%)  14.0773 (  -7.26%)  25.6563 ( -95.49%)  15.7590 ( -20.07%)  25.6920 ( -95.76%)  14.6523 ( -11.64%)
Amean    48   25.1030 (  0.00%)  22.5233 (  10.28%)  40.5937 ( -61.71%)  24.6727 (   1.71%)  40.6357 ( -61.88%)  22.1807 (  11.64%)
Amean    79   39.9150 (  0.00%)  33.4220 (  16.27%)  66.3343 ( -66.19%)  40.2713 (  -0.89%)  65.8543 ( -64.99%)  35.3360 (  11.47%)
Amean    110  49.1700 (  0.00%)  46.1173 (   6.21%)  92.3153 ( -87.75%)  55.6567 ( -13.19%)  92.0567 ( -87.22%)  46.7280 (   4.97%)
Amean    141  59.3157 (  0.00%)  57.2670 (   3.45%) 118.5863 ( -99.92%)  70.4800 ( -18.82%) 118.6013 ( -99.95%)
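For reading these mmtests tables: the first column after the group size is
the 4.12.0 vanilla baseline, and each parenthesized percentage appears to
be the gain relative to that baseline, gain = (baseline - result) /
baseline * 100; e.g. for the pipes Amean-1 row, (1.1600 - 1.6037) / 1.1600
= -38.25%. Hackbench reports runtimes, so negative percentages mean slower.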
Re: sysbench throughput degradation in 4.13+
On Wed, 2017-10-04 at 18:18 +0200, Peter Zijlstra wrote:
> On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> > So I was waiting for Rik, who promised to run a bunch of NUMA
> > workloads over the weekend.
> >
> > The trivial thing regresses a wee bit on the overloaded case, I've
> > not yet tried to fix it.
>
> WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
> addition -- based on the old scheme -- that I've tried in order to
> lift the overloaded case (including hackbench).
>
> It's not an unconditional win, but I'm tempted to default enable
> WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

Enabling both makes sense to me. We have four cases to deal with:

- mostly idle system: in that case we don't really care, since
  select_idle_sibling will find an idle core anywhere

- partially loaded system (say 1/2 or 2/3 full): in that case WA_IDLE
  will be a great policy to help locate an idle CPU

- fully loaded system: in this case either policy works well

- overloaded system: in this case WA_WEIGHT seems to do the trick,
  assuming load balancing results in largely similar loads between
  cores inside each LLC

The big danger is affine wakeups messing up the balance the load
balancer works on, with the two mechanisms messing up each other's
placement. However, there seems to be very little we can actually do
about that without the unacceptable overhead of examining the
instantaneous loads of every CPU in an LLC -- otherwise we end up
either overshooting, or not taking advantage of idle CPUs, due to the
use of cached load values.

--
All rights reversed
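To make the four-case analysis above concrete, here is a minimal sketch of
how two independently toggleable wakeup heuristics can compose behind
scheduler feature bits; the helper names wake_affine_idle() and
wake_affine_weight() and the return convention are assumptions for
illustration, not the committed code:

	/*
	 * Sketch only: WA_IDLE covers the partially loaded case with a
	 * cheap "is one side idle?" test; WA_WEIGHT covers the overloaded
	 * case with a load comparison. Helper names are assumed.
	 */
	static int wake_affine(struct sched_domain *sd, struct task_struct *p,
			       int this_cpu, int prev_cpu, int sync)
	{
		bool affine = false;

		/* Partially loaded case: look for an idle target first. */
		if (sched_feat(WA_IDLE))
			affine = wake_affine_idle(sd, p, this_cpu, prev_cpu, sync);

		/* Overloaded case: fall back to a load-weight comparison. */
		if (sched_feat(WA_WEIGHT) && !affine)
			affine = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

		/* Neither heuristic fired: leave the task near its old CPU. */
		return affine ? this_cpu : prev_cpu;
	}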
Re: sysbench throughput degradation in 4.13+
On Tue, Oct 03, 2017 at 10:39:32AM +0200, Peter Zijlstra wrote:
> So I was waiting for Rik, who promised to run a bunch of NUMA workloads
> over the weekend.
>
> The trivial thing regresses a wee bit on the overloaded case, I've not
> yet tried to fix it.

WA_IDLE is my 'old' patch and what you all tested, WA_WEIGHT is the
addition -- based on the old scheme -- that I've tried in order to lift
the overloaded case (including hackbench).

It's not an unconditional win, but I'm tempted to default enable
WA_WEIGHT too (I've not done NO_WA_IDLE && WA_WEIGHT runs).

But let me first write a Changelog for the below and queue that. Then
we can maybe run more things..

On my IVB-EP (2 nodes, 10 cores/node, 2 threads/core):

WA_IDLE && NO_WA_WEIGHT:

 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 1' (10 runs):

       7.391856936 seconds time elapsed    ( +- 0.66% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1  : Avg: 54524.6
TCP_SENDFILE-10 : Avg: 48185.2
TCP_SENDFILE-20 : Avg: 29031.2
TCP_SENDFILE-40 : Avg: 9819.72
TCP_SENDFILE-80 : Avg: 5355.3
TCP_STREAM-1    : Avg: 41448.3
TCP_STREAM-10   : Avg: 24123.2
TCP_STREAM-20   : Avg: 15834.5
TCP_STREAM-40   : Avg: 5583.91
TCP_STREAM-80   : Avg: 2329.66
TCP_RR-1        : Avg: 80473.5
TCP_RR-10       : Avg: 72660.5
TCP_RR-20       : Avg: 52607.1
TCP_RR-40       : Avg: 57199.2
TCP_RR-80       : Avg: 25330.3
UDP_RR-1        : Avg: 108266
UDP_RR-10       : Avg: 95480
UDP_RR-20       : Avg: 68770.8
UDP_RR-40       : Avg: 76231
UDP_RR-80       : Avg: 34578.3
UDP_STREAM-1    : Avg: 64684.3
UDP_STREAM-10   : Avg: 52701.2
UDP_STREAM-20   : Avg: 30376.4
UDP_STREAM-40   : Avg: 15685.8
UDP_STREAM-80   : Avg: 8415.13
[ ok ] Stopping network benchmark server.
[....] Starting MySQL database server: mysqld
No directory, logging in with HOME=/. ok
  2: [30 secs]  transactions: 64057  (2135.17 per sec.)
  5: [30 secs]  transactions: 144295 (4809.68 per sec.)
 10: [30 secs]  transactions: 274768 (9158.59 per sec.)
 20: [30 secs]  transactions: 437140 (14570.70 per sec.)
 40: [30 secs]  transactions: 663949 (22130.56 per sec.)
 80: [30 secs]  transactions: 629927 (20995.56 per sec.)
[ ok ] Stopping MySQL database server: mysqld.
[ ok ] Starting PostgreSQL 9.4 database server: main.
  2: [30 secs]  transactions: 50389  (1679.58 per sec.)
  5: [30 secs]  transactions: 113934 (3797.69 per sec.)
 10: [30 secs]  transactions: 217606 (7253.22 per sec.)
 20: [30 secs]  transactions: 335021 (11166.75 per sec.)
 40: [30 secs]  transactions: 518355 (17277.28 per sec.)
 80: [30 secs]  transactions: 513424 (17112.44 per sec.)
[ ok ] Stopping PostgreSQL 9.4 database server: main.
Latency percentiles (usec)
        50.0th: 2
        75.0th: 3
        90.0th: 3
        95.0th: 3
        *99.0th: 3
        99.5000th: 3
        99.9000th: 4
        min=0, max=86
avg worker transfer: 190227.78 ops/sec 743.08KB/s
rps: 1004.94 p95 (usec) 6136 p99 (usec) 6152 p95/cputime 20.45% p99/cputime 20.51%
rps: 1052.58 p95 (usec) 7208 p99 (usec) 7224 p95/cputime 24.03% p99/cputime 24.08%
rps: 1076.40 p95 (usec) 7720 p99 (usec) 7736 p95/cputime 25.73% p99/cputime 25.79%
rps: 1100.27 p95 (usec) 8208 p99 (usec) 8208 p95/cputime 27.36% p99/cputime 27.36%
rps: 1147.96 p95 (usec) 9104 p99 (usec) 9136 p95/cputime 30.35% p99/cputime 30.45%
rps: 1171.78 p95 (usec) 9552 p99 (usec) 9552 p95/cputime 31.84% p99/cputime 31.84%
rps: 1220.04 p95 (usec) 12336 p99 (usec) 12336 p95/cputime 41.12% p99/cputime 41.12%
rps: 1243.82 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1243.88 p95 (usec) 14960 p99 (usec) 14992 p95/cputime 49.87% p99/cputime 49.97%
rps: 1266.39 p95 (usec) 227584 p99 (usec) 239360 p95/cputime 758.61% p99/cputime 797.87%
Latency percentiles (usec)
        50.0th: 62
        75.0th: 101
        90.0th: 108
        95.0th: 112
        *99.0th: 119
        99.5000th: 124
        99.9000th: 4920
        min=0, max=12987
Throughput 664.328 MB/sec   2 clients   2 procs  max_latency=0.076 ms
Throughput 1573.72 MB/sec   5 clients   5 procs  max_latency=0.102 ms
Throughput 2948.7 MB/sec   10 clients  10 procs  max_latency=0.198 ms
Throughput 4602.38 MB/sec  20 clients  20 procs  max_latency=1.712 ms
Throughput 9253.17 MB/sec  40 clients  40 procs  max_latency=2.047 ms
Throughput 8056.01 MB/sec  80 clients  80 procs  max_latency=35.819 ms

---

WA_IDLE && WA_WEIGHT:

 Performance counter stats for 'perf bench sched messaging -g 20 -t -l 1' (10 runs):

       6.500797532 seconds time elapsed    ( +- 0.97% )

[ ok ] Starting network benchmark server.
TCP_SENDFILE-1  : Avg: 52224.3
Re: sysbench throughput degradation in 4.13+
On Tue, 2017-10-03 at 10:39 +0200, Peter Zijlstra wrote:
> On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> > On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> > >
> > > I like the simplicity of your approach! I hope it does not break
> > > stuff like netperf...
> > >
> > > I have been working on the patch below, which is much less
> > > optimistic about when to do an affine wakeup than before.
> >
> > Running netperf for this patch and Peter's patch shows that Peter's
> > comes out on top, with scores pretty close to v4.12 in most places
> > on my 2-NUMA node 48-CPU Xeon box.
> >
> > I haven't dug any further into why v4.13-peterz+ is worse than
> > v4.12, but I will next week.
>
> So I was waiting for Rik, who promised to run a bunch of NUMA
> workloads over the weekend.
>
> The trivial thing regresses a wee bit on the overloaded case, I've
> not yet tried to fix it.

In Jirka's tests, your simple patch also came out on top.

--
All rights reversed
Re: sysbench throughput degradation in 4.13+
On Mon, Oct 02, 2017 at 11:53:12PM +0100, Matt Fleming wrote:
> On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
> >
> > I like the simplicity of your approach! I hope it does not break
> > stuff like netperf...
> >
> > I have been working on the patch below, which is much less optimistic
> > about when to do an affine wakeup than before.
>
> Running netperf for this patch and Peter's patch shows that Peter's
> comes out on top, with scores pretty close to v4.12 in most places on
> my 2-NUMA node 48-CPU Xeon box.
>
> I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
> but I will next week.

So I was waiting for Rik, who promised to run a bunch of NUMA workloads
over the weekend.

The trivial thing regresses a wee bit on the overloaded case, I've not
yet tried to fix it.
Re: sysbench throughput degradation in 4.13+
On Wed, 27 Sep, at 01:58:20PM, Rik van Riel wrote:
>
> I like the simplicity of your approach! I hope it does not break
> stuff like netperf...
>
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.

Running netperf for this patch and Peter's patch shows that Peter's
comes out on top, with scores pretty close to v4.12 in most places on
my 2-NUMA node 48-CPU Xeon box.

I haven't dug any further into why v4.13-peterz+ is worse than v4.12,
but I will next week.

netperf-tcp
                      4.12.0             4.13.0             4.13.0             4.13.0
                     default            default            peterz+              riel+
Min    64      1653.72 (  0.00%)   1554.29 ( -6.01%)   1601.71 ( -3.15%)   1627.01 ( -1.62%)
Min    128     3240.96 (  0.00%)   2858.49 (-11.80%)   3122.62 ( -3.65%)   3063.67 ( -5.47%)
Min    256     5840.55 (  0.00%)   4077.43 (-30.19%)   5529.98 ( -5.32%)   5362.18 ( -8.19%)
Min    1024   16812.11 (  0.00%)  11899.72 (-29.22%)  16335.83 ( -2.83%)  15075.24 (-10.33%)
Min    2048   26875.79 (  0.00%)  18852.35 (-29.85%)  25902.85 ( -3.62%)  22804.82 (-15.15%)
Min    3312   33060.18 (  0.00%)  20984.28 (-36.53%)  32817.82 ( -0.73%)  29161.49 (-11.79%)
Min    4096   34513.24 (  0.00%)  23253.94 (-32.62%)  34167.80 ( -1.00%)  29349.09 (-14.96%)
Min    8192   39836.88 (  0.00%)  28881.63 (-27.50%)  39613.28 ( -0.56%)  35307.95 (-11.37%)
Min    16384  44203.84 (  0.00%)  31616.74 (-28.48%)  43608.86 ( -1.35%)  38130.44 (-13.74%)
Hmean  64      1686.58 (  0.00%)   1613.25 ( -4.35%)   1657.04 ( -1.75%)   1655.38 ( -1.85%)
Hmean  128     3361.84 (  0.00%)   2945.34 (-12.39%)   3173.47 ( -5.60%)   3122.38 ( -7.12%)
Hmean  256     5993.92 (  0.00%)   4423.32 (-26.20%)   5618.26 ( -6.27%)   5523.72 ( -7.84%)
Hmean  1024   17225.83 (  0.00%)  12314.23 (-28.51%)  16574.85 ( -3.78%)  15644.71 ( -9.18%)
Hmean  2048   27944.22 (  0.00%)  21301.63 (-23.77%)  26395.38 ( -5.54%)  24067.57 (-13.87%)
Hmean  3312   33760.48 (  0.00%)  22361.07 (-33.77%)  33198.32 ( -1.67%)  30055.64 (-10.97%)
Hmean  4096   35077.74 (  0.00%)  29153.73 (-16.89%)  34479.40 ( -1.71%)  31215.64 (-11.01%)
Hmean  8192   40674.31 (  0.00%)  33493.01 (-17.66%)  40443.22 ( -0.57%)  37298.58 ( -8.30%)
Hmean  16384  45492.12 (  0.00%)  37177.64 (-18.28%)  44308.62 ( -2.60%)  40728.33 (-10.47%)
Max    64      1745.95 (  0.00%)   1649.03 ( -5.55%)   1710.52 ( -2.03%)   1702.65 ( -2.48%)
Max    128     3509.96 (  0.00%)   3082.35 (-12.18%)   3204.19 ( -8.71%)   3174.41 ( -9.56%)
Max    256     6138.35 (  0.00%)   4687.62 (-23.63%)   5694.52 ( -7.23%)   5722.08 ( -6.78%)
Max    1024   17732.13 (  0.00%)  13270.42 (-25.16%)  16838.69 ( -5.04%)  16580.18 ( -6.50%)
Max    2048   28907.99 (  0.00%)  24816.39 (-14.15%)  26792.86 ( -7.32%)  25003.60 (-13.51%)
Max    3312   34512.60 (  0.00%)  23510.32 (-31.88%)  33762.47 ( -2.17%)  31676.54 ( -8.22%)
Max    4096   35918.95 (  0.00%)  35245.77 ( -1.87%)  34866.23 ( -2.93%)  32537.07 ( -9.42%)
Max    8192   41749.55 (  0.00%)  41068.52 ( -1.63%)  42164.53 (  0.99%)  40105.50 ( -3.94%)
Max    16384  47234.74 (  0.00%)  41728.66 (-11.66%)  45387.40 ( -3.91%)  44107.25 ( -6.62%)
Re: sysbench throughput degradation in 4.13+
On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> @@ -5359,10 +5378,14 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  		unsigned long current_load = task_h_load(current);
>
>  		/* in this case load hits 0 and this LLC is considered 'idle' */
> -		if (current_load > this_stats.load)
> +		if (current_load > this_stats.max_load)
> +			return true;
> +
> +		/* allow if the CPU would go idle, regardless of LLC load */
> +		if (current_load >= target_load(this_cpu, sd->wake_idx))
>  			return true;
>
> -		this_stats.load -= current_load;
> +		this_stats.max_load -= current_load;
>  	}
>
>  	/*
> @@ -5375,10 +5398,6 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	if (prev_stats.has_capacity && prev_stats.nr_running < this_stats.nr_running+1)
>  		return false;
>
> -	/* if this cache has capacity, come here */
> -	if (this_stats.has_capacity && this_stats.nr_running+1 < prev_stats.nr_running)
> -		return true;
> -
>  	/*
>  	 * Check to see if we can move the load without causing too much
>  	 * imbalance.
> @@ -5391,8 +5410,8 @@ wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
>  	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
>  	prev_eff_load *= this_stats.capacity;
>
> -	this_eff_load *= this_stats.load + task_load;
> -	prev_eff_load *= prev_stats.load - task_load;
> +	this_eff_load *= this_stats.max_load + task_load;
> +	prev_eff_load *= prev_stats.min_load - task_load;
>
>  	return this_eff_load <= prev_eff_load;
>  }

So I would really like a workload that needs this LLC/NUMA stuff.

Because I much prefer the simpler: 'on which of these two CPUs can I
run soonest' approach.
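As a minimal sketch of that 'run soonest' idea (illustrative only, not the
actual patch; idle_cpu() and cpu_rq() are existing kernel helpers, the
function itself is assumed):

	/*
	 * Sketch: choose between the waking CPU and the task's previous
	 * CPU purely on "who can run me soonest"; select_idle_sibling()
	 * then refines the choice within the chosen LLC.
	 */
	static int wake_affine_soonest(int this_cpu, int prev_cpu, int sync)
	{
		/* An idle CPU can run the wakee immediately. */
		if (idle_cpu(this_cpu))
			return this_cpu;

		if (idle_cpu(prev_cpu))
			return prev_cpu;

		/*
		 * On a sync wakeup the waker is about to sleep; if it is
		 * the only runnable task, this_cpu goes idle right away.
		 */
		if (sync && cpu_rq(this_cpu)->nr_running == 1)
			return this_cpu;

		/* Otherwise stay put and keep cache affinity. */
		return prev_cpu;
	}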
Re: sysbench throughput degradation in 4.13+
On Wed, Sep 27, 2017 at 01:58:20PM -0400, Rik van Riel wrote:
> I like the simplicity of your approach! I hope it does not break
> stuff like netperf...

So the old approach that looks at the weight of the two CPUs behaves
slightly better in the overloaded case. On the threads==nr_cpus load
points they match fairly evenly.

I seem to have misplaced my netperf scripts, but I'll have a play with
it.
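For reference, a rough sketch of the weight-based comparison being
described, built on the source_load()/target_load()/task_h_load() helpers
that already exist in kernel/sched/fair.c; the function itself and the
exact biasing are assumptions for illustration, not the committed code:

	/*
	 * Sketch: pull the wakee only if the waking CPU, with the task's
	 * load added, still looks lighter than the previous CPU with that
	 * load removed.
	 */
	static bool wake_affine_weight_sketch(struct sched_domain *sd,
					      struct task_struct *p,
					      int this_cpu, int prev_cpu)
	{
		unsigned long task_load = task_h_load(p);
		unsigned long prev_load = source_load(prev_cpu, sd->wake_idx);
		s64 this_eff_load, prev_eff_load;

		/* The wakee's load moves with it: add here, subtract there. */
		this_eff_load = 100;
		this_eff_load *= target_load(this_cpu, sd->wake_idx) + task_load;

		prev_load -= min(prev_load, task_load);	/* avoid underflow */

		/* Bias prev by imbalance_pct so tiny deltas don't migrate. */
		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
		prev_eff_load *= prev_load;

		return this_eff_load <= prev_eff_load;
	}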
Re: sysbench throughput degradation in 4.13+
On 09/27/2017 01:58 PM, Rik van Riel wrote:
> On Wed, 27 Sep 2017 11:35:30 +0200
> Peter Zijlstra wrote:
>>
>> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>>
>>> MySQL. We've tried a few different configs with both test=oltp and
>>> test=threads, but both show the same behavior. What I have settled on
>>> for my repro is the following:
>>
>> Right, didn't even need to run it in a guest to observe a regression.
>>
>> So the below cures native sysbench and NAS bench for me, does it also
>> work for you virt thingy?
>>
>> PRE (current tip/master):
>>
>> ivb-ex sysbench:
>>
>>   2: [30 secs]  transactions: 64110  (2136.94 per sec.)
>>   5: [30 secs]  transactions: 143644 (4787.99 per sec.)
>>  10: [30 secs]  transactions: 274298 (9142.93 per sec.)
>>  20: [30 secs]  transactions: 418683 (13955.45 per sec.)
>>  40: [30 secs]  transactions: 320731 (10690.15 per sec.)
>>  80: [30 secs]  transactions: 355096 (11834.28 per sec.)
>>
>> hsw-ex NAS:
>>
>> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
>> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
>> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
>> lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
>> lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
>> lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
>>
>> POST (+patch):
>>
>> ivb-ex sysbench:
>>
>>   2: [30 secs]  transactions: 64494  (2149.75 per sec.)
>>   5: [30 secs]  transactions: 145114 (4836.99 per sec.)
>>  10: [30 secs]  transactions: 278311 (9276.69 per sec.)
>>  20: [30 secs]  transactions: 437169 (14571.60 per sec.)
>>  40: [30 secs]  transactions: 669837 (22326.73 per sec.)
>>  80: [30 secs]  transactions: 631739 (21055.88 per sec.)
>>
>> hsw-ex NAS:
>>
>> lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
>> lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
>> lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
>>
>> This patch takes out all the shiny wake_affine stuff and goes back to
>> utter basics. Rik was there another NUMA benchmark that wanted your
>> fancy stuff? Because NAS isn't it.
>
> I like the simplicity of your approach! I hope it does not break
> stuff like netperf...
>
> I have been working on the patch below, which is much less optimistic
> about when to do an affine wakeup than before.
>
> It may be worth testing, in case it works better with some workload,
> though relying on cached values still makes me somewhat uneasy.

Here are numbers for our environment, to compare the two patches:

sysbench --test=threads:
next-20170926:       25470.8
-with-Peters-patch:  29559.1
-with-Riks-patch:    29283

sysbench --test=oltp:
next-20170926:       5722.37
-with-Peters-patch:  9623.45
-with-Riks-patch:    9360.59

Didn't record host cpu migrations in all scenarios, but a spot check
showed a similar reduction in both patches.

 - Eric

> I will try to get kernels tested here that implement both approaches,
> to see what ends up working best.
>
> ---8<---
> Subject: sched: make wake_affine_llc less eager
>
> With the wake_affine_llc logic, tasks get moved around too eagerly,
> and then moved back later, leading to poor performance for some
> workloads.
>
> Make wake_affine_llc less eager by comparing the minimum load of
> the source LLC with the maximum load of the destination LLC, similar
> to how source_load and target_load work for regular migration.
>
> Also, get rid of an overly optimistic test that could potentially
> pull across a lot of tasks if the target LLC happened to have fewer
> runnable tasks at load balancing time.
>
> Conversely, sync wakeups could happen without taking LLC loads
> into account, if the waker would leave an idle CPU behind on
> the target LLC.
>
> Signed-off-by: Rik van Riel
> ---
>  include/linux/sched/topology.h |  3 ++-
>  kernel/sched/fair.c            | 56 +-
>  2 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index d7b6dab956ec..0c295ff5049b 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -77,7 +77,8 @@ struct sched_domain_shared {
>  	 * used by wake_affine().
>  	 */
>  	unsigned long	nr_running;
> -	unsigned long	load;
> +	unsigned long	min_load;
> +	unsigned long	max_load;
>  	unsigned long	capacity;
>  };
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 86195add977f..7740c6776e08 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,23 @@ static unsigned long target_load(int cpu, int
Re: sysbench throughput degradation in 4.13+
On 09/27/2017 06:27 PM, Eric Farman wrote:
>
> On 09/27/2017 05:35 AM, Peter Zijlstra wrote:
>> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>>>
>>> MySQL. We've tried a few different configs with both test=oltp and
>>> test=threads, but both show the same behavior. What I have settled on
>>> for my repro is the following:
>>>
>>
>> Right, didn't even need to run it in a guest to observe a regression.
>>
>> So the below cures native sysbench and NAS bench for me, does it also
>> work for you virt thingy?
[...]
> --test=oltp
> Baseline:  5655.26 transactions/second
> Patched:   9618.13 transactions/second
>
> --test=threads
> Baseline:  25482.9 events/sec
> Patched:   29577.9 events/sec
>
> That's good!  With that...
>
> Tested-by: Eric Farman
>
> Thanks!
>
> - Eric

Assuming that we settle on this or Rik's alternative patch: are we
going to schedule this for 4.13 stable as well?
Re: sysbench throughput degradation in 4.13+
On Wed, 27 Sep 2017 11:35:30 +0200
Peter Zijlstra wrote:

> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> >
> > MySQL. We've tried a few different configs with both test=oltp and
> > test=threads, but both show the same behavior. What I have settled on
> > for my repro is the following:
> >
>
> Right, didn't even need to run it in a guest to observe a regression.
>
> So the below cures native sysbench and NAS bench for me, does it also
> work for you virt thingy?
>
>
> PRE (current tip/master):
>
> ivb-ex sysbench:
>
>   2: [30 secs]  transactions: 64110  (2136.94 per sec.)
>   5: [30 secs]  transactions: 143644 (4787.99 per sec.)
>  10: [30 secs]  transactions: 274298 (9142.93 per sec.)
>  20: [30 secs]  transactions: 418683 (13955.45 per sec.)
>  40: [30 secs]  transactions: 320731 (10690.15 per sec.)
>  80: [30 secs]  transactions: 355096 (11834.28 per sec.)
>
> hsw-ex NAS:
>
> OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
> OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
> OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
> lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
> lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
> lu.C.x_threads_144_run_3.log: Time in seconds = 433.83
>
>
> POST (+patch):
>
> ivb-ex sysbench:
>
>   2: [30 secs]  transactions: 64494  (2149.75 per sec.)
>   5: [30 secs]  transactions: 145114 (4836.99 per sec.)
>  10: [30 secs]  transactions: 278311 (9276.69 per sec.)
>  20: [30 secs]  transactions: 437169 (14571.60 per sec.)
>  40: [30 secs]  transactions: 669837 (22326.73 per sec.)
>  80: [30 secs]  transactions: 631739 (21055.88 per sec.)
>
> hsw-ex NAS:
>
> lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
> lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
> lu.C.x_threads_144_run_3.log: Time in seconds = 22.52
>
>
> This patch takes out all the shiny wake_affine stuff and goes back to
> utter basics. Rik was there another NUMA benchmark that wanted your
> fancy stuff? Because NAS isn't it.

I like the simplicity of your approach! I hope it does not break
stuff like netperf...

I have been working on the patch below, which is much less optimistic
about when to do an affine wakeup than before.

It may be worth testing, in case it works better with some workload,
though relying on cached values still makes me somewhat uneasy.

I will try to get kernels tested here that implement both approaches,
to see what ends up working best.

---8<---
Subject: sched: make wake_affine_llc less eager

With the wake_affine_llc logic, tasks get moved around too eagerly,
and then moved back later, leading to poor performance for some
workloads.

Make wake_affine_llc less eager by comparing the minimum load of
the source LLC with the maximum load of the destination LLC, similar
to how source_load and target_load work for regular migration.

Also, get rid of an overly optimistic test that could potentially
pull across a lot of tasks if the target LLC happened to have fewer
runnable tasks at load balancing time.

Conversely, sync wakeups could happen without taking LLC loads
into account, if the waker would leave an idle CPU behind on
the target LLC.

Signed-off-by: Rik van Riel
---
 include/linux/sched/topology.h |  3 ++-
 kernel/sched/fair.c            | 56 +-
 2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..0c295ff5049b 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -77,7 +77,8 @@ struct sched_domain_shared {
 	 * used by wake_affine().
 	 */
 	unsigned long	nr_running;
-	unsigned long	load;
+	unsigned long	min_load;
+	unsigned long	max_load;
 	unsigned long	capacity;
 };

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86195add977f..7740c6776e08 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,23 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }

+static void min_max_load(int cpu, unsigned long *min_load,
+			 unsigned long *max_load)
+{
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long minl = ULONG_MAX;
+	unsigned long maxl = 0;
+	int i;
+
+	for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
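The diff is cut off above in the archive; purely as a reading aid, this is
one plausible shape for the min/max scan the changelog describes -- an
assumption on my part, not necessarily the actual patch body:

	/*
	 * Sketch: scan the rq->cpu_load[] history windows that
	 * source_load()/target_load() index, keeping the most optimistic
	 * (min) and most pessimistic (max) view of this CPU's load.
	 */
	static void min_max_load(int cpu, unsigned long *min_load,
				 unsigned long *max_load)
	{
		struct rq *rq = cpu_rq(cpu);
		unsigned long minl = ULONG_MAX;
		unsigned long maxl = 0;
		int i;

		for (i = 0; i < CPU_LOAD_IDX_MAX; i++) {
			minl = min(minl, rq->cpu_load[i]);
			maxl = max(maxl, rq->cpu_load[i]);
		}

		*min_load = minl;
		*max_load = maxl;
	}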
Re: sysbench throughput degradation in 4.13+
On 09/27/2017 05:35 AM, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
> >
> > MySQL. We've tried a few different configs with both test=oltp and
> > test=threads, but both show the same behavior. What I have settled on for
> > my repro is the following:
>
> Right, didn't even need to run it in a guest to observe a regression.
>
> So the below cures native sysbench and NAS bench for me, does it also
> work for you virt thingy?

Ran a quick test this morning with 4.13.0 + 90001d67be2f + a731ebe6f17b
and then with/without this patch. An oltp sysbench run shows that guest
cpu migrations decreased significantly, from ~27K to ~2K over 5 seconds.

So, we applied this patch to linux-next (next-20170926) and ran it
against a couple sysbench tests:

--test=oltp
Baseline:  5655.26 transactions/second
Patched:   9618.13 transactions/second

--test=threads
Baseline: 25482.9 events/sec
Patched:  29577.9 events/sec

That's good!  With that...

Tested-by: Eric Farman

Thanks!

 - Eric
Re: sysbench throughput degradation in 4.13+
On Fri, Sep 22, 2017 at 12:12:45PM -0400, Eric Farman wrote:
>
> MySQL. We've tried a few different configs with both test=oltp and
> test=threads, but both show the same behavior. What I have settled on for
> my repro is the following:
>

Right, didn't even need to run it in a guest to observe a regression.

So the below cures native sysbench and NAS bench for me, does it also
work for you virt thingy?


PRE (current tip/master):

ivb-ex sysbench:

   2: [30 secs]     transactions: 64110  (2136.94 per sec.)
   5: [30 secs]     transactions: 143644 (4787.99 per sec.)
  10: [30 secs]     transactions: 274298 (9142.93 per sec.)
  20: [30 secs]     transactions: 418683 (13955.45 per sec.)
  40: [30 secs]     transactions: 320731 (10690.15 per sec.)
  80: [30 secs]     transactions: 355096 (11834.28 per sec.)

hsw-ex NAS:

OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
lu.C.x_threads_144_run_1.log:               Time in seconds = 434.68
lu.C.x_threads_144_run_2.log:               Time in seconds = 405.36
lu.C.x_threads_144_run_3.log:               Time in seconds = 433.83


POST (+patch):

ivb-ex sysbench:

   2: [30 secs]     transactions: 64494  (2149.75 per sec.)
   5: [30 secs]     transactions: 145114 (4836.99 per sec.)
  10: [30 secs]     transactions: 278311 (9276.69 per sec.)
  20: [30 secs]     transactions: 437169 (14571.60 per sec.)
  40: [30 secs]     transactions: 669837 (22326.73 per sec.)
  80: [30 secs]     transactions: 631739 (21055.88 per sec.)

hsw-ex NAS:

lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
lu.C.x_threads_144_run_3.log: Time in seconds = 22.52


This patch takes out all the shiny wake_affine stuff and goes back to
utter basics. Rik was there another NUMA benchmark that wanted your
fancy stuff? Because NAS isn't it.

(the previous, slightly fancier wake_affine was basically a !idle
extension of this, in that it would pick the 'shortest' of the 2 queues
and thereby run quickest, in approximation)

I'll try and run a number of other benchmarks I have around to see if
there's anything that shows a difference between the below trivial
wake_affine and the old 2-cpu-load one.

---
 include/linux/sched/topology.h |   8 ---
 kernel/sched/fair.c            | 125 ++--------------------------------------
 2 files changed, 6 insertions(+), 127 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index d7b6dab956ec..7d065abc7a47 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -71,14 +71,6 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
-
-	/*
-	 * Some variables from the most recent sd_lb_stats for this domain,
-	 * used by wake_affine().
-	 */
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
 };

 struct sched_domain {

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70ba32e08a23..66930ce338af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5356,115 +5356,19 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }

-struct llc_stats {
-	unsigned long	nr_running;
-	unsigned long	load;
-	unsigned long	capacity;
-	int		has_capacity;
-};
-
-static bool get_llc_stats(struct llc_stats *stats, int cpu)
-{
-	struct sched_domain_shared *sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
-
-	if (!sds)
-		return false;
-
-	stats->nr_running	= READ_ONCE(sds->nr_running);
-	stats->load		= READ_ONCE(sds->load);
-	stats->capacity		= READ_ONCE(sds->capacity);
-	stats->has_capacity	= stats->nr_running < per_cpu(sd_llc_size, cpu);
-
-	return true;
-}
-
-/*
- * Can a task be moved from prev_cpu to this_cpu without causing a load
- * imbalance that would trigger the load balancer?
- *
- * Since we're running on 'stale' values, we might in fact create an imbalance
- * but recomputing these values is expensive, as that'd mean iteration 2 cache
- * domains worth of CPUs.
- */
-static bool
-wake_affine_llc(struct sched_domain *sd, struct task_struct *p,
-		int this_cpu, int prev_cpu, int sync)
-{
-	struct llc_stats prev_stats, this_stats;
-	s64 this_eff_load, prev_eff_load;
[...]
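The replacement wake_affine() is also truncated in this archive, but the
description above ("pick the 'shortest' of the 2 queues") is enough to
sketch the idea. The following is a self-contained illustration of a
two-CPU affine-wakeup test, not the literal patch: the names, the sync
discount for the waker, and the capacity cross-multiplication are
assumptions modeled on how the scheduler typically weighs load against
capacity.

#include <stdio.h>

/* Toy per-CPU view; stand-ins for weighted_cpuload() and capacity_of(). */
struct cpu_stats {
	unsigned long long load;
	unsigned long long capacity;
};

/*
 * Pull the wakee next to the waker only if, after moving the task's load
 * from prev to this CPU (and discounting the waker on a sync wakeup,
 * since it is about to sleep), the waker's CPU is no more loaded
 * relative to its capacity than the wakee's previous CPU.
 */
static int wake_affine_basic(const struct cpu_stats *this_cpu,
			     const struct cpu_stats *prev_cpu,
			     unsigned long long task_load,
			     unsigned long long waker_load, int sync)
{
	unsigned long long this_eff = this_cpu->load;
	unsigned long long prev_eff = prev_cpu->load;

	if (sync && this_eff >= waker_load)
		this_eff -= waker_load;

	this_eff += task_load;		/* wakee would run here... */
	if (prev_eff >= task_load)
		prev_eff -= task_load;	/* ...instead of there */

	/* Compare load/capacity ratios without dividing. */
	return this_eff * prev_cpu->capacity <= prev_eff * this_cpu->capacity;
}

int main(void)
{
	struct cpu_stats waker = { .load = 1024, .capacity = 1024 };
	struct cpu_stats prev  = { .load = 3072, .capacity = 1024 };

	/* Busy prev CPU, lightly loaded waker: the affine pull wins. */
	printf("pull to waker's CPU: %s\n",
	       wake_affine_basic(&waker, &prev, 512, 512, 1) ? "yes" : "no");
	return 0;
}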
Re: sysbench throughput degradation in 4.13+
On 09/22/2017 11:53 AM, Peter Zijlstra wrote:
> On Fri, Sep 22, 2017 at 11:03:39AM -0400, Eric Farman wrote:
> > Hi Peter, Rik,
> >
> > With OSS last week, I'm sure this got lost in the deluge, so here's a
> > friendly ping.
>
> Very much so, inbox is a giant trainwreck ;-)

My apologies.  :)

> > I picked up 4.14.0-rc1 earlier this week, and still see the
> > degradation described above. Not really a surprise, since I don't see
> > any other commits in this area beyond the ones I mentioned in my
> > original note.
> >
> > Anyway, I'm unsure what else to try or what doc to pull to help debug
> > this, and would appreciate your expertise here. We can repro this
> > pretty easily as necessary to help get to the bottom of this.
> >
> > Many thanks in advance,
>
> Could you describe your sysbench setup? Are you running it on mysql or
> postgresql, what other options?

MySQL. We've tried a few different configs with both test=oltp and
test=threads, but both show the same behavior. What I have settled on
for my repro is the following:

sudo sysbench --db-driver=mysql --mysql-host=localhost --mysql-user=root
  --mysql-db=test1 --max-time=180 --max-requests=1 --num-threads=8
  --test=oltp prepare

sudo sysbench --db-driver=mysql --mysql-host=localhost --mysql-user=root
  --mysql-db=test1 --max-time=180 --max-requests=1 --num-threads=8
  --test=oltp run

Have also some environments where multiple sysbench instances are run
concurrently (I've tried up to 8, with db's test1-8 being specified),
but doesn't appear to much matter either.

 - Eric
Re: sysbench throughput degradation in 4.13+
On Fri, Sep 22, 2017 at 11:03:39AM -0400, Eric Farman wrote:
> Hi Peter, Rik,
>
> With OSS last week, I'm sure this got lost in the deluge, so here's a
> friendly ping.

Very much so, inbox is a giant trainwreck ;-)

> I picked up 4.14.0-rc1 earlier this week, and still see the
> degradation described above. Not really a surprise, since I don't see any
> other commits in this area beyond the ones I mentioned in my original note.
>
> Anyway, I'm unsure what else to try or what doc to pull to help debug this,
> and would appreciate your expertise here. We can repro this pretty easily
> as necessary to help get to the bottom of this.
>
> Many thanks in advance,

Could you describe your sysbench setup? Are you running it on mysql or
postgresql, what other options?
Re: sysbench throughput degradation in 4.13+
On 09/13/2017 04:24 AM, 王金浦 wrote:
> 2017-09-12 16:14 GMT+02:00 Eric Farman:
> > Hi Peter, Rik,
> >
> > Running sysbench measurements in a 16CPU/30GB KVM guest on a 20CPU/40GB
> > s390x host, we noticed a throughput degradation (anywhere between 13%
> > and 40%, depending on test) when moving the host from kernel 4.12 to
> > 4.13.
> >
> > [...]
> >
> > [1] https://lkml.org/lkml/2017/9/6/196
>
> +cc: vcap...@pengaru.com
>
> He reported a performance degradation also on 4.13-rc7, it might be the
> same cause.
>
> Best,
> Jack

Hi Peter, Rik,

With OSS last week, I'm sure this got lost in the deluge, so here's a
friendly ping.

I picked up 4.14.0-rc1 earlier this week, and still see the degradation
described above. Not really a surprise, since I don't see any other
commits in this area beyond the ones I mentioned in my original note.

Anyway, I'm unsure what else to try or what doc to pull to help debug
this, and would appreciate your expertise here. We can repro this pretty
easily as necessary to help get to the bottom of this.

Many thanks in advance,

 - Eric

(also, +cc Matt to help when I'm out of office myself.)
Re: sysbench throughput degradation in 4.13+
2017-09-12 16:14 GMT+02:00 Eric Farman:
> Hi Peter, Rik,
>
> Running sysbench measurements in a 16CPU/30GB KVM guest on a 20CPU/40GB
> s390x host, we noticed a throughput degradation (anywhere between 13% and
> 40%, depending on test) when moving the host from kernel 4.12 to 4.13.
> The rest of the host and the entire guest remain unchanged; it is only
> the host kernel that changes. Bisecting the host kernel blames commit
> 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()").
>
> [...]
>
> [1] https://lkml.org/lkml/2017/9/6/196

+cc: vcap...@pengaru.com

He reported a performance degradation also on 4.13-rc7, it might be the
same cause.

Best,
Jack
sysbench throughput degradation in 4.13+
Hi Peter, Rik,

Running sysbench measurements in a 16CPU/30GB KVM guest on a 20CPU/40GB
s390x host, we noticed a throughput degradation (anywhere between 13% and
40%, depending on test) when moving the host from kernel 4.12 to 4.13.
The rest of the host and the entire guest remain unchanged; it is only
the host kernel that changes. Bisecting the host kernel blames commit
3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()").

Reverting 3fed382b46ba and 815abf5af45f ("sched/fair: Remove
effective_load()") from a clean 4.13.0 build erases the throughput
degradation and returns us to what we see in 4.12.0.

A little poking around points us to a fix/improvement to this, commit
90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING"), which
went in the 4.14 merge window and an unmerged fix [1] that corrects a
small error in that patch. Hopeful, since we were running
!NUMA_BALANCING, I applied these two patches to a clean 4.13.0 tree but
continue to see the performance degradation. Pulling current master or
linux-next shows no improvement lurking in the shadows.

Running perf stat on the host during the guest sysbench run shows a
significant increase in cpu-migrations over the 4.12.0 run. Abbreviated
examples follow:

# 4.12.0
# perf stat -p 11473 -- sleep 5
  62305.199305  task-clock (msec)  # 12.458 CPUs
       368,607  context-switches
         4,084  cpu-migrations
           416  page-faults

# 4.13.0
# perf stat -p 11444 -- sleep 5
  35892.653243  task-clock (msec)  #  7.176 CPUs
       249,251  context-switches
        56,850  cpu-migrations
           804  page-faults

# 4.13.0-revert-3fed382b46ba-and-815abf5af45f
# perf stat -p 11441 -- sleep 5
  62321.767146  task-clock (msec)  # 12.459 CPUs
       387,661  context-switches
         5,687  cpu-migrations
         1,652  page-faults

# 4.13.0-apply-90001d67be2f
# perf stat -p 11438 -- sleep 5
  48654.988291  task-clock (msec)  #  9.729 CPUs
       363,150  context-switches
        43,778  cpu-migrations
           641  page-faults

I'm not sure what doc to supply here and am unfamiliar with this code or
its recent changes, but I'd be happy to pull/try whatever is needed to
help debug things. Looking forward to hearing what I can do.

Thanks,
Eric

[1] https://lkml.org/lkml/2017/9/6/196
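For anyone reading the perf stat excerpts above: the trailing "CPUs"
figure is simply task-clock divided by the elapsed wall time, so the drop
from ~12.5 to ~7.2 busy CPUs, alongside a ~14x jump in cpu-migrations, is
the whole regression in two numbers. A small sketch of that arithmetic
follows; it assumes the wall time is exactly 5000 ms, which is why its
results differ slightly from perf's own figures (the sleep actually takes
a little longer than 5 seconds).

#include <stdio.h>

int main(void)
{
	const double wall_ms = 5000.0;	/* perf stat -p <pid> -- sleep 5 */

	/* task-clock values (msec) copied from the runs above */
	const double task_clock_4_12 = 62305.199305;
	const double task_clock_4_13 = 35892.653243;

	/* "CPUs" = task-clock / wall time */
	printf("4.12.0: %.3f CPUs\n", task_clock_4_12 / wall_ms);	/* ~12.461 */
	printf("4.13.0: %.3f CPUs\n", task_clock_4_13 / wall_ms);	/* ~7.179 */

	/* migration ratio between the two kernels */
	printf("cpu-migrations: %.1fx\n", 56850.0 / 4084.0);		/* ~13.9x */
	return 0;
}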