Re: [RFT][patch] Scheduling for HTT and not only
Hi,

2012/4/9 Alexander Motin <m...@freebsd.org>:
> [...] I have a strong feeling that while this test may be interesting for
> profiling, its results depend in the first place not on how fast the
> scheduler is, but on pipe capacity and other such things. Can somebody
> hint me what, except pipe capacity and the context switch to the unblocked
> receiver, prevents the sender from sending all its data in one batch and
> the receiver from then receiving it all in one batch? If different OSes
> have different policies there, I think the results could be incomparable.

Let me disagree with your conclusion. If OS A does a task in X seconds and
OS B does the same task in Y seconds, with Y > X, then OS B is simply not
performing well enough. A difference in the internal implementation of the
task cannot be waved away as an excuse for the results not being
comparable.

 - Arnaud
Re: [RFT][patch] Scheduling for HTT and not only
On 04/10/12 19:58, Arnaud Lacombe wrote:
> 2012/4/9 Alexander Motin <m...@freebsd.org>:
>> [...] If different OSes have different policies there, I think the
>> results could be incomparable.
>
> Let me disagree with your conclusion. If OS A does a task in X seconds
> and OS B does the same task in Y seconds, with Y > X, then OS B is simply
> not performing well enough. A difference in the internal implementation
> of the task cannot be waved away as an excuse for the results not being
> comparable.

Sure, numbers are always numbers, but the question is what they are
showing. Understanding the test results is even more important for purely
synthetic tests like this one, especially when one run gives 25 seconds
while another gives 50. This test is not completely clear to me, and that
is all I said.

--
Alexander Motin
Re: [RFT][patch] Scheduling for HTT and not only
On 04/10/12 20:18, Alexander Motin wrote:
> [...] Sure, numbers are always numbers, but the question is what they
> are showing. Understanding the test results is even more important for
> purely synthetic tests like this one, especially when one run gives 25
> seconds while another gives 50. This test is not completely clear to me,
> and that is all I said.

A small illustration of my point. Simple scheduler tuning affects the
thread preemption policy and changes the results of this test by a factor
of three:

mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 9.568
mav@test:/test/hackbench# sysctl kern.sched.interact=0
kern.sched.interact: 30 -> 0
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 5.163
mav@test:/test/hackbench# sysctl kern.sched.interact=100
kern.sched.interact: 0 -> 100
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 3.190

I think the tuning shifts the balance between pipe latency and bandwidth,
while the test measures only the latter. Clearly, the conclusion drawn
from these numbers depends on what we want to have.

--
Alexander Motin
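For context, the pattern hackbench exercises is roughly the following.
This is a minimal sketch of its pipe mode written for illustration, not
hackbench's actual code (hackbench forks many sender/receiver groups; the
message size and count below are made-up parameters). Once the pipe's
in-kernel buffer fills, the sender blocks and the kernel must switch to
the receiver, which is why pipe capacity and switch policy dominate the
measured time:

#include <sys/types.h>
#include <sys/wait.h>
#include <err.h>
#include <string.h>
#include <unistd.h>

#define MSG_SIZE	100	/* hackbench-like small message */
#define MSG_COUNT	10000

int
main(void)
{
	char buf[MSG_SIZE];
	pid_t pid;
	int fds[2], i;

	if (pipe(fds) == -1)
		err(1, "pipe");
	if ((pid = fork()) == -1)
		err(1, "fork");
	if (pid == 0) {			/* receiver: drain the pipe */
		close(fds[1]);
		for (i = 0; i < MSG_COUNT; i++)
			if (read(fds[0], buf, sizeof(buf)) <= 0)
				break;
		_exit(0);
	}
	/* Sender: push everything as fast as the pipe will accept it.
	 * Each write() blocks once the pipe buffer is full. */
	close(fds[0]);
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < MSG_COUNT; i++)
		if (write(fds[1], buf, sizeof(buf)) == -1)
			err(1, "write");
	close(fds[1]);
	waitpid(pid, NULL, 0);
	return (0);
}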
Re: [RFT][patch] Scheduling for HTT and not only
Hi,

On Tue, Apr 10, 2012 at 1:53 PM, Alexander Motin <m...@freebsd.org> wrote:
> [...] A small illustration of my point. Simple scheduler tuning affects
> the thread preemption policy and changes the results of this test by a
> factor of three:
> [...]
> I think the tuning shifts the balance between pipe latency and bandwidth,
> while the test measures only the latter. Clearly, the conclusion drawn
> from these numbers depends on what we want to have.

I don't really care about this point; I am only testing default values,
or more precisely, whatever values the developers thought would make good
defaults. Btw, you are testing 3 different configurations, so different
results are expected. What worries me more is the rather huge instability
on the *same* configuration: say, a pipe/thread/70 groups/600 iterations
run, where results range from 2.7s[0] to 7.4s, or a socket/thread/20
groups/1400 iterations run, where results range from 2.4s to 4.5s.

 - Arnaud

[0]: numbers extracted from a recent run of 9.0-RELEASE on a Xeon E5-1650
platform.
Re: [RFT][patch] Scheduling for HTT and not only
On 04/10/12 21:46, Arnaud Lacombe wrote:
> [...] What worries me more is the rather huge instability on the *same*
> configuration: say, a pipe/thread/70 groups/600 iterations run, where
> results range from 2.7s[0] to 7.4s, or a socket/thread/20 groups/1400
> iterations run, where results range from 2.4s to 4.5s.

For the reason I pointed out in my first message, this test is
_extremely_ sensitive to the context switch interval. The more
aggressively the scheduler switches threads, the smaller the pipe latency
will be, but the smaller the bandwidth will be as well. During a test run
the scheduler constantly recalculates the interactivity index for each
thread, trying to balance between latency and switching overhead. With
hundreds of threads running simultaneously and interfering with each
other, that is a rather unpredictable process.

--
Alexander Motin
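For reference, the interactivity index mentioned above follows roughly
this shape. This is a simplified sketch of the idea behind ULE's scoring,
not the kernel's exact code; the constants and helper names here are
illustrative, and the kern.sched.interact sysctl acts as the threshold
dividing "interactive" from "batch" threads, which is why changing it
shifts the latency/bandwidth balance in the runs shown earlier:

#include <stdio.h>

#define INTERACT_MAX	100
#define INTERACT_HALF	(INTERACT_MAX / 2)

static unsigned long
ulmax(unsigned long a, unsigned long b)
{
	return (a > b ? a : b);
}

/* Threads that sleep more than they run score low (interactive,
 * favored for low latency); threads that run more than they sleep
 * score high (batch). */
static int
interact_score(unsigned long runtime, unsigned long sleeptime)
{
	unsigned long div;

	if (runtime > sleeptime) {	/* mostly running: batch-like */
		div = ulmax(1, runtime / INTERACT_HALF);
		return (INTERACT_HALF + (INTERACT_HALF - sleeptime / div));
	}
	if (sleeptime > runtime) {	/* mostly sleeping: interactive */
		div = ulmax(1, sleeptime / INTERACT_HALF);
		return (runtime / div);
	}
	return (runtime != 0 ? INTERACT_HALF : 0);
}

int
main(void)
{
	printf("batch (run 90, sleep 10): %d\n", interact_score(90, 10));
	printf("interactive (run 10, sleep 90): %d\n", interact_score(10, 90));
	return (0);
}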
Re: [RFT][patch] Scheduling for HTT and not only
On 04/05/12 21:45, Alexander Motin wrote:
> On 05.04.2012 21:12, Arnaud Lacombe wrote:
>> [...] You may want to have a look at the results I collected in the
>> `runs/freebsd-experiments' branch of:
>>
>> https://github.com/lacombar/hackbench/
>>
>> and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
>> available in `runs/freebsd'. On the dual-package platform, your patch
>> is not a definite win.
>
> [...] While this test seems very scheduler-intensive, it may depend on
> many other factors, such as syscall performance, context switches, etc.
> I'll try to play with it more.

My profiling on an 8-core Core i7 system shows that the sched_ule.c code,
while at the top of the profile, still consumes only 13% of kernel CPU
time, while doing a million context switches per second. cpu_search(),
affected by this patch, takes even less -- only 8%. The rest of the time
is spread across many other small functions. I did some optimizations in
r234066 to reduce cpu_search() time to 6%, but looking at how unstable
the results of this test are, hardly any difference there can really be
measured by it.

I have a strong feeling that while this test may be interesting for
profiling, its results depend in the first place not on how fast the
scheduler is, but on pipe capacity and other such things. Can somebody
hint me what, except pipe capacity and the context switch to the
unblocked receiver, prevents the sender from sending all its data in one
batch and the receiver from then receiving it all in one batch? If
different OSes have different policies there, I think the results could
be incomparable.

--
Alexander Motin
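One way to pin down the pipe-capacity part of that question is to measure
it empirically. The sketch below uses a standard POSIX technique
(non-blocking writes until EAGAIN) and is not code from the thread; the
resulting byte count is exactly how much a sender can push before it must
block and yield to the receiver:

#include <err.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	int fds[2];
	char byte = 0;
	long total = 0;

	if (pipe(fds) == -1)
		err(1, "pipe");
	/* Make the write end non-blocking so a full buffer returns
	 * EAGAIN instead of putting us to sleep. */
	if (fcntl(fds[1], F_SETFL, O_NONBLOCK) == -1)
		err(1, "fcntl");
	while (write(fds[1], &byte, 1) == 1)
		total++;
	if (errno != EAGAIN)
		err(1, "write");
	printf("pipe capacity: %ld bytes\n", total);
	return (0);
}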
Re: [RFT][patch] Scheduling for HTT and not only
On 5 April 2012 at 19:12, Arnaud Lacombe <lacom...@gmail.com> wrote:
> [...] You may want to have a look at the results I collected in the
> `runs/freebsd-experiments' branch of:
>
> https://github.com/lacombar/hackbench/
>
> and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
> available in `runs/freebsd'. On the dual-package platform, your patch is
> not a definite win.
>
>> But in some cases, especially for multi-socket systems, to let it show
>> its best, you may want to apply an additional patch from avg@ to better
>> detect the CPU topology:
>> https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd
>
> The test I conducted specifically for this patch did not show much
> improvement...

Can you please clarify this point? Did the test you ran compare cases
where the topology was detected badly against cases where it was detected
correctly on a patched kernel (and you still didn't see a performance
improvement), in terms of cache line sharing?

Attilio

--
Peace can only be achieved by understanding - A. Einstein
Re: [RFT][patch] Scheduling for HTT and not only
On 04/06/12 17:13, Attilio Rao wrote:
> [...] Can you please clarify this point? Did the test you ran compare
> cases where the topology was detected badly against cases where it was
> detected correctly on a patched kernel (and you still didn't see a
> performance improvement), in terms of cache line sharing?

At this moment SCHED_ULE does almost nothing in terms of cache line
sharing affinity (though it is probably worth some further experiments).
What this patch may improve is the opposite case -- reducing cache
sharing pressure for cache-hungry applications. For example, proper cache
topology detection (such as the lack of a global L3 cache, but a shared
L2 per pair of cores on Core2Quad-class CPUs) increases pbzip2
performance when the number of threads is less than the number of CPUs
(i.e. when there is room for optimization).

--
Alexander Motin
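As an aside, the topology the scheduler actually detected can be
inspected at run time: with SCHED_ULE, FreeBSD exports it as an XML
string through the kern.sched.topology_spec sysctl (also readable from
the shell with `sysctl kern.sched.topology_spec'). A small sketch that
dumps it:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	char *buf;
	size_t len = 0;

	/* First call with a NULL buffer asks for the required size. */
	if (sysctlbyname("kern.sched.topology_spec", NULL, &len,
	    NULL, 0) == -1)
		err(1, "sysctlbyname");
	if ((buf = malloc(len)) == NULL)
		err(1, "malloc");
	if (sysctlbyname("kern.sched.topology_spec", buf, &len,
	    NULL, 0) == -1)
		err(1, "sysctlbyname");
	printf("%s\n", buf);
	free(buf);
	return (0);
}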
Re: [RFT][patch] Scheduling for HTT and not only
On 6 April 2012 at 15:27, Alexander Motin <m...@freebsd.org> wrote:
> [...] At this moment SCHED_ULE does almost nothing in terms of cache
> line sharing affinity (though it is probably worth some further
> experiments). What this patch may improve is the opposite case --
> reducing cache sharing pressure for cache-hungry applications. For
> example, proper cache topology detection (such as the lack of a global
> L3 cache, but a shared L2 per pair of cores on Core2Quad-class CPUs)
> increases pbzip2 performance when the number of threads is less than the
> number of CPUs (i.e. when there is room for optimization).

My question did not really refer to your patch. I just wanted to know
whether he correctly benchmarked a case where the topology was screwed up
and then correctly recognized by avg's patch, in terms of cache level
aggregation (it wasn't about your patch, btw).

Attilio

--
Peace can only be achieved by understanding - A. Einstein
Re: [RFT][patch] Scheduling for HTT and not only
On 04/06/12 17:30, Attilio Rao wrote:
> [...] My question did not really refer to your patch. I just wanted to
> know whether he correctly benchmarked a case where the topology was
> screwed up and then correctly recognized by avg's patch, in terms of
> cache level aggregation (it wasn't about your patch, btw).

I understand. I've just described a test case in which properly detected
topology could give a benefit. What the test really does is indeed a good
question.

--
Alexander Motin
Re: [RFT][patch] Scheduling for HTT and not only
Hi,

[Sorry for the delay, I got a bit sidetrack'ed...]

2012/2/17 Alexander Motin <m...@freebsd.org>:
> [...] All my scheduler patches are cumulative, so all you need is only
> the last one mentioned here, sched.htt40.patch.

You may want to have a look at the results I collected in the
`runs/freebsd-experiments' branch of:

https://github.com/lacombar/hackbench/

and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
available in `runs/freebsd'. On the dual-package platform, your patch is
not a definite win.

> But in some cases, especially for multi-socket systems, to let it show
> its best, you may want to apply an additional patch from avg@ to better
> detect the CPU topology:
> https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd

The test I conducted specifically for this patch did not show much
improvement...

 - Arnaud
Re: [RFT][patch] Scheduling for HTT and not only
On 05.04.2012 21:12, Arnaud Lacombe wrote:
> [...] You may want to have a look at the results I collected in the
> `runs/freebsd-experiments' branch of:
>
> https://github.com/lacombar/hackbench/
>
> and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
> available in `runs/freebsd'. On the dual-package platform, your patch is
> not a definite win. [...] The test I conducted specifically for this
> patch did not show much improvement...

If I understand correctly, this test runs thousands of threads sending
and receiving data over pipes. It is quite likely that all CPUs will
always be busy, so load balancing is not really important in this test.
What looks good is that the more complicated new code is not slower than
the old one. While this test seems very scheduler-intensive, it may
depend on many other factors, such as syscall performance, context
switches, etc. I'll try to play with it more.

--
Alexander Motin
Re: [RFT][patch] Scheduling for HTT and not only
On 02/15/12 21:54, Jeff Roberson wrote:
> On Wed, 15 Feb 2012, Alexander Motin wrote:
>> I've decided to stop those cache black magic practices and focus on
>> things that really exist in this world -- SMT and CPU load. I've
>> dropped most of the cache-related things from the patch and made the
>> rest of the things more strict and predictable:
>> http://people.freebsd.org/~mav/sched.htt34.patch
>
> This looks great. I think there is value in considering the other
> approach further, but I would like to do this part first. It would be
> nice to also add priority as a greater influence in the load balancing
> as well.

I haven't got a good idea yet about balancing priorities, but I've
rewritten the balancer itself. Since sched_lowest()/sched_highest() are
more intelligent now, they made it possible to remove the topology
traversal from the balancer itself. That should fix the double-swapping
problem, allow some affinity to be kept while moving threads, and make
balancing more fair.

I did a number of tests running 4, 8, 9 and 16 CPU-bound threads on 8
CPUs. With 4, 8 and 16 threads everything is stationary, as it should be.
With 9 threads I see regular and random load movement between all 8 CPUs.
Measurements over a 5-minute run show a deviation of only about 5
seconds. It is the same deviation as I see caused merely by scheduling 16
threads on 8 cores without any balancing needed at all. So I believe this
code works as it should.

Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be the final patch of this series (more to come :)) and if
there are no problems or objections, I am going to commit it (except some
debugging KTRs) in about ten days. So now is a good time for reviews and
testing. :)

--
Alexander Motin
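To make the scheme concrete, here is a toy illustration of the balancing
idea described above. The selector names are borrowed from the message,
but their bodies, the load metric and the migration threshold are all
invented for illustration; this is not the sched_ule.c code. Instead of
traversing the topology inside the balancer, two selector functions pick
the busiest and the least busy CPU, and one thread is migrated between
them:

#include <stdio.h>

#define NCPU 8

static int load[NCPU] = { 3, 1, 2, 5, 1, 2, 1, 2 };

static int
sched_highest(void)
{
	int i, best = 0;

	for (i = 1; i < NCPU; i++)
		if (load[i] > load[best])
			best = i;
	return (best);
}

static int
sched_lowest(void)
{
	int i, best = 0;

	for (i = 1; i < NCPU; i++)
		if (load[i] < load[best])
			best = i;
	return (best);
}

int
main(void)
{
	int from, to;

	/* One balancer pass: migrate only if it reduces imbalance. */
	from = sched_highest();
	to = sched_lowest();
	if (load[from] - load[to] >= 2) {
		load[from]--;	/* stand-in for moving one thread */
		load[to]++;
		printf("moved a thread from CPU %d to CPU %d\n", from, to);
	}
	return (0);
}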
Re: [RFT][patch] Scheduling for HTT and not only
On 17.02.2012 18:53, Arnaud Lacombe wrote:
> On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin <m...@freebsd.org>
> wrote:
>> [...] Here is the patch:
>> http://people.freebsd.org/~mav/sched.htt40.patch
>>
>> I plan this to be the final patch of this series (more to come :)) and
>> if there are no problems or objections, I am going to commit it (except
>> some debugging KTRs) in about ten days. So now is a good time for
>> reviews and testing. :)
>
> is there a place where all the patches are available?

All my scheduler patches are cumulative, so all you need is only the last
one mentioned here, sched.htt40.patch. But in some cases, especially for
multi-socket systems, to let it show its best, you may want to apply an
additional patch from avg@ to better detect the CPU topology:
https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd

> I intend to run some tests on a 1x2x2 (Atom D510), 1x4x1 (Core 2 Quad),
> and eventually a 2x8x2 platform, against r231573. Results should
> hopefully be available by the end of the weekend / middle of next
> week[0].
>
> [0]: the D510 will likely be testing a couple of Linux kernels over the
> weekend, and a FreeBSD run takes about 2.5 days to complete.

--
Alexander Motin
Re: [RFT][patch] Scheduling for HTT and not only
Hi,

On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin <m...@freebsd.org> wrote:
> [...] Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch
>
> I plan this to be the final patch of this series (more to come :)) and
> if there are no problems or objections, I am going to commit it (except
> some debugging KTRs) in about ten days. So now is a good time for
> reviews and testing. :)

is there a place where all the patches are available?

I intend to run some tests on a 1x2x2 (Atom D510), 1x4x1 (Core 2 Quad),
and eventually a 2x8x2 platform, against r231573. Results should
hopefully be available by the end of the weekend / middle of next week[0].

 - Arnaud

[0]: the D510 will likely be testing a couple of Linux kernels over the
weekend, and a FreeBSD run takes about 2.5 days to complete.