Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Arnaud Lacombe
Hi,

2012/4/9 Alexander Motin m...@freebsd.org:
 [...]

 I have a strong feeling that while this test may be interesting for profiling,
 its results depend in the first place not on how fast the scheduler is, but
 on pipe capacity and other similar things. Can somebody hint me what, except
 pipe capacity and the context switch to the unblocked receiver, prevents the
 sender from sending all data in one batch and then the receiver from receiving
 it all in one batch? If different OSes have different policies there, I think
 the results could be incomparable.

Let me disagree with your conclusion. If OS A does a task in X seconds,
and OS B does the same task in Y seconds, and Y > X, then OS B is simply
not performing well enough. A difference in the internal implementation
of the task cannot be waived as an excuse against comparing the results.

 - Arnaud


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Alexander Motin

On 04/10/12 19:58, Arnaud Lacombe wrote:

2012/4/9 Alexander Motin m...@freebsd.org:

[...]


Let me disagree with your conclusion. If OS A does a task in X seconds,
and OS B does the same task in Y seconds, and Y > X, then OS B is simply
not performing well enough. A difference in the internal implementation
of the task cannot be waived as an excuse against comparing the results.


Sure, numbers are always numbers, but the question is what they are 
showing. Understanding the test results is even more important for 
purely synthetic tests like this, especially when one test run gives 25 
seconds while another gives 50. This test is not completely clear to me, 
and that is what I said.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Alexander Motin

On 04/10/12 20:18, Alexander Motin wrote:

[...]


Sure, numbers are always numbers, but the question is what they are
showing. Understanding the test results is even more important for
purely synthetic tests like this, especially when one test run gives 25
seconds while another gives 50. This test is not completely clear to me,
and that is what I said.


A small illustration of my point: simple scheduler tuning affects the thread 
preemption policy and changes this test's results threefold:


mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 9.568

mav@test:/test/hackbench# sysctl kern.sched.interact=0
kern.sched.interact: 30 -> 0
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 5.163

mav@test:/test/hackbench# sysctl kern.sched.interact=100
kern.sched.interact: 0 -> 100
mav@test:/test/hackbench# ./hackbench 30 process 1000
Running with 30*40 (== 1200) tasks.
Time: 3.190

I think it affects the balance between pipe latency and bandwidth, while 
the test measures only the latter. It is clear that the conclusion drawn 
from these numbers depends on what we want to have.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Arnaud Lacombe
Hi,

On Tue, Apr 10, 2012 at 1:53 PM, Alexander Motin m...@freebsd.org wrote:
 [...]

 A small illustration of my point: simple scheduler tuning affects the thread
 preemption policy and changes this test's results threefold:

 mav@test:/test/hackbench# ./hackbench 30 process 1000
 Running with 30*40 (== 1200) tasks.
 Time: 9.568

 mav@test:/test/hackbench# sysctl kern.sched.interact=0
 kern.sched.interact: 30 -> 0
 mav@test:/test/hackbench# ./hackbench 30 process 1000
 Running with 30*40 (== 1200) tasks.
 Time: 5.163

 mav@test:/test/hackbench# sysctl kern.sched.interact=100
 kern.sched.interact: 0 -> 100
 mav@test:/test/hackbench# ./hackbench 30 process 1000
 Running with 30*40 (== 1200) tasks.
 Time: 3.190

 I think it affects the balance between pipe latency and bandwidth, while the
 test measures only the latter. It is clear that the conclusion drawn from
 these numbers depends on what we want to have.

I don't really care about this point; I'm only testing default values, or
more precisely, whatever developers thought good default values would be.

Btw, you are testing 3 different configurations, so different results are
expected. What worries me more is the huge instability within the *same*
configuration, say on a pipe/thread/70 groups/600 iterations run, where
results range from 2.7s[0] to 7.4s, or a socket/thread/20 groups/1400
iterations run, where results range from 2.4s to 4.5s.

 - Arnaud

[0]: numbers extracted from a recent run of 9.0-RELEASE on a Xeon
E5-1650 platform.


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-10 Thread Alexander Motin

On 04/10/12 21:46, Arnaud Lacombe wrote:

[...]


I don't really care about this point; I'm only testing default values, or
more precisely, whatever developers thought good default values would be.

Btw, you are testing 3 different configurations, so different results are
expected. What worries me more is the huge instability within the *same*
configuration, say on a pipe/thread/70 groups/600 iterations run, where
results range from 2.7s[0] to 7.4s, or a socket/thread/20 groups/1400
iterations run, where results range from 2.4s to 4.5s.


For the reason I pointed out in my first message, this test is _extremely_ 
sensitive to the context switch interval. The more aggressively the 
scheduler switches threads, the lower the pipe latency will be, but the 
lower the bandwidth will be as well. During the test run the scheduler 
constantly recalculates the interactivity index for each thread, trying to 
balance between latency and switching overhead. With hundreds of threads 
running simultaneously and interfering with each other, it is quite an 
unpredictable process.
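
For reference, the scoring I mean lives in sched_interact_score() in 
sys/kern/sched_ule.c. A simplified sketch of its general shape follows; 
the constants and edge cases of the real function differ, this is 
illustrative only:

#define	SCHED_INTERACT_MAX	100
#define	SCHED_INTERACT_HALF	(SCHED_INTERACT_MAX / 2)

/*
 * Threads that sleep more than they run score low (interactive);
 * threads that run more than they sleep score high (batch).  The
 * kern.sched.interact sysctl is the threshold between the classes.
 */
static int
interact_score(unsigned long runtime, unsigned long slptime)
{
	unsigned long div;

	if (runtime > slptime) {
		div = runtime / SCHED_INTERACT_HALF;
		if (div == 0)
			div = 1;
		return (SCHED_INTERACT_HALF +
		    (SCHED_INTERACT_HALF - slptime / div));
	}
	if (slptime > runtime) {
		div = slptime / SCHED_INTERACT_HALF;
		if (div == 0)
			div = 1;
		return (runtime / div);
	}
	/* Equal (and nonzero) run and sleep time sits on the boundary. */
	return (runtime != 0 ? SCHED_INTERACT_HALF : 0);
}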


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-09 Thread Alexander Motin

On 04/05/12 21:45, Alexander Motin wrote:

[...]



The test I conducted specifically for this patch did not show much
improvement...


If I understand right, this test runs thousands of threads sending and 
receiving data over pipes. It is quite likely that all CPUs will always 
be busy, so load balancing is not really important in this test. What 
looks good is that the more complicated new code is not slower than the 
old one.

While this test seems very scheduler-intensive, it may depend on many 
other factors, such as syscall performance, context switches, etc. I'll 
try to play more with it.


My profiling on an 8-core Core i7 system shows that the sched_ule.c code 
at the top of the profile still consumes only 13% of kernel CPU time, 
while doing a million context switches per second; cpu_search(), affected 
by this patch, even less -- only 8%. The rest of the time is spread across 
many other small functions. I did some optimizations in r234066 to reduce 
cpu_search() time to 6%, but looking at how unstable the results of this 
test are, hardly any difference there can really be measured by it.


I have a strong feeling that while this test may be interesting for 
profiling, its results depend in the first place not on how fast the 
scheduler is, but on pipe capacity and other similar things. Can somebody 
hint me what, except pipe capacity and the context switch to the unblocked 
receiver, prevents the sender from sending all data in one batch and then 
the receiver from receiving it all in one batch? If different OSes have 
different policies there, I think the results could be incomparable.
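
To make the pipe capacity part of the question concrete, here is a minimal 
sketch (not the benchmark code itself) that measures how much a sender can 
push into an empty pipe before the kernel would block it:

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	int fds[2], flags;
	char buf[100];		/* hackbench-sized message */
	long total = 0;
	ssize_t n;

	if (pipe(fds) == -1)
		return (1);
	/* Non-blocking, so the write fails exactly when the pipe is full. */
	flags = fcntl(fds[1], F_GETFL);
	fcntl(fds[1], F_SETFL, flags | O_NONBLOCK);
	memset(buf, 'x', sizeof(buf));
	while ((n = write(fds[1], buf, sizeof(buf))) > 0)
		total += n;
	printf("pipe accepted %ld bytes before it would block\n", total);
	return (0);
}

Whatever that capacity is on a given OS, it caps how far the sender can run 
ahead of the receiver between context switches.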


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-06 Thread Attilio Rao
On 5 April 2012 19:12, Arnaud Lacombe lacom...@gmail.com wrote:
 [...]

 The test I conducted specifically for this patch did not show much
 improvement...

Can you please clarify this point?
Did the test you ran include cases where the topology was detected badly,
compared against cases where a patched kernel detected it correctly (and
you still didn't see a performance improvement), in terms of cache line
sharing?

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-06 Thread Alexander Motin

On 04/06/12 17:13, Attilio Rao wrote:

[...]


Can you please clarify this point?
Did the test you ran include cases where the topology was detected badly,
compared against cases where a patched kernel detected it correctly (and
you still didn't see a performance improvement), in terms of cache line
sharing?


At this moment SCHED_ULE does almost nothing in terms of cache line 
sharing affinity (though it is probably worth some further experiments). 
What this patch may improve is the opposite case -- reducing cache sharing 
pressure for cache-hungry applications. For example, proper cache topology 
detection (such as the lack of a global L3 cache, but a shared L2 per pair 
of cores on Core2Quad class CPUs) increases pbzip2 performance when the 
number of threads is less than the number of CPUs (i.e. when there is room 
for optimization).


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-06 Thread Attilio Rao
On 6 April 2012 15:27, Alexander Motin m...@freebsd.org wrote:
 [...]


 At this moment SCHED_ULE does almost nothing in terms of cache line sharing
 affinity (though it is probably worth some further experiments). What this
 patch may improve is the opposite case -- reducing cache sharing pressure for
 cache-hungry applications. For example, proper cache topology detection
 (such as the lack of a global L3 cache, but a shared L2 per pair of cores on
 Core2Quad class CPUs) increases pbzip2 performance when the number of threads
 is less than the number of CPUs (i.e. when there is room for optimization).

My question wasn't really about your patch. I just wanted to know whether
he actually benchmarked a case where the topology was first detected
wrongly and then recognized correctly with avg's patch, in terms of cache
level aggregation (it wasn't about your patch, btw).

Attilio


-- 
Peace can only be achieved by understanding - A. Einstein


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-06 Thread Alexander Motin

On 04/06/12 17:30, Attilio Rao wrote:

[...]


My question wasn't really about your patch. I just wanted to know whether
he actually benchmarked a case where the topology was first detected
wrongly and then recognized correctly with avg's patch, in terms of cache
level aggregation (it wasn't about your patch, btw).


I understand. I've just described a test case where properly detected 
topology could give a benefit. What the test really does is indeed a good 
question.


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-05 Thread Arnaud Lacombe
Hi,

[Sorry for the delay, I got a bit sidetrack'ed...]

2012/2/17 Alexander Motin m...@freebsd.org:
 [...]

  Is there a place where all the patches are available?


 All my scheduler patches are cumulative, so all you need is only the last
 one mentioned here, sched.htt40.patch.

You may want to have a look at the results I collected in the
`runs/freebsd-experiments' branch of:

https://github.com/lacombar/hackbench/

and compare them with the vanilla FreeBSD 9.0 and -CURRENT results
available in `runs/freebsd'. On the dual-package platform, your patch
is not a definite win.

 But in some cases, especially on multi-socket systems, to let it show its
 best, you may want to apply an additional patch from avg@ to better detect
 the CPU topology:
 https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd

The test I conducted specifically for this patch did not show much improvement...

 - Arnaud


Re: [RFT][patch] Scheduling for HTT and not only

2012-04-05 Thread Alexander Motin

On 05.04.2012 21:12, Arnaud Lacombe wrote:

[...]


The test I conducted specifically for this patch did not show much improvement...


If I understand right, this test runs thousands of threads sending and 
receiving data over pipes. It is quite likely that all CPUs will always be 
busy, so load balancing is not really important in this test. What looks 
good is that the more complicated new code is not slower than the old 
one.

While this test seems very scheduler-intensive, it may depend on many 
other factors, such as syscall performance, context switches, etc. I'll 
try to play more with it.
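
As I understand it, each hackbench group boils down to sender/receiver 
pairs like the minimal single-pair sketch below. The 100-byte message size 
and default loop count are my guesses from its output; the real tool fans 
each sender out to every receiver in a group:

#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define	DATASIZE	100	/* assumed per-message size */

int
main(int argc, char *argv[])
{
	int fds[2], loops, i;
	char buf[DATASIZE];
	pid_t pid;

	loops = (argc > 1) ? atoi(argv[1]) : 1000;
	if (pipe(fds) == -1)
		exit(1);
	if ((pid = fork()) == -1)
		exit(1);
	if (pid == 0) {
		/* Receiver: drain everything the sender pushes in. */
		close(fds[1]);
		while (read(fds[0], buf, sizeof(buf)) > 0)
			;
		_exit(0);
	}
	/* Sender: push `loops' messages, then close to signal EOF. */
	close(fds[0]);
	memset(buf, 'x', sizeof(buf));
	for (i = 0; i < loops; i++)
		write(fds[1], buf, sizeof(buf));
	close(fds[1]);
	waitpid(pid, NULL, 0);
	return (0);
}

How such pairs get batched or preempted is exactly where pipe capacity and 
the switching policy enter the picture.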


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-17 Thread Alexander Motin

On 02/15/12 21:54, Jeff Roberson wrote:

On Wed, 15 Feb 2012, Alexander Motin wrote:

I've decided to stop those cache black magic practices and focus on
things that really exist in this world -- SMT and CPU load. I've
dropped most of the cache-related things from the patch and made the
rest stricter and more predictable:
http://people.freebsd.org/~mav/sched.htt34.patch


This looks great. I think there is value in considering the other
approach further but I would like to do this part first. It would be
nice to also add priority as a greater influence in the load balancing
as well.


I haven't got a good idea yet about balancing priorities, but I've 
rewritten the balancer itself. Since sched_lowest() / sched_highest() are 
more intelligent now, they allowed removing the topology traversal from 
the balancer itself. That should fix the double-swapping problem, allow 
keeping some affinity while moving threads, and make balancing fairer. I 
did a number of tests running 4, 8, 9 and 16 CPU-bound threads on 8 CPUs. 
With 4, 8 and 16 threads everything is stationary, as it should be. With 
9 threads I see regular and random load movement between all 8 CPUs. 
Measurements over a 5-minute run show a deviation of only about 5 
seconds. It is the same deviation as I see caused by merely scheduling 16 
threads on 8 cores without any balancing needed at all. So I believe this 
code works as it should.


Here is the patch: http://people.freebsd.org/~mav/sched.htt40.patch

I plan this to be the final patch of this series (more to come :)), and 
if there are no problems or objections, I am going to commit it (except 
some debugging KTRs) in about ten days. So now is a good time for reviews 
and testing. :)
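
A rough sketch of the balancing idea, for readers who want the shape of 
it without the patch. This is illustrative only -- the real sched_lowest() 
/ sched_highest() walk the CPU topology tree with priorities and affinity 
masks, not a flat load array:

/* Move one thread's worth of load from the busiest CPU to the idlest. */
static void
balance_once(int load[], int ncpu)
{
	int i, low = 0, high = 0;

	for (i = 1; i < ncpu; i++) {
		if (load[i] < load[low])
			low = i;
		if (load[i] > load[high])
			high = i;
	}
	/* Only migrate if it actually evens out the load. */
	if (load[high] - load[low] >= 2) {
		load[high]--;
		load[low]++;
	}
}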


--
Alexander Motin


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-17 Thread Alexander Motin

On 17.02.2012 18:53, Arnaud Lacombe wrote:

On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin m...@freebsd.org wrote:

[...]


Is there a place where all the patches are available?


All my scheduler patches are cumulative, so all you need is only the 
last one mentioned here, sched.htt40.patch.


But in some cases, especially on multi-socket systems, to let it show 
its best, you may want to apply an additional patch from avg@ to better 
detect the CPU topology:

https://gitorious.org/~avg/freebsd/avgbsd/commit/6bca4a2e4854ea3fc275946a023db65c483cb9dd


I intend to run
some tests on a 1x2x2 (Atom D510), a 1x4x1 (Core 2 Quad), and eventually
a 2x8x2 platform, against r231573. Results should hopefully be
available by the end of the weekend / middle of next week[0].

[0]: the D510 will likely be testing a couple of Linux kernels over the
weekend, and a FreeBSD run takes about 2.5 days to complete.


--
Alexander Motin
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: [RFT][patch] Scheduling for HTT and not only

2012-02-17 Thread Arnaud Lacombe
Hi,

On Fri, Feb 17, 2012 at 11:29 AM, Alexander Motin m...@freebsd.org wrote:
 [...]

Is there a place where all the patches are available? I intend to run
some tests on a 1x2x2 (Atom D510), a 1x4x1 (Core 2 Quad), and eventually
a 2x8x2 platform, against r231573. Results should hopefully be
available by the end of the weekend / middle of next week[0].

 - Arnaud

[0]: the D510 will likely be testing a couple of Linux kernels over the
weekend, and a FreeBSD run takes about 2.5 days to complete.

