On Wed, 15 Feb 2012, Alexander Motin wrote:
> On 02/14/12 00:38, Alexander Motin wrote:
>> I see little point in committing them sequentially, as they are quite
>> orthogonal, but I need to make a decision. I am going on a short
>> vacation next week, which will give the ideas time to settle. Maybe I
>> will indeed just clean up the previous patch a bit and commit it when
>> I get back. I've spent too much time trying to formalize these things;
>> so far the results are not bad, but also not as brilliant as I would
>> like. Maybe it is indeed time to step back and try a simpler solution.
> I've decided to stop those cache black-magic practices and focus on
> things that really exist in this world -- SMT and CPU load. I've
> dropped most of the cache-related logic from the patch and made the
> rest more strict and predictable:
> http://people.freebsd.org/~mav/sched.htt34.patch
This looks great.  I think there is value in considering the other
approach further, but I would like to do this part first.  It would also
be nice to give priority a greater influence in the load balancing.
> This patch adds a check that skips the fast previous-CPU selection if
> that CPU's SMT neighbor is in use, rather than only when no SMT is
> present, as in the previous patches. I took the affinity/preference
> algorithm from the first patch and improved it, so that pickcpu()
> prefers the previous core or its neighbors in the case of equal load.
> That is very simple to keep, but should still give cache hits.
> I've changed the general algorithm of topology tree processing. First I
> look for an idle core on the same last-level cache as before, with
> affinity to the previous core or its neighbors on the higher-level
> caches. The original code could put an additional thread on an already
> busy core while the next socket was completely idle. Now, if there is
> no idle core under that cache, all other CPUs are checked.
> CPU group comparison is now done in two steps: first, as before, the
> summary load of all cores is compared; if that is equal, the load of
> the least/most loaded cores is compared. That differentiates whether a
> load of 2 really means 1+1 or 2+0: the group with 2+0 is taken as more
> loaded than the one with 1+1, making the group choice better grounded
> and more predictable. I've added randomization for the case where all
> of the above factors are equal.
This all sounds good.  I will need to review it in detail, but the
approach seems straightforward and fixes corner cases that are
undesirable.
> As before, I've tested this on a Core i7-870 with 4 physical and 8
> logical cores, and an Atom D525 with 2 physical and 4 logical cores.
> On the Core i7 I got a speedup of 10-15% in super-smack MySQL and
> PostgreSQL indexed selects for 2-8 threads, and no penalty in other
> cases. pbzip2 shows up to a 13% performance increase for 2-5 threads
> and no penalty in other cases.
Can you also test buildworld or buildkernel with a -j value twice the
number of cores?  This is an interesting case because it gets little
benefit from affinity and really wants the best balancing possible.
It's also the first thing people will complain about if it slows down.
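One way to run that test might be the following, assuming a FreeBSD box with /usr/src checked out; `hw.ncpu` reports the logical CPU count, and the -j value is twice that, as suggested above:

```shell
# Sketch only: derive -j from the logical CPU count (falls back to 8
# if hw.ncpu is unavailable), then time the build from /usr/src.
NCPU=$(sysctl -n hw.ncpu 2>/dev/null || echo 8)
JOBS=$((NCPU * 2))
echo "building with -j$JOBS"
# cd /usr/src && time make -j"$JOBS" buildworld
```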
> Tests on the Atom show mostly the same performance as before in the
> database benchmarks: faster for 1 thread, slower for 2-3, and about
> the same in the other cases. Single-stream network performance
> improved the same as with the first patch. That CPU is quite difficult
> to handle: with its mix of effective SMT and lack of an L3 cache,
> different scheduling approaches give different results in different
> situations.
> Specific performance numbers can be found here:
> http://people.freebsd.org/~mav/bench.ods
> Every point there includes at least 5 samples, and except for the
> pbzip2 test, which is quite unstable with the previous sources, all
> are statistically valid.
> Florian is now running an alternative set of benchmarks on dual-socket
> hardware without SMT.
Again, thank you very much for working on this.
Jeff
> --
> Alexander Motin
_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"