Re: [RFT][patch] Scheduling for HTT and not only

Jeff Roberson Mon, 13 Feb 2012 13:59:00 -0800


On Mon, 13 Feb 2012, Alexander Motin wrote:

On 02/13/12 22:23, Jeff Roberson wrote:
On Mon, 13 Feb 2012, Alexander Motin wrote:
On 02/11/12 16:21, Alexander Motin wrote:
I've heavily rewritten the patch already. So at least some of the ideas
are already addressed. :) At this moment I am mostly satisfied with
results and after final tests today I'll probably publish new version.
It took more time, but finally I think I've put pieces together:
http://people.freebsd.org/~mav/sched.htt23.patch
I need some time to read and digest this. However, at first glance, a
global pickcpu lock will not be acceptable. Better to make a rarely
imperfect decision than too often cause contention.
In my tests it was the opposite. Imperfect decisions under 60K MySQL requests per second on 8 cores quite often caused two threads to be pushed to one CPU or to one physical core, causing up to 5-10% performance penalties. I've tried both with and without the lock, and at least on an 8-core machine the difference was significant enough to add this. I understand that this is not good, but I have no machine with hundreds of CPUs to tell how it will work there. For really big systems it could be partitioned somehow, but that will also increase load imbalance.

It would be preferable to refetch the load on the target cpu and restart the selection if it has changed. Even this should have some maximum bound on the number of times it will spin and possibly be conditionally enabled. That two cpus are making the same decision indicates that the race window is occurring and contention will be guaranteed. As you have tested on only 8 cores that's not a good sign.
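
A minimal sketch of the bounded refetch-and-restart idea described above, written as standalone C11 rather than kernel code. It is not taken from the patch or from sched_ule: the function name pickcpu_retry, the load array and the PICKCPU_MAX_RETRIES bound are all hypothetical, and "load" is simply an integer per CPU.

```c
#include <assert.h>
#include <stdatomic.h>

#define PICKCPU_MAX_RETRIES 3	/* hypothetical bound, not from the patch */

/*
 * Pick the least-loaded CPU without a global lock.  After the scan,
 * re-read the winner's load; if it changed, another CPU probably raced
 * with us, so rescan, but only a bounded number of times.
 */
static int
pickcpu_retry(_Atomic int *load, int ncpu)
{
	int best = 0;

	assert(ncpu > 0);
	for (int attempt = 0; attempt <= PICKCPU_MAX_RETRIES; attempt++) {
		int best_load;

		best = 0;
		best_load = atomic_load_explicit(&load[0],
		    memory_order_relaxed);
		for (int cpu = 1; cpu < ncpu; cpu++) {
			int l = atomic_load_explicit(&load[cpu],
			    memory_order_relaxed);
			if (l < best_load) {
				best = cpu;
				best_load = l;
			}
		}
		/* Load unchanged since the scan: accept the choice. */
		if (atomic_load_explicit(&load[best],
		    memory_order_relaxed) == best_load)
			break;
	}
	/* Once the retries run out, accept a possibly imperfect choice. */
	return (best);
}
```

The retry bound keeps the worst case at a fixed number of scans, so a busy system degrades to the occasional imperfect decision instead of spinning.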

The patch is more complicated than the previous one both logically and
computationally, but with growing CPU power and complexity I think we
can possibly spend some more time deciding how to spend time. :)


It is probably worth more cycles but we need to evaluate this much more
complex algorithm carefully to make sure that each of these new features
provides an advantage.

The problem is that doing half of the things may not give the full picture. How can one tune affinity to save a few percent while the SMT effect is several times larger? At the same time, too many unknown variables in application behavior can easily make all of this pointless.

The patch formalizes several ideas of the previous code about how to
select a CPU for running a thread and adds some new ones. Its main idea
is that I've moved from comparing raw integer queue lengths to
higher-resolution flexible values. That additional 8-bit precision
allows taking many factors affecting performance into account at the
same time. Besides just choosing the best among equally-loaded CPUs,
with the new code it may even happen that because of SMT, cache
affinity, etc, a CPU with more threads on its queue will be reported
as less loaded, and vice versa.
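
A minimal sketch of such a fixed-point score, assuming 8 fractional bits as described above. It is not the code from the patch: the struct, its field names and both weights are hypothetical and chosen only to show how a CPU with a longer queue can still score as less loaded.

```c
#include <assert.h>

#define LOAD_SHIFT	8			/* 8 fractional bits */
#define LOAD_ONE	(1 << LOAD_SHIFT)	/* one queued thread */

/* Hypothetical weights, in 1/256ths of a thread. */
#define SMT_PENALTY	(LOAD_ONE * 3 / 4)
#define AFFINITY_BONUS	(LOAD_ONE / 2)

/* Hypothetical per-CPU inputs to the score. */
struct cpu_state {
	int nthreads;		/* threads on this CPU's run queue */
	int smt_busy;		/* busy sibling logical CPUs */
	int cache_affine;	/* nonzero if the thread's cache is warm here */
};

/* Lower score means "less loaded" for the thread being placed. */
static int
cpu_score(const struct cpu_state *c)
{
	int score;

	assert(c->nthreads >= 0 && c->smt_busy >= 0);
	score = c->nthreads << LOAD_SHIFT;
	score += c->smt_busy * SMT_PENALTY;
	if (c->cache_affine)
		score -= AFFINITY_BONUS;
	return (score);
}
```

With these illustrative weights, a cache-affine CPU with one queued thread and an idle sibling scores 128, while an empty CPU whose SMT sibling is busy scores 192, so the CPU with the longer queue is preferred.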

New code takes into account such factors:
- SMT sharing penalty.
- Cache sharing penalty.
- Cache affinity (with separate coefficients for last-level and other
level caches) to the:


We already used separate affinity values for different cache levels.
Keep in mind that if something else has run on a core the cache affinity
is lost in very short order. Trying too hard to preserve it beyond a few
ms never seems to pan out.

Previously it was only about a timeout, which was IMHO pointless, as it is impossible to predict when the cache will be purged. It could happen a microsecond or a second later, depending on application behavior.

This was not pointless. Eliminate it and see. The point is that after some time has elapsed the cache is almost certainly useless and we should select the most appropriate cpu based on load, priority, etc. We don't have perfect information for any of these algorithms. But as an approximation it is useful to know whether affinity should even be considered. An improvement on this would be to look at the amount of time the core has been idle since the selecting thread last ran rather than just the current load. Tell me what the point of selecting for affinity is if so much time has passed that valid cache contents are almost guaranteed to be lost?
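
One possible reading of the suggested improvement, as a standalone C sketch: instead of a pure timeout, count how long the core was busy with other work since the thread last ran, and consider affinity only while that stays small. The function name, the tick-based parameters and the threshold are hypothetical; neither the existing scheduler nor the patch is claimed to work this way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether cache affinity is worth considering at all.
 * now, last_ran:  timestamps in ticks (hypothetical units)
 * core_idle_ticks: time the core spent idle since last_ran
 * max_busy_ticks:  hypothetical cutoff for "cache probably gone"
 */
static bool
affinity_still_useful(uint64_t now, uint64_t last_ran,
    uint64_t core_idle_ticks, uint64_t max_busy_ticks)
{
	uint64_t elapsed, busy;

	assert(now >= last_ran);
	elapsed = now - last_ran;
	/* Clamp in case the idle counter overshoots the interval. */
	if (core_idle_ticks > elapsed)
		core_idle_ticks = elapsed;
	/* Only time spent running something else evicts our cache lines. */
	busy = elapsed - core_idle_ticks;
	return (busy <= max_busy_ticks);
}
```

A core that sat idle for the whole interval keeps its affinity however long ago the thread ran, while a core that was busy for longer than the cutoff is treated like any other CPU.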

- other running threads of its process,
This is not really a great indicator of whether things should be
scheduled together or not. What workload are you targeting here?
When several threads are accessing/modifying the same shared memory, like MySQL server threads. I've noticed that on an Atom CPU with no L3 it is cheaper to move two threads to one physical core to share the cache than to handle coherency over the memory bus.

It can definitely be cheaper. But there are an equal number of cases where it will be more expensive. Some applications will have a lot of contention and shared state and these will want to be co-located. Others will simply want to get as much cache and cpu time as they can. There are a number of papers that have been published on determining which is which based on cpu performance counters. I believe Sun does this in particular. Another option that Apple has pursued is to give the application the option to mark threads as wanting to be close together or far away.

I think the particular heuristic you have here is too expensive and specific to go in. The potential negative consequences are very big. If you want to pursue Apple's or Sun's approach to this problem I would be interested in that.

- previous CPU where it was running,
- current CPU (usually where it was called from).
These two were also already used. Additionally:

+ * Hide part of the current thread
+ * load, hoping it or the scheduled
+ * one complete soon.
+ * XXX: We need more stats for this.

I had something like this before. Unfortunately interactive tasks are
allowed fairly aggressive bursts of cpu to account for things like xorg
and web browsers. Also, I tried this for ithreads but they can be very
expensive in some workloads so other cpus will idle as you try to
schedule behind an ithread.
As I have noted, this needs more precise statistics about thread behavior. The present sampled statistics are almost useless there. Existing code always prefers to run a thread on the current CPU if there is no other CPU with no load. That logic works very well when 8 MySQL threads and 8 clients are working on 8 CPUs, but not so well in other situations.

You're speaking of the stathz based accounting? Or you want more precise stats about other things? We've talked for years about event based accounting rather than sampling but no one has implemented it. Please go ahead if you would like. Keep in mind that cores can change frequency and tsc values may not be stable.

However, even with perfect stats, I'm not sure whether ignoring the current load will be the right thing. I had some changes that took the interactivity score into account to do this. If it is very very low, then maybe it makes sense.

All of these factors are configurable via sysctls, but I think
reasonable defaults should fit most.

Also, compared to the previous patch, I've resurrected the optimized
shortcut in CPU selection for the case of SMT. Unlike the original code,
which had problems with this, I've added a check of the other logical
cores' load that should make it safe and still very fast when there are
fewer running threads than physical cores.
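
A rough sketch of what such a sibling check could look like, assuming each physical core is described by a bitmask of its logical CPUs. The function name, the mask representation and the load array are hypothetical; the patch may implement the shortcut differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true only if every logical CPU in core_mask is idle, so the
 * fast path can place a thread there without causing SMT sharing.
 * load[cpu] is the number of runnable threads on that logical CPU.
 */
static bool
core_fully_idle(const int *load, int ncpu, uint64_t core_mask)
{
	assert(ncpu > 0 && ncpu <= 64);
	for (int cpu = 0; cpu < ncpu; cpu++) {
		if ((core_mask & (UINT64_C(1) << cpu)) != 0 &&
		    load[cpu] != 0)
			return (false);
	}
	return (true);
}
```

The check only touches the few logical CPUs of one core, which is why a shortcut built on it can stay cheap while there are still fully idle physical cores.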

I've tested it on Core i7 and Atom systems, but more interesting would
be to test it on multi-socket system with properly detected topology
to check benefits from affinity.

At this moment the main issue I see is that this patch affects only
the time when a thread is starting. If a thread runs continuously, it
will stay where it was, even if, due to a change in the situation, that
is no longer very effective (causes SMT sharing, etc). I haven't looked
much at the periodic load balancer yet, but probably it could also be
somehow improved.

What is your opinion, is it too over-engineered, or it is the right
way to go?


I think it's a little too much change all at once. I also believe that
the changes that try very hard to preserve affinity likely help a much
smaller number of cases than they hurt. I would prefer you do one piece
at a time and validate each step. There are a lot of good ideas in here
but good ideas don't always turn into results.

When each of these small steps can change everything and they are related, the number of combinations to test grows rapidly. I am not going to commit this tomorrow. It is more like a concept that needs testing and evaluation.

I say this because I have tried nearly all of these heuristics in different forms. I don't object to the general idea of using a weighted score to select the target cpu. However, I do think several of these heuristics are problematic. While the current algorithm is far from perfect it is the product of an incredible amount of testing and experimentation. Significant changes to it are going to require an equal amount of effort to characterize and verify. And I do believe many pieces can be broken down and tested independently. For example, whether to ignore interactive load on the core, or whether to lock pickcpu, etc. can all easily be independently tested in a number of workloads.

Do you intend to clean up and commit your last, simpler patch? I have no objections to that and it simply fixes a bias in the load selection algorithm that shouldn't have existed.


Thanks,
Jeff


--
Alexander Motin

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hackers-unsubscr...@freebsd.org"
