Re: Scaling and performance issues with FreeBSD 9 (& 10) on 4 socket systems

Remy Nonnenmacher Fri, 14 Jun 2013 04:03:02 -0700


On 06/14/13 04:05, David Xu wrote:

On 2013/06/13 20:01, Remy Nonnenmacher wrote:


On 06/13/13 13:32, Mark Felder wrote:

On Wed, 12 Jun 2013 17:58:49 -0500, David O'Brien <obr...@freebsd.org>
wrote:

We found FreeBSD 8.4 to perform better than FreeBSD 9.1, and Linux
considerably better than both on the same machine.


http://svnweb.freebsd.org/base?view=revision&revision=241246

The above link is likely why 8.4 is better than 9.1 on the same machine.

We've tried various things and haven't been able to explain why FreeBSD
isn't scaling on the new hardware.  Nor why it performs so much worse
than FreeBSD on the older "M2" machines.


The CPUs between those machines are quite different. I'm sure we're
looking at different cache sizes, different behavior for the
hyperthreading, etc. I'm sure others would be greatly interested in you
providing the same benchmark results for a recent snapshot of HEAD as
well.
_______________________________________________
freebsd-performance@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-performance
To unsubscribe, send any mail to
"freebsd-performance-unsubscr...@freebsd.org"


We had the same problem on 4x12-core (AMD) machines. After investigating
with hwpmc, it appeared that performance was killed by a scheduler
function trying to find the "least used CPU", which unfortunately works
on contended structures (i.e., lots of cores are fighting to get work).
A solution was found by requiring an artificially long queue of stuck
processes (steal_thresh bumped to over 8) and by crafting CPU affinity.

This was a year ago and is from memory. I guess you may give it a try
to see if it helps.

Disregard this if a scheduler specialist contradicts it.

Thanks.


AMD's cache is very different from Intel's. AFAIK, before Bulldozer,
AMD's L3 was an exclusive cache; with Bulldozer, AMD describes the L3
cache as a “non-inclusive victim cache”. It is still different from
Intel's, which is inclusive.

"- In sched_pickcpu() change general logic of CPU selection. First
look for idle CPU, sharing last level cache with previously used one,
skipping SMT CPU groups. If none found, search all CPUs for the least
loaded one, where the thread with its priority can run now. If none
found, search just for the least loaded CPU."
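
To make the three passes in that commit message easier to follow, here
is a toy Python sketch. It is not the sched_ule.c code. The Cpu record,
its field names, and the pick_cpu() helper are all hypothetical; the
"can run now" test assumes that a numerically lower priority is more
urgent, and the SMT-group skipping from the first pass is left out.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    llc_group: int    # hypothetical: CPUs sharing a last-level cache share this id
    load: int         # hypothetical: number of threads queued or running here
    running_pri: int  # hypothetical: priority of what runs here (lower = more urgent)


def pick_cpu(cpus, last_cpu_id, thread_pri):
    """Toy model of the three-pass selection described in the commit message."""
    last = next(c for c in cpus if c.cpu_id == last_cpu_id)

    # Pass 1: an idle CPU sharing the last-level cache with the previous CPU.
    idle_near = [c for c in cpus
                 if c.llc_group == last.llc_group and c.load == 0]
    if idle_near:
        return min(idle_near, key=lambda c: c.cpu_id).cpu_id

    # Pass 2: least loaded CPU on which this thread could run right away
    # (assumption: it can if its priority beats what is running there).
    run_now = [c for c in cpus if thread_pri < c.running_pri]
    if run_now:
        return min(run_now, key=lambda c: (c.load, c.cpu_id)).cpu_id

    # Pass 3: just the least loaded CPU overall.
    return min(cpus, key=lambda c: (c.load, c.cpu_id)).cpu_id
```

Note that every pass scans the CPU list, so the work per decision grows
with the number of cores.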

With an exclusive cache, the L3 holds second-hand data, not hot data,
so migrating a thread has a negative effect: its hot data is lost.
I'd prefer to search for an idle CPU sharing L2 first, then L3.

The problem was not really the excellent job done on cache locality via
CPU detection. It was more a scaling problem with the number of cores,
which exacerbates contention when trying to steal work from other
queues. Basically, what happened (I say happened because I have not
retested recently) is that you may have 1 core running and 47 others
fighting in a loop where there is one winner and 46 losers, all of them
playing with locks and O(N=48) loops. All in all, you see degraded
performance with little indication of a cause. This is where hwpmc is a
wonderful tool...

Bumping steal_thresh up changes the pattern. If it works for you, then
the cause is probably the same.
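
As a rough illustration of why the threshold changes the pattern, here
is a toy Python model, not scheduler code. It assumes an idle core only
tries to steal from a queue holding at least steal_thresh threads, which
is my reading of the knob; both function names are hypothetical.

```python
def steal_candidates(queue_loads, steal_thresh):
    """Indices of CPUs whose queue is long enough to be stolen from (assumed rule)."""
    return [i for i, load in enumerate(queue_loads) if load >= steal_thresh]


def contending_stealers(queue_loads, steal_thresh):
    """Number of idle CPUs that would all go after the same candidate queues."""
    idle = sum(1 for load in queue_loads if load == 0)
    # With no queue over the threshold, idle CPUs have nothing to fight over.
    return idle if steal_candidates(queue_loads, steal_thresh) else 0


# 48 cores: one busy core with 3 queued threads, 47 idle ones.
loads = [3] + [0] * 47
low = contending_stealers(loads, 2)   # every idle core chases the one queue
high = contending_stealers(loads, 9)  # queue too short, nobody tries
```

In this model, raising the threshold past the typical queue length keeps
the idle cores out of the steal loop, at the cost of leaving short
queues unbalanced.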


