Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-15 Thread Andriy Gapon
on 15/04/2010 16:23 Adam Vande More said the following:
> Is it possible to add a tunable to the scheduler for its aggressiveness
> in switching cores?

No idea; not a scheduler person.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-15 Thread Adam Vande More
On Thu, Apr 15, 2010 at 3:54 AM, Andriy Gapon  wrote:

> This is a good point.
> But on the other hand, it means that our scheduler doesn't do a perfect job
> here.  BTW, I use ULE.
> My observation is that when the number of CPU-intensive long-running
> processes is less than or equal to the number of cores, the processes tend
> to stay on the same cores for a long time.
> But if the number of processes is greater, they seem to jump from core to
> core a lot.
> But I am not sure what would be an optimal strategy for that case.  If we
> try to keep some lucky processes on the same core, then CPU time might be
> shared unfairly.  Shuffling cores provides more fairness, but can hurt total
> performance.
>

Is it possible to add a tunable to the scheduler for its aggressiveness in
switching cores?

-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-15 Thread Andriy Gapon
on 14/04/2010 20:47 Adam Vande More said the following:
> I'm no expert Andriy, but it seems like if gotoblas
> implemented some of the FreeBSD optimizations then we'd be in the same
> ballpark.

This is a good point.
But on the other hand, it means that our scheduler doesn't do a perfect job
here.  BTW, I use ULE.
My observation is that when the number of CPU-intensive long-running processes
is less than or equal to the number of cores, the processes tend to stay on the
same cores for a long time.
But if the number of processes is greater, they seem to jump from core to core
a lot.
But I am not sure what would be an optimal strategy for that case.  If we try to
keep some lucky processes on the same core, then CPU time might be shared
unfairly.  Shuffling cores provides more fairness, but can hurt total
performance.

-- 
Andriy Gapon


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-15 Thread Adrian Chadd
May I make a suggestion?

Would you mind creating a shared google spreadsheet with your testing
results and a shared google document with the test setup?

I think having the data in an easily represented, easily shared medium
would be beneficial to everyone.


Adrian

On 15 April 2010 08:46, Maho NAKATA  wrote:
> Hi Andry and Adam
>
> My test again. No desktop, etc. I just run dgemm.
> Contrary to Adam's result, Hyper Threading makes the performance worse.
> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>
> Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
> Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]
>
> Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
>
> ---my system---
> CPU: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (2683.44-MHz K8-class 
> CPU)
>  Origin = "GenuineIntel"  Id = 0x106a5  Stepping = 5
>  Features=0xbfebfbff
>  Features2=0x98e3bd
>  AMD Features=0x28100800
>  AMD Features2=0x1
>  TSC: P-state invariant
> real memory  = 12884901888 (12288 MB)
> avail memory = 12387717120 (11813 MB)
> ACPI APIC Table: <110909 APIC1026>
> FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
> FreeBSD/SMP: 1 package(s) x 4 core(s)
> ---my system---
>
> ---DETAILS---
> [1]
> % ./dgemm
> n: 3000
> time : 57.666717 or 16.339074
> Mflops : 33060.624827
> n: 3100
> time : 61.502677 or 16.597376
> Mflops : 35910.025544
> n: 3200
> time : 69.075401 or 19.199833
> Mflops : 34144.297133
> n: 3300
> time : 73.699540 or 19.633594
> Mflops : 36618.756539
> n: 3400
> time : 82.256194 or 22.373651
> Mflops : 35144.518837
> n: 3500
> time : 88.975662 or 24.118761
> Mflops : 35563.394249
> n: 3600
> time : 96.436652 or 26.027588
> Mflops : 35861.148385
> n: 3700
> [2]
> % ./dgemm
> n: 3000
> time : 139.622739 or 17.693806
> Mflops : 30529.327312
> n: 3100
> time : 154.344971 or 19.566886
> Mflops : 30460.247702
> n: 3200
> time : 169.507739 or 21.467100
> Mflops : 30538.116602
> n: 3300
> time : 186.363773 or 23.615281
> Mflops : 30444.600545
> n: 3400
> time : 203.798979 or 25.817667
> Mflops : 30456.322788
> n: 3500
> ...
> [3]
> % ./dgemm
> n: 3000
> time : 134.673079 or 16.958682
> Mflops : 31852.711082
> n: 3100
> time : 148.410085 or 18.663248
> Mflops : 31935.073574
> n: 3200
> time : 162.835473 or 20.468825
> Mflops : 32027.475770
> n: 3300
> time : 179.025370 or 22.479189
> Mflops : 31983.262501
> n: 3400
> time : 195.859710 or 24.663009
> Mflops : 31882.208788
> n: 3500
> [4]
> % ./dgemm
> n: 3000
> time : 54.259647 or 14.684309
> Mflops : 36786.204907
> n: 3100
> time : 60.899147 or 17.124599
> Mflops : 34804.447141
> n: 3200
> time : 64.295342 or 17.490787
> Mflops : 37480.577569
> n: 3300
> time : 69.781247 or 18.288840
> Mflops : 39311.284796
> n: 3400
> time : 79.234397 or 21.829736
> Mflops : 36020.187858
> n: 3500
> time : 83.905419 or 22.381237
> Mflops : 38324.289174
> n: 3600
> time : 92.195022 or 25.105942
> Mflops : 37177.621122
> n: 3700
> time : 97.718841 or 25.434243
> Mflops : 39841.319494
> n: 3800
> time : 105.740463 or 27.414029
> Mflops : 40042.592613
> n: 3900
> time : 113.980157 or 29.678505
> Mflops : 39984.635420
> n: 4000
> time : 122.941569 or 31.946174
> Mflops : 40077.412531
> n: 4100
> ---DETAILS---
>
>
> From: Adam Vande More 
> Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance 
> on FreeBSD 8/amd64, Corei7 920
> Date: Wed, 14 Apr 2010 11:34:45 -0500
>
>>> > time : 162.45 or 20.430651
>>> > Mflops : 32087.318295
>>> > n: 3300
>>> > time : 178.497079 or 22.446093
>>> > Mflops : 32030.420499
>>> > n: 3400
>>> > time : 195.550715 or 24.586152
>>> > Mflops : 31981.873273
>>> > n: 3500
>>> > time : 213.403379 or 26.825058
>>> > Mflops : 31975.513363
>>> > n: 3600
>>> > ...
>>> > above output is on Core i7 920 (2.66GHz; TurboBoost on)
>>>
>>> My results:
>>> $ ./dgemm
>>> n: 3000
>>> time : 54.151302 or 28.189781
>>> Mflops : 19162.263125
>>> n: 3100
>>> time : 60.157449 or 32.214141
>>> Mflops : 18501.570537
>>> n: 3200
>>> time : 65.753191 or 34.114872
>>> Mflops : 19216.393378
>>>
>>> CPU:
>>> CPU: Intel(R) Core(TM)2 Duo CPU     E7300  @ 2.66GHz (2653.35-MHz K8-class
>>> CPU)
>>>  Origin = "Genuin

Re: Linux static linked ver doesn't work on FBSD (Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-15 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 10:26 PM, Maho NAKATA  wrote:
> From: Pieter de Goeje 
> Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance 
> on FreeBSD 8/amd64, Corei7 920
> Date: Wed, 14 Apr 2010 16:05:18 +0200
>
>> I think the best test would be to run a statically compiled linux binary on
>> FreeBSD. That way the compiler settings are exactly the same.
>
> It is not possible to run the Linux amd64 binary on FreeBSD amd64,
> ...nor the i386 version either. GotoBLAS uses special system calls.
>
> % ./dgemm
> linux_sys_futex: unknown op 265
> linux: pid 1264 (dgemm): syscall mbind not implemented
> n: 3000
> ^C
> just halt.

Yes, and while this isn't directly tied into NUMA, mbind(2),
mempolicy(2), and a few others use the same facilities that are
available via plain NUMA. I know because of messes I've tried to clean
up in these areas. I'm really not sure why this is using NUMA though,
to be honest...
Thanks,
-Garrett


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-15 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 9:21 PM, Ian Smith  wrote:
> On Wed, 14 Apr 2010, Garrett Cooper wrote:
>  > On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper  wrote:
>  > > On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA  wrote:
>  > >> Hi Andry and Adam
>  > >>
>  > >> My test again. No desktop, etc. I just run dgemm.
>  > >> Contrary to Adam's result, Hyper Threading makes the performance worse.
>  > >> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>  > >>
>  > >> Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
>  > >> Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]
>
> Er, shouldn't one of those say HTT on?  and/or Turbo boost on?  Else
> they're both the same test as [4] but with different results?

There's a problem on 8.x+ with the cores reported by the kernel. For some odd
reason, more recent Intel processors aren't reporting themselves as
HT-enabled when they have HT cores (see: kern/145385).

I didn't look into the issue too hard, but since it does seem to be a
major performance loss perhaps I should; besides, it would be good
experience to put under my belt :].

>  > >> Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
>  > >> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
>
> Clarification of all four possible test configs - 8 if you add pinning
> CPUs or not - might make this a bit clearer?
>
>  > > Doesn't this make sense? Hyperthreaded cores in Intel procs still
>  > > provide an incomplete set of registers as they're logical processors,
>  > > so I would expect for things to be slower if they're automatically run
>  > > on the SMT cores instead of the physical ones.
>
> Since we're talking FP, do HTT 'cores' share an FPU, or have their own?
> If contended, you'd have to expect worse (at least FP) performance, no?

   Ah, that's another excellent point. What instructions is dgemm
using -- pure integer based arithmetic, floating point arithmetic,
specialized operations that would benefit from using SIMD, etc?

>  > > Is there a weighting scheme to SCHED_ULE where logical processors
>  > > (like the SMT variety) get a lower score than real processors do, and
>  > > thus get scheduled for less intensive interrupting tasks, or maybe
>  > > just don't get scheduled in high use scenarios like it would if it was
>  > > a physical processor?
>  >
>  > Err... wait. Didn't see that the turbo boost results didn't scale
>  > linearly or align with one another until just a sec ago. Nevermind my
>  > previous comment.
>
> Waiting for the fog to lift ..

As am I. I don't know enough in this area, but I'm definitely open
to learning.

Thanks,
-Garrett


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 15/04/2010 04:20 Maho NAKATA said the following:
> Hi Andriy and Adam,
> 
> I did also the same thing as suggested. 
> 
> my conclusion: on Core i7 920, 2.66GHz, TurboBoost on, HyperThreading off,

So HyperThreading is off.

> then, pinned to each core like following
> 
> % procstat -t 1408
>   PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
>  1408 100160 dgemm            -                  3  190 run     -
>  1408 100161 dgemm            -                  2  190 run     -
>  1408 100162 dgemm            -                  2  190 run     -
>  1408 100163 dgemm            -                  1  189 run     -
>  1408 100164 dgemm            -                  0  190 run     -
>  1408 100165 dgemm            -                  3  189 run     -
>  1408 100166 dgemm            -                  1  190 run     -
>  1408 100167 dgemm            initial thread     0  190 run     -

But there are still 8 threads.

Can you check how many threads you have on Linux with the same configuration?
Is it possible to tell GotoBLAS to use 4 threads?
If yes, can you also test that scenario?

Also, would it be possible for you to test recent 8-STABLE?
Just for the sake of experiment.
-- 
Andriy Gapon


Linux static linked ver doesn't work on FBSD (Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Maho NAKATA
From: Pieter de Goeje 
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 16:05:18 +0200

> I think the best test would be to run a statically compiled linux binary on 
> FreeBSD. That way the compiler settings are exactly the same.

It is not possible to run the Linux amd64 binary on FreeBSD amd64,
...nor the i386 version either. GotoBLAS uses special system calls.

% ./dgemm
linux_sys_futex: unknown op 265
linux: pid 1264 (dgemm): syscall mbind not implemented
n: 3000
^C
just halt.
-- Nakata Maho http://accc.riken.jp/maho/ , JA OOO http://ja.openoffice.org/
   Blog: http://blog.goo.ne.jp/nakatamaho/ , GPG key: 
http://accc.riken.jp/maho/maho.pgp.txt


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Ian Smith
On Wed, 14 Apr 2010, Garrett Cooper wrote:
 > On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper  wrote:
 > > On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA  wrote:
 > >> Hi Andry and Adam
 > >>
 > >> My test again. No desktop, etc. I just run dgemm.
 > >> Contrary to Adam's result, Hyper Threading makes the performance worse.
 > >> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
 > >>
 > >> Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
 > >> Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

Er, shouldn't one of those say HTT on?  and/or Turbo boost on?  Else 
they're both the same test as [4] but with different results?

 > >> Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
 > >> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

Clarification of all four possible test configs - 8 if you add pinning 
CPUs or not - might make this a bit clearer?

 > > Doesn't this make sense? Hyperthreaded cores in Intel procs still
 > > provide an incomplete set of registers as they're logical processors,
 > > so I would expect for things to be slower if they're automatically run
 > > on the SMT cores instead of the physical ones.

Since we're talking FP, do HTT 'cores' share an FPU, or have their own?  
If contended, you'd have to expect worse (at least FP) performance, no?

 > > Is there a weighting scheme to SCHED_ULE where logical processors
 > > (like the SMT variety) get a lower score than real processors do, and
 > > thus get scheduled for less intensive interrupting tasks, or maybe
 > > just don't get scheduled in high use scenarios like it would if it was
 > > a physical processor?
 > 
 > Err... wait. Didn't see that the turbo boost results didn't scale
 > linearly or align with one another until just a sec ago. Nevermind my
 > previous comment.

Waiting for the fog to lift ..

cheers, Ian

Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper  wrote:
> On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA  wrote:
>> Hi Andry and Adam
>>
>> My test again. No desktop, etc. I just run dgemm.
>> Contrary to Adam's result, Hyper Threading makes the performance worse.
>> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>>
>> Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
>> Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]
>>
>> Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
>> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
>
> Doesn't this make sense? Hyperthreaded cores in Intel procs still
> provide an incomplete set of registers as they're logical processors,
> so I would expect for things to be slower if they're automatically run
> on the SMT cores instead of the physical ones.
>
> Is there a weighting scheme to SCHED_ULE where logical processors
> (like the SMT variety) get a lower score than real processors do, and
> thus get scheduled for less intensive interrupting tasks, or maybe
> just don't get scheduled in high use scenarios like it would if it was
> a physical processor?

Err... wait. Didn't see that the turbo boost results didn't scale
linearly or align with one another until just a sec ago. Nevermind my
previous comment.
-Garrett


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA  wrote:
> Hi Andry and Adam
>
> My test again. No desktop, etc. I just run dgemm.
> Contrary to Adam's result, Hyper Threading makes the performance worse.
> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>
> Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
> Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]
>
> Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

Doesn't this make sense? Hyperthreaded cores in Intel procs still
provide an incomplete set of registers as they're logical processors,
so I would expect for things to be slower if they're automatically run
on the SMT cores instead of the physical ones.

Is there a weighting scheme to SCHED_ULE where logical processors
(like the SMT variety) get a lower score than real processors do, and
thus get scheduled for less intensive interrupting tasks, or maybe
just don't get scheduled in high use scenarios like it would if it was
a physical processor?

Thanks,
-Garrett


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
Hi Adam,

From: Adam Vande More 
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 12:47:31 -0500

> Since this is a full fledged desktop environment, 90% utilization seems
> pretty good.  

No, I don't think so. Even on Ubuntu, where mine is running a full
desktop environment, GotoBLAS's performance is about 95% using dgemm.
dgemm on Linux is a lot more stable than on FreeBSD and clearly faster.

On Ubuntu:
$ ./dgemm
n: 3000
time : 51.18 or 12.795519 
Mflops : 42216.341930
n: 3100
time : 56.28 or 14.261719 
Mflops : 41791.049205
n: 3200
time : 61.35 or 15.631380 
Mflops : 41939.023080
n: 3300
time : 67.79 or 17.247202 
Mflops : 41685.474166
n: 3400
time : 73.80 or 18.471321 
Mflops : 42569.300032
n: 3500
time : 81.48 or 20.781936 
Mflops : 41273.585044
n: 3600
time : 88.17 or 22.816965 
Mflops : 40907.246233
n: 3700
time : 95.21 or 23.864101 
Mflops : 42462.684969
n: 3800
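
One way to put a number on "more stable" is to pull the Mflops lines out of a dgemm run and compare their relative spread. The following is only a sketch: the helper names and the "spread" measure ((max - min) / mean) are assumptions made for illustration, not part of the benchmark.

```python
import re
import statistics

# Matches "Mflops : 42216.341930", optionally behind "> " quote markers.
MFLOPS_RE = re.compile(r"^\s*(?:>\s*)*Mflops\s*:\s*([0-9.]+)", re.MULTILINE)


def mflops_values(text):
    """Return every Mflops figure found in dgemm output, in order."""
    return [float(v) for v in MFLOPS_RE.findall(text)]


def summarize(values):
    """Min, max, mean and relative spread ((max - min) / mean) of a run."""
    mean = statistics.mean(values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "spread": (max(values) - min(values)) / mean,
    }


if __name__ == "__main__":
    # First three sizes of the Ubuntu run shown above.
    sample = """\
n: 3000
time : 51.18 or 12.795519
Mflops : 42216.341930
n: 3100
time : 56.28 or 14.261719
Mflops : 41791.049205
n: 3200
time : 61.35 or 15.631380
Mflops : 41939.023080
"""
    stats = summarize(mflops_values(sample))
    print(f"mean {stats['mean']:.0f} Mflops, spread {stats['spread']:.1%}")
```

Feeding the FreeBSD and Ubuntu outputs through the same function gives two spread figures that can be compared directly.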

thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
Hi Andriy and Adam,

I also did the same thing as suggested.

My conclusion: on Core i7 920, 2.66GHz, TurboBoost on, HyperThreading off,
my dgemm GotoBLAS performance was as follows.

* Summary of results
36-39GFlops, 81-87% of peak performance, without pinning
35-40GFlops, 78-89% of peak performance, with pinning

My observations
* Performance is somewhat unstable: 35GFlops on one calculation, then
  40GFlops on the next, and so on. Jitter is observed.
* Pinning makes performance somewhat more stable, but we don't gain any more.

Details.
First I ran
%./dgemm
n: 3500
time : 84.431008 or 22.428125 
Mflops : 38244.168629
n: 3600
time : 90.162220 or 23.440381 
Mflops : 39819.284422
n: 3700
time : 101.427504 or 27.404345 
Mflops : 36977.121646

Note: 36-39GFlops 81-87% of peak performance

Then I pinned the threads to each core as follows:

% procstat -t 1408
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  3  190 run     -
 1408 100161 dgemm            -                  2  190 run     -
 1408 100162 dgemm            -                  2  190 run     -
 1408 100163 dgemm            -                  1  189 run     -
 1408 100164 dgemm            -                  0  190 run     -
 1408 100165 dgemm            -                  3  189 run     -
 1408 100166 dgemm            -                  1  190 run     -
 1408 100167 dgemm            initial thread     0  190 run     -

% cpuset -t 100160 -l 0
% cpuset -t 100161 -l 0
% cpuset -t 100162 -l 1
% cpuset -t 100163 -l 1
% cpuset -t 100164 -l 2
% cpuset -t 100165 -l 2
% cpuset -t 100166 -l 3
% cpuset -t 100167 -l 3
then,
% procstat -t 1408
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  0  191 run     -
 1408 100161 dgemm            -                  0  191 run     -
 1408 100162 dgemm            -                  1  190 run     -
 1408 100163 dgemm            -                  1  190 run     -
 1408 100164 dgemm            -                  2  190 run     -
 1408 100165 dgemm            -                  2  190 run     -
 1408 100166 dgemm            -                  3  190 run     -
 1408 100167 dgemm            initial thread     3  190 run     -

n: 4000
time : 121.907696 or 31.475052 
Mflops : 40677.295630
n: 4100
time : 139.842701 or 38.702532 
Mflops : 35624.444587
n: 4200
time : 143.622179 or 36.725949 
Mflops : 40356.011158
n: 4300
time : 153.742976 or 39.465752 
Mflops : 40301.013511
n: 4400
time : 164.919566 or 42.380653 
Mflops : 40208.611317
n: 4500
time : 175.930335 or 45.422572 
Mflops : 40132.139469
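
The pinning steps above (list the thread IDs with procstat -t, then issue one cpuset -t per thread) can be sketched as a small Python helper that only builds the command lines. The function names are hypothetical, and filling the CPUs in blocks is an assumption chosen to match the layout used above (two threads per core); the script does not run cpuset itself.

```python
def parse_tids(procstat_output):
    """Return thread IDs from `procstat -t` output (second column of each row)."""
    tids = []
    for line in procstat_output.splitlines():
        fields = line.split()
        # Skip the header and any row not starting with numeric PID and TID.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            tids.append(int(fields[1]))
    return tids


def pin_commands(tids, ncpus):
    """Build one `cpuset -t TID -l CPU` command per thread, filling CPUs in blocks."""
    per_cpu = -(-len(tids) // ncpus)  # ceiling division: threads per CPU
    return [f"cpuset -t {tid} -l {i // per_cpu}" for i, tid in enumerate(tids)]


if __name__ == "__main__":
    sample = """\
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  3  190 run     -
 1408 100161 dgemm            -                  2  190 run     -
 1408 100162 dgemm            -                  2  190 run     -
 1408 100163 dgemm            -                  1  189 run     -
"""
    # Print the commands for review instead of executing them.
    for cmd in pin_commands(parse_tids(sample), ncpus=2):
        print(cmd)
```

Printing rather than executing keeps the sketch safe to try on any machine; the output can be pasted into a shell on the FreeBSD box once it looks right.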

Thanks

From: Adam Vande More 
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 12:47:31 -0500

> On Wed, Apr 14, 2010 at 11:51 AM, Andriy Gapon  wrote:
> 
>> on 14/04/2010 19:45 Adam Vande More said the following:
>> >
>> > also if I run cpuset on the dgemm then the utilization is basically at
>> > the theoretical max for one core so at least that part is working.
>>
>> You can also try procstat -t  to find out thread IDs and cpuset -t to
>> pin the threads to the cores.
>>
> 
> it gets to around 90% doing that.
> 
> time : 103.617271 or 27.140992
> Mflops : 47172.925449
> n: 4100
> time : 113.910669 or 30.520677
> Mflops : 45174.496186
> n: 4200
> time : 121.880695 or 32.068070
> Mflops : 46217.711013
> n: 4300
> 
> tried a couple of different thread orders but didn't seem to make a
> difference.
> 
> galacticdominator% procstat -t 1922
>   PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
>  1922 100092 dgemm            initial thread     0  190 run     -
>  1922 100268 dgemm            -                  1  190 run     -
>  1922 100270 dgemm            -                  1  191 run     -
>  1922 100272 dgemm            -                  3  190 run     -
>  1922 100273 dgemm            -                  2  191 run     -
>  1922 100274 dgemm            -                  2  191 run     -
>  1922 100282 dgemm            -                  0  190 run     -
>  1922 100283 dgemm            -                  3  190 run     -
> 
> galacticdominator% cpuset -t 100092 -l 0
> galacticdominator% cpuset -t 100268 -l 1
> galacticdominator% cpuset -t 100270 -l 2
> galacticdominator% cpuset -t 100272 -l 3
> galacticdominator% cpuset -t 100273 -l 0
> galacticdominator% cpuset -t 100274 -l 1
> galacticdominator% cpuset -t 100282 -l 2
> galacticdominator% cpuset -t 100283 -l 3
> 
> 
> galacticdominator% cpuset -t 100092 -l 0
> galacticdominator% cpuset -t 100268 -l 0
> galacticdomin

Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
Oops, I missed this e-mail...

From: Adam Vande More 
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 11:45:04 -0500

> On Wed, Apr 14, 2010 at 11:34 AM, Adam Vande More 
> wrote:
> 
>>
>>
>>
>> That's about 67% utilization, turning off HTT drops it more.  HTT on the
>> newer cores is good, not bad.
>>
> 
> Well that was completely contrary to some tests I'd run when I first got
> the cpu.
> 
> With HTT off:
> 
> n: 3000
> time : 44.705516 or 11.760183
> Mflops : 45932.959253
> n: 3100
> time : 50.598581 or 14.270123
> Mflops : 41766.437458
> n: 3200
> time : 55.748192 or 15.780977
> Mflops : 41541.458400
> n: 3300
> time : 62.072217 or 17.441431
> Mflops : 41221.262070
> n: 3400
> 
> so that's about 79% right there.
> 
> also if I run cpuset on the dgemm then the utilization is basically at the
> theoretical max for one core so at least that part is working.
> 
> -- 
> Adam Vande More


HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Maho NAKATA
Hi Andriy and Adam,

My test again. No desktop, etc. I just ran dgemm.
Contrary to Adam's result, Hyper Threading makes the performance worse.
All tests were done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)

Turbo Boost off, Hyper threading off: 82% (35GFlops)       [1]
Turbo Boost off, Hyper threading off: 72% (30.5GFlops)     [2]

Turbo Boost on,  Hyper threading on: 71% (32GFlops)        [3]
Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

---my system---
CPU: Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz (2683.44-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x106a5  Stepping = 5
  Features=0xbfebfbff
  Features2=0x98e3bd
  AMD Features=0x28100800
  AMD Features2=0x1
  TSC: P-state invariant
real memory  = 12884901888 (12288 MB)
avail memory = 12387717120 (11813 MB)
ACPI APIC Table: <110909 APIC1026>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
---my system---

---DETAILS---
[1]
% ./dgemm
n: 3000
time : 57.666717 or 16.339074 
Mflops : 33060.624827
n: 3100
time : 61.502677 or 16.597376 
Mflops : 35910.025544
n: 3200
time : 69.075401 or 19.199833 
Mflops : 34144.297133
n: 3300
time : 73.699540 or 19.633594 
Mflops : 36618.756539
n: 3400
time : 82.256194 or 22.373651 
Mflops : 35144.518837
n: 3500
time : 88.975662 or 24.118761 
Mflops : 35563.394249
n: 3600
time : 96.436652 or 26.027588 
Mflops : 35861.148385
n: 3700
[2]
% ./dgemm
n: 3000
time : 139.622739 or 17.693806 
Mflops : 30529.327312
n: 3100
time : 154.344971 or 19.566886 
Mflops : 30460.247702
n: 3200
time : 169.507739 or 21.467100 
Mflops : 30538.116602
n: 3300
time : 186.363773 or 23.615281 
Mflops : 30444.600545
n: 3400
time : 203.798979 or 25.817667 
Mflops : 30456.322788
n: 3500
...
[3]
% ./dgemm
n: 3000
time : 134.673079 or 16.958682 
Mflops : 31852.711082
n: 3100
time : 148.410085 or 18.663248 
Mflops : 31935.073574
n: 3200
time : 162.835473 or 20.468825 
Mflops : 32027.475770
n: 3300
time : 179.025370 or 22.479189 
Mflops : 31983.262501
n: 3400
time : 195.859710 or 24.663009 
Mflops : 31882.208788
n: 3500
[4]
% ./dgemm
n: 3000
time : 54.259647 or 14.684309 
Mflops : 36786.204907
n: 3100
time : 60.899147 or 17.124599 
Mflops : 34804.447141
n: 3200
time : 64.295342 or 17.490787 
Mflops : 37480.577569
n: 3300
time : 69.781247 or 18.288840 
Mflops : 39311.284796
n: 3400
time : 79.234397 or 21.829736 
Mflops : 36020.187858
n: 3500
time : 83.905419 or 22.381237 
Mflops : 38324.289174
n: 3600
time : 92.195022 or 25.105942 
Mflops : 37177.621122
n: 3700
time : 97.718841 or 25.434243 
Mflops : 39841.319494
n: 3800
time : 105.740463 or 27.414029 
Mflops : 40042.592613
n: 3900
time : 113.980157 or 29.678505 
Mflops : 39984.635420
n: 4000
time : 122.941569 or 31.946174 
Mflops : 40077.412531
n: 4100
---DETAILS---


From: Adam Vande More 
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 11:34:45 -0500

>> > time : 162.45 or 20.430651
>> > Mflops : 32087.318295
>> > n: 3300
>> > time : 178.497079 or 22.446093
>> > Mflops : 32030.420499
>> > n: 3400
>> > time : 195.550715 or 24.586152
>> > Mflops : 31981.873273
>> > n: 3500
>> > time : 213.403379 or 26.825058
>> > Mflops : 31975.513363
>> > n: 3600
>> > ...
>> > above output is on Core i7 920 (2.66GHz; TurboBoost on)
>>
>> My results:
>> $ ./dgemm
>> n: 3000
>> time : 54.151302 or 28.189781
>> Mflops : 19162.263125
>> n: 3100
>> time : 60.157449 or 32.214141
>> Mflops : 18501.570537
>> n: 3200
>> time : 65.753191 or 34.114872
>> Mflops : 19216.393378
>>
>> CPU:
>> CPU: Intel(R) Core(TM)2 Duo CPU E7300  @ 2.66GHz (2653.35-MHz K8-class
>> CPU)
>>  Origin = "GenuineIntel"  Id = 0x10676  Stepping = 6
>>  Features=0xbfebfbff
>>  Features2=0x8e39d
>>  AMD Features=0x20100800
>>  AMD Features2=0x1
>>  TSC: P-state invariant
>> ⋮
>> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>> FreeBSD/SMP: 1 package(s) x 2 core(s)
>>
>> FreeBSD:
>> FreeBSD 8.0-STABLE r205070 amd64
>>
>> Please note that the system was not dedicated to the test, I had
>> Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons
>> running.
>> That probably explains irregularities in the results.
>>
>> I am not sure how exactly theoretical maximum should be calculated, I used
>> 2 * 2.66G * 4 ≈ 21.3G.
>> And so 19.2G / 21.3G ≈ 90%.
>>
>> Not as bad as what you get.
>> Although not as good as what you report for Linux.
>> But given the impurity and imprecision of my test…
>>
>> P.S. the machine is two-core obviously :-)
>> Don't have anything with more cpus/cores handy.
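
The peak estimate quoted above (cores x clock x 4) can be written out as a short Python sketch. The factor of 4 flops per cycle per core is taken from the quoted calculation; the function names are made up for illustration.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=4):
    """Theoretical peak in GFlops: cores x clock (GHz) x flops per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


def efficiency(measured_gflops, cores, clock_ghz, flops_per_cycle=4):
    """Measured throughput as a fraction of the theoretical peak."""
    return measured_gflops / peak_gflops(cores, clock_ghz, flops_per_cycle)


if __name__ == "__main__":
    # Core 2 Duo E7300 figures from the quoted message: 2 cores at 2.66 GHz.
    print(f"peak: {peak_gflops(2, 2.66):.1f} GFlops")
    print(f"efficiency: {efficiency(19.2, 2, 2.66):.0%}")
```

The same two functions can be reused for the Core i7 920 runs by passing 4 cores and whichever clock (base or Turbo Boost) is assumed to apply to that run.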

Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
From: Andriy Gapon 
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 16:19:13 +0300

> on 14/04/2010 02:21 Maho NAKATA said the following:
>> 2. install ports/math/gotoblas (manual download required)
>>  make install 
> 
> 
> Do you know how gotoblas on Linux was obtained?
Yes. Just download the archive.

> Was it built from source?
Yes.
> Has it come pre-packaged?
No.

> If so, can you find out details of its build configuration?

I'm not sure, but I built it as follows on Ubuntu 9.10 amd64.

$ tar xvfz GotoBLAS2-1.13.tar.gz
$ cd GotoBLAS2
$ ./quickbuild.64bit
ln -fs libgoto2_nehalemp-r1.13.a libgoto2.a
for d in interface driver/level2 driver/level3 driver/others kernel  lapack ; \
do if test -d $d; then \
  make -j 8 -C $d libs || exit 1 ; \
fi; \
done
make[1]: Entering directory `/home/maho/a/GotoBLAS2/interface'
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC 
 -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=saxpy -DASMFNAME=saxpy_ 
-DNAME=saxpy_ -DCNAME=saxpy -DCHAR_NAME=\"saxpy_\" -DCHAR_CNAME=\"saxpy\" -I.. 
-I. -UDOUBLE  -UCOMPLEX -c axpy.c -o saxpy.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC 
 -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=sswap -DASMFNAME=sswap_ 
-DNAME=sswap_ -DCNAME=sswap -DCHAR_NAME=\"sswap_\" -DCHAR_CNAME=\"sswap\" -I.. 
-I. -UDOUBLE  -UCOMPLEX -c swap.c -o sswap.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC 
 -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=scopy -DASMFNAME=scopy_ 
-DNAME=scopy_ -DCNAME=scopy -DCHAR_NAME=\"scopy_\" -DCHAR_CNAME=\"scopy\" -I.. 
-I. -UDOUBLE  -UCOMPLEX -c copy.c -o scopy.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC 
 -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=sscal -DASMFNAME=sscal_ 
-DNAME=sscal_ -DCNAME=sscal -DCHAR_NAME=\"sscal_\" -DCHAR_CNAME=\"sscal\" -I.. 
-I. -UDOUBLE  -UCOMPLEX -c scal.c -o sscal.o


-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Adam Vande More
On Wed, Apr 14, 2010 at 11:51 AM, Andriy Gapon  wrote:

> on 14/04/2010 19:45 Adam Vande More said the following:
> >
> > also if I run cpuset on the dgemm then the utilization is basically at
> > the theoretical max for one core so at least that part is working.
>
> You can also try procstat -t <pid> to find out thread IDs and cpuset -t to
> pin the threads to the cores.
>

it gets to around 90% doing that.

time : 103.617271 or 27.140992
Mflops : 47172.925449
n: 4100
time : 113.910669 or 30.520677
Mflops : 45174.496186
n: 4200
time : 121.880695 or 32.068070
Mflops : 46217.711013
n: 4300

tried a couple of different thread orders but didn't seem to make a
difference.

galacticdominator% procstat -t 1922
  PID    TID COMM   TDNAME          CPU  PRI STATE   WCHAN
 1922 100092 dgemm  initial thread    0  190 run     -
 1922 100268 dgemm  -                 1  190 run     -
 1922 100270 dgemm  -                 1  191 run     -
 1922 100272 dgemm  -                 3  190 run     -
 1922 100273 dgemm  -                 2  191 run     -
 1922 100274 dgemm  -                 2  191 run     -
 1922 100282 dgemm  -                 0  190 run     -
 1922 100283 dgemm  -                 3  190 run     -

galacticdominator% cpuset -t 100092 -l 0
galacticdominator% cpuset -t 100268 -l 1
galacticdominator% cpuset -t 100270 -l 2
galacticdominator% cpuset -t 100272 -l 3
galacticdominator% cpuset -t 100273 -l 0
galacticdominator% cpuset -t 100274 -l 1
galacticdominator% cpuset -t 100282 -l 2
galacticdominator% cpuset -t 100283 -l 3


galacticdominator% cpuset -t 100092 -l 0
galacticdominator% cpuset -t 100268 -l 0
galacticdominator% cpuset -t 100270 -l 1
galacticdominator% cpuset -t 100272 -l 1
galacticdominator% cpuset -t 100273 -l 2
galacticdominator% cpuset -t 100274 -l 2
galacticdominator% cpuset -t 100282 -l 3
galacticdominator% cpuset -t 100283 -l 3
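
The two pinning orders above can be generated instead of typed by hand. This is a minimal sketch, assuming the thread IDs have already been read from procstat -t. The helper pin_commands and the Order enum are made-up names for illustration; only the `cpuset -t <tid> -l <cpu>` form comes from the commands above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of any FreeBSD tool.
enum class Order { RoundRobin, Blocked };

// Build one "cpuset -t <tid> -l <cpu>" command per thread.
// RoundRobin: thread i goes to cpu i % ncpus       (0,1,2,3,0,1,2,3)
// Blocked:    consecutive threads share one cpu    (0,0,1,1,2,2,3,3)
std::vector<std::string> pin_commands(const std::vector<int>& tids,
                                      int ncpus, Order order)
{
    std::vector<std::string> cmds;
    if (ncpus <= 0 || tids.empty())
        return cmds;
    // Threads per cpu in the blocked layout, rounded up.
    const std::size_t per_cpu = (tids.size() + ncpus - 1) / ncpus;
    for (std::size_t i = 0; i < tids.size(); ++i) {
        const std::size_t cpu = (order == Order::RoundRobin)
            ? i % ncpus
            : i / per_cpu;
        cmds.push_back("cpuset -t " + std::to_string(tids[i]) +
                       " -l " + std::to_string(cpu));
    }
    return cmds;
}
```

With the eight thread IDs listed above and ncpus = 4, Order::RoundRobin reproduces the first set of commands and Order::Blocked the second.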


This is from the second set:

time : 150.348850 or 40.488350
Mflops : 45022.951141
n: 4600
time : 161.968982 or 43.589618
Mflops : 44669.884500
n: 4700

Since this is a full-fledged desktop environment, 90% utilization seems
pretty good.  I'm no expert, Andriy, but it seems like if gotoblas
implemented some of the FreeBSD optimizations then we'd be in the same
ballpark.


-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 14/04/2010 19:45 Adam Vande More said the following:
> 
> also if I run cpuset on the dgemm then the utilization is basically at
> the theoretical max for one core so at least that part is working.

You can also try procstat -t <pid> to find out thread IDs and cpuset -t to pin
the threads to the cores.

-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Adam Vande More
On Wed, Apr 14, 2010 at 11:34 AM, Adam Vande More wrote:

>
>
>
> That's about 67% utilization, turning off HTT drops it more.  HTT on the
> newer cores is good, not bad.
>

Well, that was completely contrary to some tests I'd run when I first got
the CPU.

With HTT off:

n: 3000
time : 44.705516 or 11.760183
Mflops : 45932.959253
n: 3100
time : 50.598581 or 14.270123
Mflops : 41766.437458
n: 3200
time : 55.748192 or 15.780977
Mflops : 41541.458400
n: 3300
time : 62.072217 or 17.441431
Mflops : 41221.262070
n: 3400

so that's about 79% right there.

also if I run cpuset on the dgemm then the utilization is basically at the
theoretical max for one core so at least that part is working.

-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Adam Vande More
On Wed, Apr 14, 2010 at 10:26 AM, Andriy Gapon  wrote:

> on 14/04/2010 02:21 Maho NAKATA said the following:
> > 4. run dgemm.
> > % ./dgemm
> > n: 3000
> > time : 134.648208 or 16.910525
> > Mflops : 31943.419695
> > n: 3100
> > time : 148.122279 or 18.615284
> > Mflops : 32017.357408
> > n: 3200
> > time : 162.45 or 20.430651
> > Mflops : 32087.318295
> > n: 3300
> > time : 178.497079 or 22.446093
> > Mflops : 32030.420499
> > n: 3400
> > time : 195.550715 or 24.586152
> > Mflops : 31981.873273
> > n: 3500
> > time : 213.403379 or 26.825058
> > Mflops : 31975.513363
> > n: 3600
> > ...
> > above output is on Core i7 920 (2.66GHz; TurboBoost on)
>
> My results:
> $ ./dgemm
> n: 3000
> time : 54.151302 or 28.189781
> Mflops : 19162.263125
> n: 3100
> time : 60.157449 or 32.214141
> Mflops : 18501.570537
> n: 3200
> time : 65.753191 or 34.114872
> Mflops : 19216.393378
>
> CPU:
> CPU: Intel(R) Core(TM)2 Duo CPU E7300  @ 2.66GHz (2653.35-MHz K8-class
> CPU)
>  Origin = "GenuineIntel"  Id = 0x10676  Stepping = 6
>
>
> Features=0xbfebfbff
>
>  Features2=0x8e39d
>  AMD Features=0x20100800
>  AMD Features2=0x1
>  TSC: P-state invariant
> ⋮
> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
> FreeBSD/SMP: 1 package(s) x 2 core(s)
>
> FreeBSD:
> FreeBSD 8.0-STABLE r205070 amd64
>
> Please note that the system was not dedicated to the test, I had
> Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons
> running.
> That probably explains irregularities in the results.
>
> I am not sure how exactly theoretical maximum should be calculated, I used
> 2 *
> 2.66G * 4 ≈ 21.3G.
> And so 19.2G / 21.3G ≈ 90%.
>
> Not as bad as what you get.
> Although not as good as what you report for Linux.
> But given the impurity and imprecision of my test…
>
> P.S. the machine is two-core obviously :-)
> Don't have anything with more cpus/cores handy.
>
> P.P.S. Having _only glimpsed_ at the source I think that there are some
> things
> that GotoBLAS doesn't try to do on FreeBSD that it tries to do on Linux.
> Like setting CPU-affinity for the threads, or avoiding HTT pseudo-cores.
> Those things are possible on FreeBSD.
> Perhaps, there are more things like that.
>
>
Mine is also a live desktop enviro, kde4+

n: 3000
time : 116.377609 or 16.696066
Mflops : 32353.729042
n: 3100
time : 127.230336 or 17.274867
Mflops : 34501.695325
n: 3200
time : 139.018175 or 18.342056
Mflops : 35741.074976
n: 3300
time : 152.519365 or 20.154714
Mflops : 35671.942364
n: 3400
time : 166.248145 or 21.952426
Mflops : 35818.874941
n: 3500
time : 182.565385 or 24.492597
Mflops : 35020.581786
n: 3600
time : 198.551018 or 26.906992
Mflops : 34689.094992
n: 3700
time : 215.428919 or 28.574964
Mflops : 35462.294838
n: 3800
^C

CPU: Intel(R) Core(TM) i7 CPU 870  @ 2.93GHz (3313.71-MHz K8-class
CPU)
  Origin = "GenuineIntel"  Id = 0x106e5  Family = 6  Model = 1e  Stepping =
5

Features=0xbfebfbff

Features2=0x98e3fd
  AMD Features=0x28100800
  AMD Features2=0x1
  TSC: P-state invariant

That's about 67% utilization, turning off HTT drops it more.  HTT on the
newer cores is good, not bad.





-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 14/04/2010 02:21 Maho NAKATA said the following:
> 4. run dgemm. 
> % ./dgemm
> n: 3000
> time : 134.648208 or 16.910525 
> Mflops : 31943.419695
> n: 3100
> time : 148.122279 or 18.615284 
> Mflops : 32017.357408
> n: 3200
> time : 162.45 or 20.430651 
> Mflops : 32087.318295
> n: 3300
> time : 178.497079 or 22.446093 
> Mflops : 32030.420499
> n: 3400
> time : 195.550715 or 24.586152 
> Mflops : 31981.873273
> n: 3500
> time : 213.403379 or 26.825058 
> Mflops : 31975.513363
> n: 3600
> ...
> above output is on Core i7 920 (2.66GHz; TurboBoost on)

My results:
$ ./dgemm
n: 3000
time : 54.151302 or 28.189781
Mflops : 19162.263125
n: 3100
time : 60.157449 or 32.214141
Mflops : 18501.570537
n: 3200
time : 65.753191 or 34.114872
Mflops : 19216.393378

CPU:
CPU: Intel(R) Core(TM)2 Duo CPU E7300  @ 2.66GHz (2653.35-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0x10676  Stepping = 6

Features=0xbfebfbff
  Features2=0x8e39d
  AMD Features=0x20100800
  AMD Features2=0x1
  TSC: P-state invariant
⋮
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)

FreeBSD:
FreeBSD 8.0-STABLE r205070 amd64

Please note that the system was not dedicated to the test, I had
Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons running.
That probably explains irregularities in the results.

I am not sure how exactly theoretical maximum should be calculated, I used 2 *
2.66G * 4 ≈ 21.3G.
And so 19.2G / 21.3G ≈ 90%.
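
The estimate above can be written down as a small calculation. This is only a sketch of the arithmetic in this message; the function names are made up, and the figure of 4 double-precision flops per cycle per core is the assumption used above, not something measured here.

```cpp
// Theoretical peak in GFLOPS: cores x clock (GHz) x flops per cycle per core.
double peak_gflops(int cores, double ghz, double flops_per_cycle)
{
    return cores * ghz * flops_per_cycle;
}

// Fraction of the theoretical peak that was actually achieved.
double utilization(double measured_gflops, double peak)
{
    return measured_gflops / peak;
}
```

For the two-core 2.66 GHz machine, peak_gflops(2, 2.66, 4) gives 21.28, and utilization(19.2, 21.28) comes out at roughly 0.90.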

Not as bad as what you get.
Although not as good as what you report for Linux.
But given the impurity and imprecision of my test…

P.S. the machine is two-core obviously :-)
Don't have anything with more cpus/cores handy.

P.P.S. Having _only glimpsed_ at the source I think that there are some things
that GotoBLAS doesn't try to do on FreeBSD that it tries to do on Linux.
Like setting CPU-affinity for the threads, or avoiding HTT pseudo-cores.
Those things are possible on FreeBSD.
Perhaps, there are more things like that.
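
For the affinity part, FreeBSD provides cpuset_setaffinity(2). The following is a minimal sketch of pinning the calling thread to one CPU. It is not taken from GotoBLAS; the function name pin_current_thread is made up, and on systems other than FreeBSD the sketch simply reports failure.

```cpp
#if defined(__FreeBSD__)
#include <sys/param.h>
#include <sys/cpuset.h>
#endif

// Pin the calling thread to a single CPU. Returns true on success.
// Sketch only: on anything other than FreeBSD it always returns false.
bool pin_current_thread(int cpu)
{
#if defined(__FreeBSD__)
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return false;
    cpuset_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    // An id of -1 together with CPU_WHICH_TID means "the current thread".
    return cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_TID, -1,
                              sizeof(mask), &mask) == 0;
#else
    (void)cpu;
    return false;
#endif
}
```

A worker thread would call pin_current_thread(k) once at startup, which has the same effect as running cpuset -t on that thread from outside.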

-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Pieter de Goeje
On Wednesday 14 April 2010 15:19:13 Andriy Gapon wrote:
> on 14/04/2010 02:21 Maho NAKATA said the following:
> > 2. install ports/math/gotoblas (manual download required)
> >  make install
> 
> Do you know how gotoblas on Linux was obtained?
> Was it built from source?
> Has it come pre-packaged?
> If so, can you find out details of its build configuration?
> 
> Thanks!

I think the best test would be to run a statically compiled linux binary on 
FreeBSD. That way the compiler settings are exactly the same.

- Pieter


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 14/04/2010 02:21 Maho NAKATA said the following:
> 2. install ports/math/gotoblas (manual download required)
>  make install 


Do you know how gotoblas on Linux was obtained?
Was it built from source?
Has it come pre-packaged?
If so, can you find out details of its build configuration?

Thanks!
-- 
Andriy Gapon


How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-13 Thread Maho NAKATA
Hi all, thanks for showing interest in this issue.

I uploaded my test code so that you can test on your PC.
The instructions follow.

1. Download my source code:
http://people.freebsd.org/~maho/dgemm/Makefile
http://people.freebsd.org/~maho/dgemm/dgemm.cpp
and check the MD5 sums.
% md5 Makefile dgemm.cpp 
MD5 (Makefile) = b408ab1e1f5bf8b923cae5ec9f9f0f07
MD5 (dgemm.cpp) = 0d774a456a665429c67c2b07fd24c64c

2. install ports/math/gotoblas (manual download required)
 make install 

3. compile dgemm.cpp
just type make
% make
g++44 -pthread -static -O2 -o dgemm dgemm.cpp  -L/usr/local/lib -lgoto2p
g++44 -pthread -static -O2 -o dgemm_ref dgemm.cpp  -L/usr/local/lib -lblas -lgfortran

4. run dgemm. 
% ./dgemm
n: 3000
time : 134.648208 or 16.910525 
Mflops : 31943.419695
n: 3100
time : 148.122279 or 18.615284 
Mflops : 32017.357408
n: 3200
time : 162.45 or 20.430651 
Mflops : 32087.318295
n: 3300
time : 178.497079 or 22.446093 
Mflops : 32030.420499
n: 3400
time : 195.550715 or 24.586152 
Mflops : 31981.873273
n: 3500
time : 213.403379 or 26.825058 
Mflops : 31975.513363
n: 3600
...
above output is on Core i7 920 (2.66GHz; TurboBoost on)

Thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-13 Thread Andriy Gapon
on 13/04/2010 02:33 Maho NAKATA said the following:
> From: Andriy Gapon 
>> Another question is what compilers (what versions of GCC) were used on both
>> system to compile the program?
> 
> Hi
> 
> on Ubuntu
> $ gcc -v
> Using built-in specs.
> Target: x86_64-linux-gnu
> Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.1-4ubuntu9'
> --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
> --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared
> --enable-multiarch --enable-linker-build-id --with-system-zlib
> --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
> --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4
> --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
> --disable-werror --with-arch-32=i486 --with-tune=generic
> --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
> --target=x86_64-linux-gnu
> Thread model: posix
> gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)
>
> on FreeBSD
> % gcc44 -v
> Using built-in specs.
> Target: x86_64-portbld-freebsd8.0
> Configured with: ./../gcc-4.4-20100330/configure --disable-nls
> --libdir=/usr/local/lib/gcc44 --libexecdir=/usr/local/libexec/gcc44
> --program-suffix=44 --with-as=/usr/local/bin/as --with-gmp=/usr/local
> --with-gxx-include-dir=/usr/local/lib/gcc44/include/c++/
> --with-ld=/usr/local/bin/ld --with-libiconv-prefix=/usr/local
> --with-system-zlib --disable-libgcj --prefix=/usr/local
> --mandir=/usr/local/man --infodir=/usr/local/info/gcc44
> --build=x86_64-portbld-freebsd8.0
> Thread model: posix
> gcc version 4.4.4 20100330 (prerelease) (GCC)

Is this what was used to compile the code in the hot path (the code that performs
all the actual calculations)?  The answer is not obvious.
GCC 4.4 is known to produce better code for modern CPUs, partially because it
has knowledge of recently introduced instructions.

-- 
Andriy Gapon


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Alan Cox
On Tue, Apr 13, 2010 at 12:35 AM, Andrew Snow  wrote:

>
> The statements about the scheduler flipping between cores are also somewhat
> false; ULE does the right thing now for long-running computational threads.
>
> Furthermore, I can't see how a Gflops benchmark which fits in the CPU cache
> has anything to do with the memory architecture of the operating system.
>
>
It can.  Search the web for descriptions of page coloring.  Roughly
speaking, if your cache is physically indexed, the way in which the virtual
memory system allocates physical pages to virtual addresses can affect
whether or not the cache is fully utilized.  In a pathological case, those
physical pages that your application touches reside in the same part of the
cache and consequently you suffer frequent conflict misses.  Meanwhile, the
other parts of the cache go unused.  Page coloring creates a predictable
mapping between virtual and physical addresses so that a carefully written
application can avoid the pathological case.

Our support for superpages has the same effect.
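
To make the page coloring idea concrete, here is a minimal sketch of how a page's "color" follows from its physical address in a physically indexed, set-associative cache. The function names are made up, and the cache parameters in the usage note are illustrative numbers, not measured properties of any CPU in this thread.

```cpp
#include <cstdint>

// Number of page colors of a physically indexed cache:
// the bytes covered by one way, divided by the page size.
std::uint64_t num_colors(std::uint64_t cache_bytes, unsigned ways,
                         std::uint64_t page_bytes)
{
    return cache_bytes / ways / page_bytes;
}

// Color of the physical page that contains paddr. Pages with the same
// color compete for the same group of cache sets.
std::uint64_t page_color(std::uint64_t paddr, std::uint64_t page_bytes,
                         std::uint64_t colors)
{
    return (paddr / page_bytes) % colors;
}
```

For an assumed 8 MiB, 16-way cache with 4 KiB pages, num_colors gives 128. In the pathological case, where every page the application touches has the same color, only 1/128 of such a cache (64 KiB) would be usable.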

Alan


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Andrew Snow


The statements about the scheduler flipping between cores are also 
somewhat false; ULE does the right thing now for long-running 
computational threads.


Furthermore, I can't see how a Gflops benchmark which fits in the CPU 
cache has anything to do with the memory architecture of the operating 
system.


I assume to reach these results the benchmark was multi-threaded, and so 
I think I'd start by looking at the scheduler.


Before that I'd probably look at the libraries, how they were compiled, 
differences in the compiler etc.


- Andrew



Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Alan Cox
On Sun, Apr 11, 2010 at 11:12 PM, Maho NAKATA  wrote:

> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm
> (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10 /amd64. I'm really disappointed.
>
> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
> told me that FreeBSD is not a suitable OS for scientific computing or high
> performance computing. He says (in Japanese; my translation):
>
> > I guess FreeBSD does page coloring, but I don't think FreeBSD considers
> > the very large cache sizes that recent CPUs have. Support for very large
> > caches on Linux is still not very sophisticated, but on *BSDs it's worse;
> > they use too fine-grained a memory allocation method, so we cannot expect
> > large contiguous physical memory allocations.
>

These statements about FreeBSD's memory management are wrong, or at least
outdated.  FreeBSD is very likely to allocate physical memory in contiguous
chunks to your memory-hungry application even if automatic superpage
promotion does not occur.

You should refer your friend to my paper at
http://www.usenix.org/events/osdi02/tech/full_papers/navarro/navarro_html/ and
tell him that FreeBSD >= 7.2 implements a variation on what that paper
describes.

Regards,
Alan


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Michael Poole
Maho NAKATA writes:

> From: Michael Poole 
> Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, 
> Corei7 920
> Date: Mon, 12 Apr 2010 10:06:55 -0400
>
>> Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
>> per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
>> (DGEMM is double precision, but I am not familiar enough with scientific
>> computing or with the Nehalem implementation of SSE to know why it is
>> four operations per cycle rather than two -- is it because double
>> precision counts as two FLOPs or is it because of multiple issue?)
>> TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
>> theoretical peak performance or the performance discrepancy very well.
>
> Hi Michael,
> I read http://www.intel.com/support/processors/sb/cs-023143.htm
> and TurboBoost on 920 is 2.80GHz.

Ah.  I was looking at http://ark.intel.com/Product.aspx?id=37147 .
Given a 2.80 GHz TurboBoost, the 44.8 GFLOPS theoretical performance
number makes sense.

I think the more important point is that TurboBoost on this CPU gives at
most a 10% speedup, so it cannot explain the 25% performance difference.
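
The comparison can be spelled out with the numbers quoted in this thread. This is just a sketch of the arithmetic, with made-up function names.

```cpp
// Largest speedup clock boosting alone can give: turbo clock / nominal clock.
double max_turbo_speedup(double nominal_ghz, double turbo_ghz)
{
    return turbo_ghz / nominal_ghz;
}

// How many times faster one measured result is than another.
double observed_speedup(double faster_gflops, double slower_gflops)
{
    return faster_gflops / slower_gflops;
}
```

max_turbo_speedup(2.66, 2.93) is about 1.10 and max_turbo_speedup(2.66, 2.80) about 1.05, while observed_speedup(42.0, 32.0) for the reported Ubuntu and FreeBSD results is about 1.31, so the gap is larger than the clock boost can account for under either TurboBoost figure.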

Michael


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
Hi Bruce, many thanks.

I like FreeBSD, especially ports, and I have been a ports committer for 8 years,
so I'll do what I can. A first step might be to get reproducible results and provide
a better analysis for ports/math/gotoblas.

thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt



Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
Hi,

Many thanks for your interest.
I used the following program to measure the FLOPS. I'll provide more details later.
You may need a header that declares dgemm_f77, but you can change dgemm_f77 to
something else to link against GotoBLAS (ports/math/gotoblas). I think you can
use math/atlas, but it takes too long to compile...
---
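#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

// The original header names are missing from this copy of the message; the
// three above are what the code below needs. The original also pulled in a
// header declaring dgemm_f77 after defining F77_FUNC. As an assumption, the
// standard Fortran BLAS symbol dgemm_ is declared by hand instead.
#define F77_FUNC(name,NAME) name ## _
#define dgemm_f77 F77_FUNC(dgemm,DGEMM)
extern "C" void dgemm_f77(const char *transa, const char *transb,
    const int *m, const int *n, const int *k,
    const double *alpha, const double *a, const int *lda,
    const double *b, const int *ldb,
    const double *beta, double *c, const int *ldc);

#define MAXLOOP 10

// User CPU time of this process in microseconds (all threads together).
unsigned long long microseconds()
{
    struct rusage  t;
    struct timeval tv;
    getrusage( RUSAGE_SELF, &t );
    tv = t.ru_utime;
    // 1000000 microseconds per second (the copy shows a truncated "100").
    return ((unsigned long long)tv.tv_sec)*1000000 + tv.tv_usec;
}

// Wall-clock time in seconds.
double gettimeofday_sec()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + (double)tv.tv_usec*1e-6;
}

int
main()
{
    int n;
    double alpha = 3.14, beta = 2.717;
    double dgemmtime, t1, t2, t_1, t_2;

    // The loop bound reads "n < 1" in this copy, which would never run;
    // 10000 is an assumed upper limit (the posted output reaches n = 4700).
    for (n = 3000 ; n < 10000; n=n+100) {
        printf("n: %d\n", (int)n);
        double *A = new double[n*n];
        double *B = new double[n*n];
        double *C = new double[n*n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                A[i*n+j] = i * j + 1;
                B[i*n+j] = (i+1) * (j+1) + 1;
                C[i*n+j] = (i+1) - (j+1) + 1;
            }
        }
        // Time MAXLOOP back-to-back dgemm calls with both clocks.
        t1 = (double)microseconds(); t_1 = gettimeofday_sec();
        for (int p = 0 ; p < MAXLOOP; p++ ){
            dgemm_f77("n", "n", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
        }
        t2 = (double)microseconds(); t_2 = gettimeofday_sec();
        // dgemmtime = (t2 - t1) * 1e-6;    // CPU time instead of wall time
        dgemmtime = (t_2 - t_1);
        // First number: CPU time; second number: wall-clock time.
        printf("time : %lf or %lf \n", (t2 - t1) * 1e-6, t_2 - t_1);
        printf("Mflops : %lf\n", ( 2.0 * (double)n * (double)n * (double)n +
            2.0 * (double)n * (double)n ) * MAXLOOP / dgemmtime / (1000*1000) );
        delete[]C;
        delete[]B;
        delete[]A;
    }
}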


-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
Hi Andriy and Jeremy

In my case,
% sysctl vm.pmap.pg_ps_enabled
vm.pmap.pg_ps_enabled: 1

thanks a lot!

From: Jeremy Chadwick 
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, 
Corei7 920
Date: Mon, 12 Apr 2010 08:00:23 -0700

> On Mon, Apr 12, 2010 at 05:41:35PM +0300, Andriy Gapon wrote:
>> Perhaps, he talks about support of large pages (2M) and related improvements 
>> in
>> TLB performance.  If so, he (and you) may read about 'superpages' feature of 
>> FreeBSD.
>> I am not sure if it is enabled by default in 8.0, you can check 
>> vm.pmap.pg_ps_enabled.
> 
> On 8.0-RELEASE and later, they are.  Line 183:
> 
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c?annotate=1.667.2.12
> 
> Commit where they got enabled by default (approx. 16 months ago):
> 
> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c#rev1.646
> 
> -- 
> | Jeremy Chadwick   j...@parodius.com |
> | Parodius Networking   http://www.parodius.com/ |
> | UNIX Systems Administrator  Mountain View, CA, USA |
> | Making life hard for others since 1977.  PGP: 4BD6C0CB |
> 


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
From: Andriy Gapon 
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, 
Corei7 920
Date: Mon, 12 Apr 2010 17:49:24 +0300

> on 12/04/2010 17:41 Andriy Gapon said the following:
>> It would also be good to learn more about your program.
>> How much memory does it typically use, how does it allocate it?
>> Is it single-threaded or not?  If not, how many threads does it have and 
>> what do
>> they do, how do they communicate?
> 
> Another question is what compilers (what versions of GCC) were used on both 
> system
> to compile the program?

Hi

on Ubuntu
$ gcc -v
Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.1-4ubuntu9' 
--with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs 
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared 
--enable-multiarch --enable-linker-build-id --with-system-zlib 
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix 
--with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls 
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --disable-werror 
--with-arch-32=i486 --with-tune=generic --enable-checking=release 
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9) 

on FreeBSD
% gcc44 -v
Using built-in specs.
Target: x86_64-portbld-freebsd8.0
Configured with: ./../gcc-4.4-20100330/configure --disable-nls 
--libdir=/usr/local/lib/gcc44 --libexecdir=/usr/local/libexec/gcc44 
--program-suffix=44 --with-as=/usr/local/bin/as --with-gmp=/usr/local 
--with-gxx-include-dir=/usr/local/lib/gcc44/include/c++/ 
--with-ld=/usr/local/bin/ld --with-libiconv-prefix=/usr/local 
--with-system-zlib --disable-libgcj --prefix=/usr/local --mandir=/usr/local/man 
--infodir=/usr/local/info/gcc44 --build=x86_64-portbld-freebsd8.0
Thread model: posix
gcc version 4.4.4 20100330 (prerelease) (GCC) 

thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
From: Michael Poole 
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, 
Corei7 920
Date: Mon, 12 Apr 2010 10:06:55 -0400

> Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
> per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
> (DGEMM is double precision, but I am not familiar enough with scientific
> computing or with the Nehalem implementation of SSE to know why it is
> four operations per cycle rather than two -- is it because double
> precision counts as two FLOPs or is it because of multiple issue?)
> TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
> theoretical peak performance or the performance discrepancy very well.

Hi Michael,
I read http://www.intel.com/support/processors/sb/cs-023143.htm
and TurboBoost on 920 is 2.80GHz.
> why it is four operations per cycle rather than two

It's a bit strange to me as well, but I did the dgemm operation with m=k=n, and 
in this case the flop count is 2n^3 + 2n^2 (even 2n^3 is okay). 
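
That count can be cross-checked against the benchmark output posted in this thread. This is a minimal sketch with made-up function names; the loop count of 10 is MAXLOOP from the posted program, and the time is the wall-clock figure (the second number on each "time" line).

```cpp
// Flop count of one dgemm call on square n x n matrices, as used by the
// benchmark: 2n^3 plus a lower-order 2n^2 term.
double dgemm_flops(int n)
{
    const double d = n;
    return 2.0 * d * d * d + 2.0 * d * d;
}

// Mflops for `loops` back-to-back calls that took `seconds` of wall time.
double dgemm_mflops(int n, int loops, double seconds)
{
    return dgemm_flops(n) * loops / seconds / 1e6;
}
```

For the n = 3000 run on the Core i7 920 (16.910525 s of wall time), dgemm_mflops(3000, 10, 16.910525) gives about 31943 Mflops, in agreement with the reported "Mflops : 31943.419695".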

thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
Hi Antony

I think this is not the case. I tested with TurboBoost on and off on Ubuntu; GotoBLAS
achieved 95% of theoretical performance in both cases.

cf. http://www.intel.com/support/processors/sb/cs-023143.htm
and http://blog.goo.ne.jp/nakatamaho/e/86c0f4ac529fd5b530454ed795e6b466 
(written in Japanese, tho)
Thanks

From: Antony Mawer 
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, 
Corei7 920
Date: Mon, 12 Apr 2010 23:58:17 +1000

> This may well be the same sort of issue that was discussed in this thread 
> here:
> 
> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
> 
> In short, the Core i7 CPUs have a feature called "TurboBoost" where
> the clock speed of one or more cores is boosted when other cores are
> idle and in a C2 or C3 sleep status ... if the appropriate power
> saving mode isn't active on the system (which I don't think FreeBSD
> does by default?), the idle cores are never put into the appropriate
> power saving state, and as a result TurboBoost never kicks in...
> 
> It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
> not (out of the box)?
> 
> Of course it may be something else entirely, but worth checking out...
> 
> --Antony


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Maho NAKATA
Hi Bruce,

From: Bruce Simpson 
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Mon, 12 Apr 2010 10:49:14 +0100

> So, where's the profiling to discover why this is the case?
OK, I'll provide better documentation so that everyone can reproduce the test.
(It may take some time...)

> Also I'm not clear on what constitutes 'theoretical peak performance'
> here or how it is being calculated. So figures like these come across
> as unscientific.

The Core i7 920 (2.66GHz) has four cores, and each core can perform four
floating-point operations per cycle.
Thus: 2.66GHz x 4 x 4 = 42.56Gflops
cf. http://www.intel.com/support/processors/sb/cs-023143.htm
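
As a sanity check on the arithmetic above, the peak figure and the
percentages reported earlier in the thread can be recomputed with a few
lines of Python. This is only a sketch: the function names are made up for
illustration, and the 2.8 GHz clock used for the 44.8 Gflops upper bound is
an assumption inferred from 44.8 / (4 x 4), not something stated in the
thread.

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak in Gflops: clock (GHz) x cores x flops per core per cycle."""
    return clock_ghz * cores * flops_per_cycle


def percent_of_peak(measured_gflops, peak):
    """Measured throughput as a percentage of the theoretical peak."""
    return 100.0 * measured_gflops / peak


# Nominal clock, as in the calculation above (about 42.56 Gflops).
nominal = peak_gflops(2.66, cores=4, flops_per_cycle=4)
# Assumed 2.8 GHz clock, which reproduces the 44.8 Gflops upper bound.
upper = peak_gflops(2.8, cores=4, flops_per_cycle=4)

for name, gflops in [("FreeBSD", 32.0), ("Ubuntu", 42.0), ("Ubuntu", 42.7)]:
    print(f"{name}: {gflops} Gflops = "
          f"{percent_of_peak(gflops, nominal):.1f}% of {nominal:.2f}, "
          f"{percent_of_peak(gflops, upper):.1f}% of {upper:.2f}")
```

With these inputs, the 71%, 93.8% and 95.3% figures from the original post
come out (to within rounding) when 44.8 Gflops is used as the denominator;
measured against 42.56 Gflops, the FreeBSD result would be about 75%.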

> I'm sure this is something which can be resolved if someone sits down,
> profiles the app, and makes the necessary adjustments
> (e.g. pthread_setaffinity_np()) to configure CPU affinity, if the lack
> of it is pessimizing your friend's app.
It might be. Both OSes were run on the same machine.

> The PMC framework is rapidly maturing, and you can use KCacheGrind
> with it to visualize context switch overhead.
> 
> But I think it's expecting a bit much to post informal results to
> -stable, in an expectation of something other than informal suggestions
> of what may help someone's maths-intensive application.

BLAS is a basic linear algebra package which is used by many applications.
It is also used, via LINPACK, for the TOP500 ranking http://www.top500.org/
(cf. http://www.top500.org/project/introduction).
dgemm is a Level 3 BLAS routine, which makes it a very good benchmark for
common PCs because the calculation is CPU intensive.
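
For readers unfamiliar with how a Gflops figure is obtained from dgemm: the
usual convention is to count 2*m*n*k floating-point operations for C = A*B
with A of size m x k and B of size k x n (one multiply and one add per
inner-loop step), and to divide that count by the elapsed time. The sketch
below illustrates only that bookkeeping with a naive pure-Python triple loop.
It is not GotoBLAS and is far slower than any tuned BLAS; the function names
are hypothetical.

```python
import time


def naive_dgemm(a, b):
    """Return the product of a (m x k) and b (k x n), both given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = a[i][p]
            for j in range(n):
                c[i][j] += a_ip * b[p][j]   # one multiply and one add
    return c


def dgemm_flops(m, n, k):
    """Conventional operation count for an (m x k) by (k x n) product."""
    return 2 * m * n * k


def measure_gflops(size):
    """Time one size x size product and convert the operation count to Gflops."""
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    start = time.perf_counter()
    naive_dgemm(a, b)
    elapsed = time.perf_counter() - start
    return dgemm_flops(size, size, size) / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gflops(100):.4f} Gflops (naive Python, for illustration only)")
```

A real measurement would call the dgemm routine of the BLAS under test on
much larger matrices and take the best of several runs; only the flop count
and the division by time carry over from this sketch.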

> If there are performance issues, then reproducible results are needed,
> as well as some basic profiling effort of the system elements
> involved, before people could say anything either way, or offer
> further help.
Again, I'll provide better documentation so that everyone can reproduce the test.
(It may take some time...)

thanks,
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Bruce Simpson

Hi all,

There's a port archivers/pbzip2, and I am inclined to believe this is a 
good benchmark for multi-core performance in real-world usage (with an 
appropriate input data set).


BZIP2 is a compression algorithm which is readily applicable to 
multicore, because its workload can be partitioned among multiple CPU 
cores. It block-sorts, and it can compress long runs of input data 
independently of other CPU threads.
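
A rough illustration of why this kind of workload spreads well across
cores: the input can be cut into blocks, each block compressed
independently, and the resulting bzip2 streams concatenated. The sketch
below uses Python's standard bz2 module and a thread pool. It shows the
partitioning idea only and makes no claim about how pbzip2 itself is
implemented; the function name, block size and worker count are arbitrary
choices for the example.

```python
import bz2
from concurrent.futures import ThreadPoolExecutor


def parallel_bz2(data, block_size=900_000, workers=4):
    """Compress data block by block in a thread pool and concatenate the streams."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so the streams come back in sequence.
        streams = list(pool.map(bz2.compress, blocks))
    return b"".join(streams)


if __name__ == "__main__":
    payload = b"FreeBSD 8-STABLE multi-core benchmark input\n" * 100_000
    packed = parallel_bz2(payload)
    # bz2.decompress() accepts several concatenated streams.
    assert bz2.decompress(packed) == payload
    print(len(payload), "->", len(packed), "bytes")
```

Because every block is an independent stream, no worker has to wait for
another one, which is what makes the workload easy to distribute.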


When I used PBZIP2 informally back in January, before advising on 
FreeBSD/Xen, I saw largely the results I'd expect to see from such a 
workload, and didn't encounter pessimization of benchmark figures. 
Informal tests were performed on 8-STABLE at that time.


The OP may well be looking for Newton-Raphson approximations to the 
derivatives involved in his friend's linear algebra system. The point is 
that PBZIP2 would also exercise context switches in a real-life workload.


I'd be concerned, as anyone else would be, about benchmarks which 
apparently challenge FreeBSD's ability to tackle significant 
mathematical workloads. But from what little I understand, from speaking 
to David Schultz and others who have been involved with FreeBSD's 
floating point performance, on a scientific basis -- without a 
scientifically reproducible experiment, I don't see a problem.


Obviously, I am concerned that Nakata-san observes what he regards as 
a problem, and would like to help any way I can.


cheers,
BMS






Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Jeremy Chadwick
On Mon, Apr 12, 2010 at 05:41:35PM +0300, Andriy Gapon wrote:
> Perhaps he is talking about support for large pages (2M) and the related
> improvements in TLB performance.  If so, he (and you) may want to read about
> the 'superpages' feature of FreeBSD.
> I am not sure if it is enabled by default in 8.0; you can check
> vm.pmap.pg_ps_enabled.

On 8.0-RELEASE and later, they are.  Line 183:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c?annotate=1.667.2.12

Commit where they got enabled by default (approx. 16 months ago):

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c#rev1.646

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Andriy Gapon
on 12/04/2010 17:41 Andriy Gapon said the following:
> It would also be good to learn more about your program.
> How much memory does it typically use, and how does it allocate it?
> Is it single-threaded or not?  If not, how many threads does it have, what do
> they do, and how do they communicate?

Another question: which compilers (which versions of GCC) were used on both
systems to compile the program?

-- 
Andriy Gapon


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Andriy Gapon
on 12/04/2010 07:12 Maho NAKATA said the following:
> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ] 
> 
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.

Sorry about that, but the more important question (for us) is: are you willing
to help us improve, in addition to reporting your results?

> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
> told me that FreeBSD is not a suitable OS for scientific computing or
> high performance computing. He says (in Japanese and my translation):
> 
>> I guess FreeBSD does page coloring, but I don't think FreeBSD takes into
>> account the very large cache sizes that recent CPUs have.

AFAIK, recent FreeBSD doesn't use page coloring anymore.

>> Support for very large caches on Linux is still not very sophisticated,
>> but on the *BSDs it is worse; they use too fine-grained a memory
>> allocation method, so we cannot expect large contiguous physical memory
>> allocations.

Can your friend explain these points in more technical terms?
E.g. what kind of support, in his opinion, is needed for very large caches?
Why, in his opinion, does the memory need to be physically contiguous?

Perhaps he is talking about support for large pages (2M) and the related
improvements in TLB performance.  If so, he (and you) may want to read about
the 'superpages' feature of FreeBSD.
I am not sure if it is enabled by default in 8.0; you can check
vm.pmap.pg_ps_enabled.
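
To check the setting mentioned above from a script, one option is to call
the sysctl(8) utility and parse its output. A minimal sketch, assuming a
FreeBSD host where `sysctl -n vm.pmap.pg_ps_enabled` prints a single
integer; the helper names are hypothetical, and on other systems the lookup
simply reports that the value is unavailable.

```python
import subprocess


def parse_sysctl_flag(text):
    """Interpret `sysctl -n` output for a boolean knob: non-zero means enabled."""
    return int(text.strip()) != 0


def superpages_enabled(name="vm.pmap.pg_ps_enabled"):
    """Return True/False for the knob, or None if it cannot be read here."""
    try:
        result = subprocess.run(["sysctl", "-n", name],
                                capture_output=True, text=True, check=True)
        return parse_sysctl_flag(result.stdout)
    except (OSError, subprocess.CalledProcessError, ValueError):
        # sysctl missing, knob unknown, or output not an integer.
        return None


if __name__ == "__main__":
    print("superpages enabled:", superpages_enabled())
```

Running `sysctl vm.pmap.pg_ps_enabled` by hand gives the same answer; the
wrapper is only useful when the check is part of a larger benchmark script.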

>> Moreover, process scheduling is not so nice, as *BSD employs an algorithm
>> that moves a job from one physical CPU to the next in turn instead of
>> dedicating one core to this kind of job.
>> Run your own benchmark, and you'll see..

Here I can only add an anecdotal 'me too'.
Sometimes I run single-threaded, CPU-intensive programs such as ffmpeg transcoding
on an otherwise idle system (with just a bunch of system daemons in the background).
And I see that the CPU-consuming process frequently goes back and forth between
my two cores.  CPU user loads on the cores are something like 60% vs 40%.
My expectation was that the process would mostly run on one core while the rest
of the threads would mostly run on the other.
I am not sure whether that core switching really hurts performance or whether
there is anything wrong with it.  But somehow it seems 'counter-intuitive'.

> *Result*
> Machine: Core i7 920 (42.56-44.8Gflops) / DDR3 1066
> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
> GotoBLAS2: 1.13
> 
> dgemm result
> OS      : FLOPS           : percent of peak
> FreeBSD : 32.0 GFlops     : 71%
> Ubuntu  : 42.0-42.7GFlops : 93.8%-95.3%

It would also be good to learn more about your program.
How much memory does it typically use, and how does it allocate it?
Is it single-threaded or not?  If not, how many threads does it have, what do
they do, and how do they communicate?

-- 
Andriy Gapon


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Michael Poole
Antony Mawer writes:

> This may well be the same sort of issue that was discussed in this thread here:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
>
> In short, the Core i7 CPUs have a feature called "TurboBoost" where
> the clock speed of one or more cores is boosted when other cores are
> idle and in a C2 or C3 sleep status ... if the appropriate power
> saving mode isn't active on the system (which I don't think FreeBSD
> does by default?), the idle cores are never put into the appropriate
> power saving state, and as a result TurboBoost never kicks in...
>
> It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
> not (out of the box)?
>
> Of course it may be something else entirely, but worth checking out...

Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
(DGEMM is double precision, but I am not familiar enough with scientific
computing or with the Nehalem implementation of SSE to know why it is
four operations per cycle rather than two -- is it because double
precision counts as two FLOPs or is it because of multiple issue?)
TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
theoretical peak performance or the performance discrepancy very well.
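
On the parenthetical question: the usual accounting for Nehalem is that a
128-bit SSE register holds two doubles and the core can issue one packed
multiply and one packed add per cycle, giving 2 x 2 = 4 double-precision
operations per cycle. The sketch below redoes the arithmetic under that
assumption and compares the clock ratio that TurboBoost could explain with
the ratio actually reported in the thread. The constant and function names
are illustrative only.

```python
DOUBLES_PER_SSE_REGISTER = 2      # 128-bit register / 64-bit double
PACKED_OPS_PER_CYCLE = 2          # assumed: one packed multiply + one packed add
FLOPS_PER_CYCLE = DOUBLES_PER_SSE_REGISTER * PACKED_OPS_PER_CYCLE
CORES = 4


def peak_gflops(clock_ghz):
    """Theoretical double-precision peak for the whole chip at a given clock."""
    return clock_ghz * CORES * FLOPS_PER_CYCLE


nominal_peak = peak_gflops(2.66)   # about 42.56 Gflops
turbo_peak = peak_gflops(2.93)     # about 46.88 Gflops

# If the only difference were nominal vs. maximum TurboBoost clock, FreeBSD
# would be expected to reach roughly this fraction of the Ubuntu result:
clock_ratio = 2.66 / 2.93          # about 0.91
# Fraction actually reported in the thread (32.0 vs. 42.7 Gflops):
observed_ratio = 32.0 / 42.7       # about 0.75

print(f"peak at 2.66 GHz: {nominal_peak:.2f} Gflops, "
      f"at 2.93 GHz: {turbo_peak:.2f} Gflops")
print(f"clock ratio {clock_ratio:.2f} vs. observed ratio {observed_ratio:.2f}")
```

The observed gap is considerably larger than the clock difference alone
would produce, which is consistent with the point made above.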

Michael Poole


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Antony Mawer
This may well be the same sort of issue that was discussed in this thread here:

http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html

In short, the Core i7 CPUs have a feature called "TurboBoost" where
the clock speed of one or more cores is boosted when other cores are
idle and in a C2 or C3 sleep status ... if the appropriate power
saving mode isn't active on the system (which I don't think FreeBSD
does by default?), the idle cores are never put into the appropriate
power saving state, and as a result TurboBoost never kicks in...

It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
not (out of the box)?

Of course it may be something else entirely, but worth checking out...

--Antony



Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Adrian Chadd
Of course, what would be helpful is actually figuring out what is
going on rather than some conjecture. :)

Given what he said, wouldn't tweaking memory allocation under FreeBSD and/or
Linux change the performance characteristics and either validate or
disprove his assumptions?


Adrian



Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-12 Thread Bruce Simpson

On 04/12/10 05:12, Maho NAKATA wrote:

> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.


So, where's the profiling to discover why this is the case?

Also I'm not clear on what constitutes 'theoretical peak performance' 
here or how it is being calculated. So figures like these come across as 
unscientific.


I'm sure this is something which can be resolved if someone sits down, 
profiles the app, and makes the necessary adjustments (e.g. 
pthread_setaffinity_np()) to configure CPU affinity, if the lack of it 
is pessimizing your friend's app.
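
pthread_setaffinity_np() is a C-level, per-thread call. As a rough
process-level analogue, here is a sketch in Python using
os.sched_setaffinity(), which is only available on some platforms (notably
Linux); on FreeBSD the cpuset(1) utility serves a similar purpose from the
command line. The helper name pin_to_cpu is hypothetical.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU.

    Returns True on success, False if this Python build does not provide
    os.sched_setaffinity() or the request is rejected by the OS.
    """
    if not hasattr(os, "sched_setaffinity"):
        return False
    try:
        os.sched_setaffinity(0, {cpu})   # pid 0 means the calling process
    except OSError:
        return False
    return True


if __name__ == "__main__":
    if pin_to_cpu(0):
        print("now restricted to CPUs:", sorted(os.sched_getaffinity(0)))
    else:
        print("CPU affinity could not be set on this platform")
```

Running the benchmark once pinned and once unpinned would be one way to
test whether core switching accounts for any of the reported difference.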


The PMC framework is rapidly maturing, and you can use KCacheGrind with 
it to visualize context switch overhead.


But I think it's expecting a bit much to post informal results to 
-stable, in an expectation of something other than informal suggestions 
of what may help someone's maths-intensive application.


If there are performance issues, then reproducible results are needed, 
as well as some basic profiling effort of the system elements involved, 
before people could say anything either way, or offer further help.


cheers,
BMS


Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-11 Thread Garrett Cooper
On Sun, Apr 11, 2010 at 9:12 PM, Maho NAKATA  wrote:
> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.
>
> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
> told me that FreeBSD is not a suitable OS for scientific computing or
> high performance computing. He says (in Japanese and my translation):
>
>> I guess FreeBSD does page coloring, but I don't think FreeBSD takes into
>> account the very large cache sizes that recent CPUs have. Support for very
>> large caches on Linux is still not very sophisticated, but on the *BSDs it
>> is worse; they use too fine-grained a memory allocation method, so we
>> cannot expect large contiguous physical memory allocations.
>> Moreover, process scheduling is not so nice, as *BSD employs an algorithm
>> that moves a job from one physical CPU to the next in turn instead of
>> dedicating one core to this kind of job.
>> Run your own benchmark, and you'll see..
>
> *Result*
> Machine: Core i7 920 (42.56-44.8Gflops) / DDR3 1066
> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
> GotoBLAS2: 1.13
>
> dgemm result
> OS      : FLOPS           : percent of peak
> FreeBSD : 32.0 GFlops     : 71%
> Ubuntu  : 42.0-42.7GFlops : 93.8%-95.3%

I'm not sure if this is the exact issue, but it might be a point
of reference worth investigating:
http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
Thanks,
-Garrett