Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 15/04/2010 16:23 Adam Vande More said the following:
> Is it possible to add a tunable to the scheduler for its aggressiveness
> in switching cores?

No idea; not a scheduler person.

--
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Thu, Apr 15, 2010 at 3:54 AM, Andriy Gapon wrote:
> This is a good point.
> But on the other hand, it means that our scheduler doesn't do a perfect job
> here. BTW, I use ULE.
> My observation is that when the number of CPU-intensive long-running
> processes is less than or equal to the number of cores, the processes tend
> to stay on the same cores for a long time.
> But if the number of processes is greater, they seem to jump from core
> to core a lot.
> But I am not sure what would be an optimal strategy for that case. If we
> try to keep some lucky processes on the same core, then CPU time might be
> shared unfairly. Shuffling cores provides more fairness, but can hurt total
> performance.

Is it possible to add a tunable to the scheduler for its aggressiveness in
switching cores?

--
Adam Vande More
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 14/04/2010 20:47 Adam Vande More said the following:
> I'm no expert Andriy, but it seems like if gotoblas
> implemented some of the FreeBSD optimizations then we'd be in the same
> ballpark.

This is a good point.
But on the other hand, it means that our scheduler doesn't do a perfect job
here. BTW, I use ULE.
My observation is that when the number of CPU-intensive long-running
processes is less than or equal to the number of cores, the processes tend
to stay on the same cores for a long time.
But if the number of processes is greater, they seem to jump from core to
core a lot.
But I am not sure what would be an optimal strategy for that case. If we try
to keep some lucky processes on the same core, then CPU time might be shared
unfairly. Shuffling cores provides more fairness, but can hurt total
performance.

--
Andriy Gapon
Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
May I make a suggestion? Would you mind creating a shared Google spreadsheet
with your testing results and a shared Google document with the test setup?

I think having the data in an easily represented, easily shared medium
would be beneficial to everyone.

Adrian

On 15 April 2010 08:46, Maho NAKATA wrote:
> Hi Andriy and Adam
>
> My test again. No desktop, etc. I just run dgemm.
> Contrary to Adam's result, Hyper Threading makes the performance worse.
> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>
> Turbo Boost off, Hyper threading off: 82% (35GFlops) [1]
> Turbo Boost off, Hyper threading off: 72% (30.5GFlops) [2]
>
> Turbo Boost on, Hyper threading on: 71% (32GFlops) [3]
> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
>
> [...]
Re: Linux static linked ver doesn't work on FBSD (Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
On Wed, Apr 14, 2010 at 10:26 PM, Maho NAKATA wrote:
> From: Pieter de Goeje
> Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance
> on FreeBSD 8/amd64, Corei7 920
> Date: Wed, 14 Apr 2010 16:05:18 +0200
>
>> I think the best test would be to run a statically compiled linux binary on
>> FreeBSD. That way the compiler settings are exactly the same.
>
> It is not possible to run a Linux amd64 binary on FreeBSD amd64,
> ...nor the i386 version. GotoBLAS uses special system calls.
>
> % ./dgemm
> linux_sys_futex: unknown op 265
> linux: pid 1264 (dgemm): syscall mbind not implemented
> n: 3000
> ^C
> It just halts.

Yes, and while this isn't directly tied into NUMA, mbind(2),
mempolicy(2), and a few others use the same facilities that are
available via plain NUMA. I know because of messes I've tried to
clean up in these areas.

I'm really not sure why this is using NUMA though, to be honest...

Thanks,
-Garrett
Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
On Wed, Apr 14, 2010 at 9:21 PM, Ian Smith wrote:
> On Wed, 14 Apr 2010, Garrett Cooper wrote:
> > On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper wrote:
> > > On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA wrote:
> > >> Hi Andriy and Adam
> > >>
> > >> My test again. No desktop, etc. I just run dgemm.
> > >> Contrary to Adam's result, Hyper Threading makes the performance worse.
> > >> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
> > >>
> > >> Turbo Boost off, Hyper threading off: 82% (35GFlops) [1]
> > >> Turbo Boost off, Hyper threading off: 72% (30.5GFlops) [2]
>
> Er, shouldn't one of those say HTT on? and/or Turbo boost on? Else
> they're both the same test as [4] but with different results?

There's a problem in 8.x+ with the cores reported by the kernel. For
some odd reason more recent Intel processors aren't reporting
themselves as HT-enabled when they have HT cores (see: kern/145385).
I didn't look into the issue too hard, but since it does seem to be a
major performance loss perhaps I should; besides, it would be good
experience to put under my belt :].

> > >> Turbo Boost on, Hyper threading on: 71% (32GFlops) [3]
> > >> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
>
> Clarification of all four possible test configs - 8 if you add pinning
> CPUs or not - might make this a bit clearer?
>
> > > Doesn't this make sense? Hyperthreaded cores in Intel procs still
> > > provide an incomplete set of registers as they're logical processors,
> > > so I would expect for things to be slower if they're automatically run
> > > on the SMT cores instead of the physical ones.
>
> Since we're talking FP, do HTT 'cores' share an FPU, or have their own?
> If contended, you'd have to expect worse (at least FP) performance, no?

Ah, that's another excellent point. What instructions is dgemm using:
pure integer arithmetic, floating-point arithmetic, specialized
operations that would benefit from SIMD, etc.?

> > > Is there a weighting scheme to SCHED_ULE where logical processors
> > > (like the SMT variety) get a lower score than real processors do, and
> > > thus get scheduled for less intensive interrupting tasks, or maybe
> > > just don't get scheduled in high use scenarios like it would if it was
> > > a physical processor?
> >
> > Err... wait. Didn't see that the turbo boost results didn't scale
> > linearly or align with one another until just a sec ago. Nevermind my
> > previous comment.
>
> Waiting for the fog to lift ..

As am I. I don't know enough in this area, but I'm definitely open
to learning.

Thanks,
-Garrett
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 15/04/2010 04:20 Maho NAKATA said the following:
> Hi Andriy and Adam,
>
> I did also the same thing as suggested.
>
> My conclusion: on Core i7 920, 2.66GHz, TurboBoost on, HyperThreading off,

So HyperThreading is off.

> then, pinned to each core like following
>
> % procstat -t 1408
>   PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
>  1408 100160 dgemm            -                  3  190 run     -
>  1408 100161 dgemm            -                  2  190 run     -
>  1408 100162 dgemm            -                  2  190 run     -
>  1408 100163 dgemm            -                  1  189 run     -
>  1408 100164 dgemm            -                  0  190 run     -
>  1408 100165 dgemm            -                  3  189 run     -
>  1408 100166 dgemm            -                  1  190 run     -
>  1408 100167 dgemm            initial thread     0  190 run     -

But there are still 8 threads.
Can you check how many threads you have on Linux with the same configuration?
Is it possible to tell GotoBLAS to use 4 threads? If yes, can you also test
that scenario?

Also, would it be possible for you to test recent 8-STABLE?
Just for the sake of experiment.

--
Andriy Gapon
Linux static linked ver doesn't work on FBSD (Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
From: Pieter de Goeje
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 16:05:18 +0200

> I think the best test would be to run a statically compiled linux binary on
> FreeBSD. That way the compiler settings are exactly the same.

It is not possible to run a Linux amd64 binary on FreeBSD amd64,
...nor the i386 version. GotoBLAS uses special system calls.

% ./dgemm
linux_sys_futex: unknown op 265
linux: pid 1264 (dgemm): syscall mbind not implemented
n: 3000
^C
It just halts.

--
Nakata Maho http://accc.riken.jp/maho/ , JA OOO http://ja.openoffice.org/
Blog: http://blog.goo.ne.jp/nakatamaho/ , GPG key: http://accc.riken.jp/maho/maho.pgp.txt
Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
On Wed, 14 Apr 2010, Garrett Cooper wrote:
> On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper wrote:
> > On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA wrote:
> >> Hi Andriy and Adam
> >>
> >> My test again. No desktop, etc. I just run dgemm.
> >> Contrary to Adam's result, Hyper Threading makes the performance worse.
> >> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
> >>
> >> Turbo Boost off, Hyper threading off: 82% (35GFlops) [1]
> >> Turbo Boost off, Hyper threading off: 72% (30.5GFlops) [2]

Er, shouldn't one of those say HTT on? and/or Turbo boost on? Else
they're both the same test as [4] but with different results?

> >> Turbo Boost on, Hyper threading on: 71% (32GFlops) [3]
> >> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

Clarification of all four possible test configs - 8 if you add pinning
CPUs or not - might make this a bit clearer?

> > Doesn't this make sense? Hyperthreaded cores in Intel procs still
> > provide an incomplete set of registers as they're logical processors,
> > so I would expect for things to be slower if they're automatically run
> > on the SMT cores instead of the physical ones.

Since we're talking FP, do HTT 'cores' share an FPU, or have their own?
If contended, you'd have to expect worse (at least FP) performance, no?

> > Is there a weighting scheme to SCHED_ULE where logical processors
> > (like the SMT variety) get a lower score than real processors do, and
> > thus get scheduled for less intensive interrupting tasks, or maybe
> > just don't get scheduled in high use scenarios like it would if it was
> > a physical processor?
>
> Err... wait. Didn't see that the turbo boost results didn't scale
> linearly or align with one another until just a sec ago. Nevermind my
> previous comment.

Waiting for the fog to lift ..

cheers, Ian
Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper wrote:
> On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA wrote:
>> Hi Andriy and Adam
>>
>> My test again. No desktop, etc. I just run dgemm.
>> Contrary to Adam's result, Hyper Threading makes the performance worse.
>> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>>
>> Turbo Boost off, Hyper threading off: 82% (35GFlops) [1]
>> Turbo Boost off, Hyper threading off: 72% (30.5GFlops) [2]
>>
>> Turbo Boost on, Hyper threading on: 71% (32GFlops) [3]
>> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
>
> Doesn't this make sense? Hyperthreaded cores in Intel procs still
> provide an incomplete set of registers as they're logical processors,
> so I would expect for things to be slower if they're automatically run
> on the SMT cores instead of the physical ones.
>
> Is there a weighting scheme to SCHED_ULE where logical processors
> (like the SMT variety) get a lower score than real processors do, and
> thus get scheduled for less intensive interrupting tasks, or maybe
> just don't get scheduled in high use scenarios like it would if it was
> a physical processor?

Err... wait. Didn't see that the turbo boost results didn't scale
linearly or align with one another until just a sec ago. Nevermind my
previous comment.

-Garrett
Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA wrote:
> Hi Andriy and Adam
>
> My test again. No desktop, etc. I just run dgemm.
> Contrary to Adam's result, Hyper Threading makes the performance worse.
> all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
>
> Turbo Boost off, Hyper threading off: 82% (35GFlops) [1]
> Turbo Boost off, Hyper threading off: 72% (30.5GFlops) [2]
>
> Turbo Boost on, Hyper threading on: 71% (32GFlops) [3]
> Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

Doesn't this make sense? Hyperthreaded cores in Intel procs still
provide an incomplete set of registers as they're logical processors,
so I would expect for things to be slower if they're automatically run
on the SMT cores instead of the physical ones.

Is there a weighting scheme to SCHED_ULE where logical processors
(like the SMT variety) get a lower score than real processors do, and
thus get scheduled for less intensive interrupting tasks, or maybe
just don't get scheduled in high use scenarios like it would if it was
a physical processor?

Thanks,
-Garrett
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi Adam,

From: Adam Vande More
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 12:47:31 -0500

> Since this is a full fledged desktop environment, 90% utilization seems
> pretty good.

No, I don't think so. Even on Ubuntu, where mine is running a full desktop
environment, GotoBLAS's dgemm performance is about 95% of peak.
dgemm on Linux is a lot more stable than on FreeBSD, and clearly faster.

on Ubuntu
$ ./dgemm
n: 3000
time : 51.18 or 12.795519
Mflops : 42216.341930
n: 3100
time : 56.28 or 14.261719
Mflops : 41791.049205
n: 3200
time : 61.35 or 15.631380
Mflops : 41939.023080
n: 3300
time : 67.79 or 17.247202
Mflops : 41685.474166
n: 3400
time : 73.80 or 18.471321
Mflops : 42569.300032
n: 3500
time : 81.48 or 20.781936
Mflops : 41273.585044
n: 3600
time : 88.17 or 22.816965
Mflops : 40907.246233
n: 3700
time : 95.21 or 23.864101
Mflops : 42462.684969
n: 3800

thanks
--
Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
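A note on reading the dgemm output above: the Mflops figures appear to follow from n and the second "time" value. The sketch below is an inference from the posted numbers only, not taken from the benchmark source. It assumes the second time is wall-clock seconds, that the program counts 2n^3 + 2n^2 floating-point operations per multiply, and that it repeats the multiply 10 times; the function name `reported_mflops` and the `repeats` parameter are hypothetical.

```python
def reported_mflops(n, wall_seconds, repeats=10):
    """Mflops implied by one 'n / time / Mflops' entry.

    Assumption (inferred from the posted numbers, not from the benchmark
    source): flop count = repeats * (2*n**3 + 2*n**2), divided by the
    second "time" value, taken here to be wall-clock seconds.
    """
    flops = repeats * (2 * n**3 + 2 * n**2)
    return flops / wall_seconds / 1e6


# (n, second "time" value, posted Mflops) copied from the Ubuntu run above.
UBUNTU_ROWS = [
    (3000, 12.795519, 42216.341930),
    (3100, 14.261719, 41791.049205),
]

for n, wall, posted in UBUNTU_ROWS:
    computed = reported_mflops(n, wall)
    print(f"n={n}: computed {computed:.1f} Mflops, posted {posted:.1f} Mflops")
```

If this reading is right, the first "time" value would be CPU time summed over all threads, which would explain why it is roughly 4x the second value with 4 threads and roughly 8x with HyperThreading on.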
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi Andriy and Adam,

I did also the same thing as suggested.

My conclusion: on Core i7 920, 2.66GHz, TurboBoost on, HyperThreading off,
the dgemm GotoBLAS performance was as follows.

*summary of result
36-39GFlops 81-87% of peak performance without pinning
35-40GFlops 78-89% of peak performance with pinning

my observation
* performance is somewhat unstable: 35GFlops, then 40GFlops on the next
  calculation, then it flips back, etc. Jitter is observed.
* pinning makes performance somewhat more stable, but we don't gain any
  more performance.

Details.
First I ran
% ./dgemm
n: 3500
time : 84.431008 or 22.428125
Mflops : 38244.168629
n: 3600
time : 90.162220 or 23.440381
Mflops : 39819.284422
n: 3700
time : 101.427504 or 27.404345
Mflops : 36977.121646

Note: 36-39GFlops 81-87% of peak performance

then, pinned to each core like following

% procstat -t 1408
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  3  190 run     -
 1408 100161 dgemm            -                  2  190 run     -
 1408 100162 dgemm            -                  2  190 run     -
 1408 100163 dgemm            -                  1  189 run     -
 1408 100164 dgemm            -                  0  190 run     -
 1408 100165 dgemm            -                  3  189 run     -
 1408 100166 dgemm            -                  1  190 run     -
 1408 100167 dgemm            initial thread     0  190 run     -

% cpuset -t 100160 -l 0
% cpuset -t 100161 -l 0
% cpuset -t 100162 -l 1
% cpuset -t 100163 -l 1
% cpuset -t 100164 -l 2
% cpuset -t 100165 -l 2
% cpuset -t 100166 -l 3
% cpuset -t 100167 -l 3

then,
% procstat -t 1408
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  0  191 run     -
 1408 100161 dgemm            -                  0  191 run     -
 1408 100162 dgemm            -                  1  190 run     -
 1408 100163 dgemm            -                  1  190 run     -
 1408 100164 dgemm            -                  2  190 run     -
 1408 100165 dgemm            -                  2  190 run     -
 1408 100166 dgemm            -                  3  190 run     -
 1408 100167 dgemm            initial thread     3  190 run     -

n: 4000
time : 121.907696 or 31.475052
Mflops : 40677.295630
n: 4100
time : 139.842701 or 38.702532
Mflops : 35624.444587
n: 4200
time : 143.622179 or 36.725949
Mflops : 40356.011158
n: 4300
time : 153.742976 or 39.465752
Mflops : 40301.013511
n: 4400
time : 164.919566 or 42.380653
Mflops : 40208.611317
n: 4500
time : 175.930335 or 45.422572
Mflops : 40132.139469

Thanks

From: Adam Vande More
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 12:47:31 -0500

> On Wed, Apr 14, 2010 at 11:51 AM, Andriy Gapon wrote:
>
>> on 14/04/2010 19:45 Adam Vande More said the following:
>> >
>> > also if I run cpuset on the dgemm then the utilization is basically at
>> > the theoretical max for one core so at least that part is working.
>>
>> You can also try procstat -t to find out thread IDs and cpuset -t to
>> pin the threads to the cores.
>>
>
> it gets to around 90% doing that.
>
> time : 103.617271 or 27.140992
> Mflops : 47172.925449
> n: 4100
> time : 113.910669 or 30.520677
> Mflops : 45174.496186
> n: 4200
> time : 121.880695 or 32.068070
> Mflops : 46217.711013
> n: 4300
>
> tried a couple of different thread orders but didn't seem to make a
> difference.
>
> [...]
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Oops, I missed this e-mail...

From: Adam Vande More
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 11:45:04 -0500

> On Wed, Apr 14, 2010 at 11:34 AM, Adam Vande More wrote:
>
>> That's about 67% utilization, turning off HTT drops it more. HTT on the
>> newer cores is good, not bad.
>>
>
> Well that was completely contrary to some tests I'd run when I first got
> the cpu.
>
> With HTT off:
>
> n: 3000
> time : 44.705516 or 11.760183
> Mflops : 45932.959253
> n: 3100
> time : 50.598581 or 14.270123
> Mflops : 41766.437458
> n: 3200
> time : 55.748192 or 15.780977
> Mflops : 41541.458400
> n: 3300
> time : 62.072217 or 17.441431
> Mflops : 41221.262070
> n: 3400
>
> so that's about 79% right there.
>
> also if I run cpuset on the dgemm then the utilization is basically at the
> theoretical max for one core so at least that part is working.
>
> --
> Adam Vande More
HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)
Hi Andriy and Adam

My test again. No desktop, etc. I just run dgemm.
Contrary to Adam's result, Hyper Threading makes the performance worse.
all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)

Turbo Boost off, Hyper threading off: 82% (35GFlops) [1]
Turbo Boost off, Hyper threading off: 72% (30.5GFlops) [2]

Turbo Boost on, Hyper threading on: 71% (32GFlops) [3]
Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

---my system---
CPU: Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (2683.44-MHz K8-class CPU)
  Origin = "GenuineIntel" Id = 0x106a5 Stepping = 5
  Features=0xbfebfbff
  Features2=0x98e3bd
  AMD Features=0x28100800
  AMD Features2=0x1
  TSC: P-state invariant
real memory = 12884901888 (12288 MB)
avail memory = 12387717120 (11813 MB)
ACPI APIC Table: <110909 APIC1026>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
---my system---

---DETAILS---
[1]
% ./dgemm
n: 3000
time : 57.666717 or 16.339074
Mflops : 33060.624827
n: 3100
time : 61.502677 or 16.597376
Mflops : 35910.025544
n: 3200
time : 69.075401 or 19.199833
Mflops : 34144.297133
n: 3300
time : 73.699540 or 19.633594
Mflops : 36618.756539
n: 3400
time : 82.256194 or 22.373651
Mflops : 35144.518837
n: 3500
time : 88.975662 or 24.118761
Mflops : 35563.394249
n: 3600
time : 96.436652 or 26.027588
Mflops : 35861.148385
n: 3700

[2]
% ./dgemm
n: 3000
time : 139.622739 or 17.693806
Mflops : 30529.327312
n: 3100
time : 154.344971 or 19.566886
Mflops : 30460.247702
n: 3200
time : 169.507739 or 21.467100
Mflops : 30538.116602
n: 3300
time : 186.363773 or 23.615281
Mflops : 30444.600545
n: 3400
time : 203.798979 or 25.817667
Mflops : 30456.322788
n: 3500
...

[3]
% ./dgemm
n: 3000
time : 134.673079 or 16.958682
Mflops : 31852.711082
n: 3100
time : 148.410085 or 18.663248
Mflops : 31935.073574
n: 3200
time : 162.835473 or 20.468825
Mflops : 32027.475770
n: 3300
time : 179.025370 or 22.479189
Mflops : 31983.262501
n: 3400
time : 195.859710 or 24.663009
Mflops : 31882.208788
n: 3500

[4]
% ./dgemm
n: 3000
time : 54.259647 or 14.684309
Mflops : 36786.204907
n: 3100
time : 60.899147 or 17.124599
Mflops : 34804.447141
n: 3200
time : 64.295342 or 17.490787
Mflops : 37480.577569
n: 3300
time : 69.781247 or 18.288840
Mflops : 39311.284796
n: 3400
time : 79.234397 or 21.829736
Mflops : 36020.187858
n: 3500
time : 83.905419 or 22.381237
Mflops : 38324.289174
n: 3600
time : 92.195022 or 25.105942
Mflops : 37177.621122
n: 3700
time : 97.718841 or 25.434243
Mflops : 39841.319494
n: 3800
time : 105.740463 or 27.414029
Mflops : 40042.592613
n: 3900
time : 113.980157 or 29.678505
Mflops : 39984.635420
n: 4000
time : 122.941569 or 31.946174
Mflops : 40077.412531
n: 4100
---DETAILS---

From: Adam Vande More
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 11:34:45 -0500

>> > time : 162.45 or 20.430651
>> > Mflops : 32087.318295
>> > n: 3300
>> > time : 178.497079 or 22.446093
>> > Mflops : 32030.420499
>> > n: 3400
>> > time : 195.550715 or 24.586152
>> > Mflops : 31981.873273
>> > n: 3500
>> > time : 213.403379 or 26.825058
>> > Mflops : 31975.513363
>> > n: 3600
>> > ...
>> > above output is on Core i7 920 (2.66GHz; TurboBoost on)
>>
>> My results:
>> $ ./dgemm
>> n: 3000
>> time : 54.151302 or 28.189781
>> Mflops : 19162.263125
>> n: 3100
>> time : 60.157449 or 32.214141
>> Mflops : 18501.570537
>> n: 3200
>> time : 65.753191 or 34.114872
>> Mflops : 19216.393378
>>
>> CPU:
>> CPU: Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz (2653.35-MHz K8-class CPU)
>>   Origin = "GenuineIntel" Id = 0x10676 Stepping = 6
>>   Features=0xbfebfbff
>>   Features2=0x8e39d
>>   AMD Features=0x20100800
>>   AMD Features2=0x1
>>   TSC: P-state invariant
>> ⋮
>> FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
>> FreeBSD/SMP: 1 package(s) x 2 core(s)
>>
>> FreeBSD:
>> FreeBSD 8.0-STABLE r205070 amd64
>>
>> Please note that the system was not dedicated to the test, I had
>> Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons
>> running.
>> That probably explains irregularities in the results.
>>
>> I am not sure how exactly theoretical maximum should be calculated, I used
>> 2 * 2.66G * 4 ≈ 21.3G.
>> And so 19.2G / 21.3G ≈ 90%.
>>
>> Not as bad as what you get.
>> Although not as good as what you report for Linux.
>> But given the impurity and imprecision of my test…
>>
>> P.S. the machine is two-core obviously :-)
>> Don't have anything with more cpus/cores handy.
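The peak-performance arithmetic quoted above (2 * 2.66G * 4 ≈ 21.3G, then 19.2G / 21.3G ≈ 90%) can be written out as a small sketch. The factor of 4 double-precision flops per cycle per core is the one in Andriy's formula; applying the same factor to the Core i7 920 is an assumption here, although it does reproduce the 82% and 71% figures posted earlier in this message. The function names are hypothetical.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle=4):
    """Theoretical peak in GFlops: cores * clock * flops per cycle per core.

    flops_per_cycle=4 is the factor used in the quoted formula; it is an
    assumption for any CPU other than the one it was quoted for.
    """
    return cores * clock_ghz * flops_per_cycle


def utilization(measured_gflops, peak):
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak


# Core 2 Duo E7300: the quoted calculation, 2 cores at 2.66 GHz.
e7300 = peak_gflops(cores=2, clock_ghz=2.66)
print(f"E7300 peak: {e7300:.1f} GFlops, 19.2 GFlops is {utilization(19.2, e7300):.0%}")

# Core i7 920: 4 cores, 2.66 GHz nominal, 2.8 GHz with Turbo Boost
# (clock figures as stated in the message above).
i7_nominal = peak_gflops(cores=4, clock_ghz=2.66)
i7_turbo = peak_gflops(cores=4, clock_ghz=2.8)
print(f"i7 920 nominal: 35 GFlops is {utilization(35.0, i7_nominal):.0%}")
print(f"i7 920 turbo:   32 GFlops is {utilization(32.0, i7_turbo):.0%}")
```

Which clock to use as the denominator matters: the same measured GFlops reads about 5% lower against the Turbo Boost clock than against the nominal one.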
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
From: Andriy Gapon
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 16:19:13 +0300

> on 14/04/2010 02:21 Maho NAKATA said the following:
>> 2. install ports/math/gotoblas (manual download required)
>> make install
>
> Do you know how gotoblas on Linux was obtained?
Yes. Just download the archive.

> Was it built from source?
Yes.

> Has it come pre-packaged?
No.

> If so, can you find out details of its build configuration?
I'm not sure. I built it as follows on Ubuntu 9.10 amd64.

$ tar xvfz GotoBLAS2-1.13.tar.gz
$ cd GotoBLAS2
$ ./quickbuild.64bit
ln -fs libgoto2_nehalemp-r1.13.a libgoto2.a
for d in interface driver/level2 driver/level3 driver/others kernel lapack ; \
do if test -d $d; then \
make -j 8 -C $d libs || exit 1 ; \
fi; \
done
make[1]: Entering directory `/home/maho/a/GotoBLAS2/interface'
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=saxpy -DASMFNAME=saxpy_ -DNAME=saxpy_ -DCNAME=saxpy -DCHAR_NAME=\"saxpy_\" -DCHAR_CNAME=\"saxpy\" -I.. -I. -UDOUBLE -UCOMPLEX -c axpy.c -o saxpy.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=sswap -DASMFNAME=sswap_ -DNAME=sswap_ -DCNAME=sswap -DCHAR_NAME=\"sswap_\" -DCHAR_CNAME=\"sswap\" -I.. -I. -UDOUBLE -UCOMPLEX -c swap.c -o sswap.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=scopy -DASMFNAME=scopy_ -DNAME=scopy_ -DCNAME=scopy -DCHAR_NAME=\"scopy_\" -DCHAR_CNAME=\"scopy\" -I.. -I. -UDOUBLE -UCOMPLEX -c copy.c -o scopy.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=sscal -DASMFNAME=sscal_ -DNAME=sscal_ -DCNAME=sscal -DCHAR_NAME=\"sscal_\" -DCHAR_CNAME=\"sscal\" -I.. -I. -UDOUBLE -UCOMPLEX -c scal.c -o sscal.o

--
Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Wed, Apr 14, 2010 at 11:51 AM, Andriy Gapon wrote:

> on 14/04/2010 19:45 Adam Vande More said the following:
> >
> > also if I run cpuset on the dgemm then the utilization is basically at
> > the theoretical max for one core so at least that part is working.
>
> You can also try procstat -t to find out thread IDs and cpuset -t to pin the
> threads to the cores.

It gets to around 90% doing that.

time : 103.617271 or 27.140992
Mflops : 47172.925449
n: 4100
time : 113.910669 or 30.520677
Mflops : 45174.496186
n: 4200
time : 121.880695 or 32.068070
Mflops : 46217.711013
n: 4300

Tried a couple of different thread orders but didn't seem to make a difference.

galacticdominator% procstat -t 1922
  PID    TID COMM   TDNAME          CPU  PRI STATE WCHAN
 1922 100092 dgemm  initial thread    0  190 run   -
 1922 100268 dgemm  -                 1  190 run   -
 1922 100270 dgemm  -                 1  191 run   -
 1922 100272 dgemm  -                 3  190 run   -
 1922 100273 dgemm  -                 2  191 run   -
 1922 100274 dgemm  -                 2  191 run   -
 1922 100282 dgemm  -                 0  190 run   -
 1922 100283 dgemm  -                 3  190 run   -

galacticdominator% cpuset -t 100092 -l 0
galacticdominator% cpuset -t 100268 -l 1
galacticdominator% cpuset -t 100270 -l 2
galacticdominator% cpuset -t 100272 -l 3
galacticdominator% cpuset -t 100273 -l 0
galacticdominator% cpuset -t 100274 -l 1
galacticdominator% cpuset -t 100282 -l 2
galacticdominator% cpuset -t 100283 -l 3

galacticdominator% cpuset -t 100092 -l 0
galacticdominator% cpuset -t 100268 -l 0
galacticdominator% cpuset -t 100270 -l 1
galacticdominator% cpuset -t 100272 -l 1
galacticdominator% cpuset -t 100273 -l 2
galacticdominator% cpuset -t 100274 -l 2
galacticdominator% cpuset -t 100282 -l 3
galacticdominator% cpuset -t 100283 -l 3

This is from the second set:

time : 150.348850 or 40.488350
Mflops : 45022.951141
n: 4600
time : 161.968982 or 43.589618
Mflops : 44669.884500
n: 4700

Since this is a full-fledged desktop environment, 90% utilization seems pretty good.

I'm no expert, Andriy, but it seems like if gotoblas implemented some of the FreeBSD optimizations then we'd be in the same ballpark.

-- Adam Vande More
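The two pinning orders tried in this message can be generated instead of typed by hand. The following is only a sketch: the function names are made up for illustration, ncores is assumed to be positive, and the thread IDs still have to be read from procstat -t. It only builds the command lines; it does not call any FreeBSD API.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Round-robin placement: thread i runs on core i % ncores
// (the first set of cpuset commands). Assumes ncores > 0.
std::vector<int> round_robin_cores(std::size_t nthreads, int ncores) {
    const std::size_t n = static_cast<std::size_t>(ncores);
    std::vector<int> cores;
    for (std::size_t i = 0; i < nthreads; ++i)
        cores.push_back(static_cast<int>(i % n));
    return cores;
}

// Paired placement: consecutive threads share a core
// (the second set of cpuset commands). Assumes ncores > 0.
std::vector<int> paired_cores(std::size_t nthreads, int ncores) {
    const std::size_t n = static_cast<std::size_t>(ncores);
    const std::size_t per_core = (nthreads + n - 1) / n;  // threads per core, rounded up
    std::vector<int> cores;
    for (std::size_t i = 0; i < nthreads; ++i)
        cores.push_back(static_cast<int>(i / per_core));
    return cores;
}

// Format one "cpuset -t <tid> -l <core>" line per thread.
std::vector<std::string> cpuset_commands(const std::vector<int>& tids,
                                         const std::vector<int>& cores) {
    std::vector<std::string> cmds;
    for (std::size_t i = 0; i < tids.size() && i < cores.size(); ++i)
        cmds.push_back("cpuset -t " + std::to_string(tids[i]) +
                       " -l " + std::to_string(cores[i]));
    return cmds;
}
```

With the eight thread IDs listed above and four cores, round_robin_cores gives 0,1,2,3,0,1,2,3 and paired_cores gives 0,0,1,1,2,2,3,3, matching the two sets of commands.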
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 14/04/2010 19:45 Adam Vande More said the following: > > also if I run cpuset on the dgemm then the utilization is basically at > the theoretical max for one core so at least that part is working. You can also try procstat -t to find out thread IDs and cpuset -t to pin the threads to the cores. -- Andriy Gapon ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Wed, Apr 14, 2010 at 11:34 AM, Adam Vande More wrote:

> That's about 67% utilization, turning off HTT drops it more. HTT on the
> newer cores is good, not bad.

Well, that was completely contrary to some tests I'd run when I first got the CPU. With HTT off:

n: 3000
time : 44.705516 or 11.760183
Mflops : 45932.959253
n: 3100
time : 50.598581 or 14.270123
Mflops : 41766.437458
n: 3200
time : 55.748192 or 15.780977
Mflops : 41541.458400
n: 3300
time : 62.072217 or 17.441431
Mflops : 41221.262070
n: 3400

So that's about 79% right there.

Also, if I run cpuset on the dgemm then the utilization is basically at the theoretical max for one core, so at least that part is working.

-- Adam Vande More
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Wed, Apr 14, 2010 at 10:26 AM, Andriy Gapon wrote: > on 14/04/2010 02:21 Maho NAKATA said the following: > > 4. run dgemm. > > % ./dgemm > > n: 3000 > > time : 134.648208 or 16.910525 > > Mflops : 31943.419695 > > n: 3100 > > time : 148.122279 or 18.615284 > > Mflops : 32017.357408 > > n: 3200 > > time : 162.45 or 20.430651 > > Mflops : 32087.318295 > > n: 3300 > > time : 178.497079 or 22.446093 > > Mflops : 32030.420499 > > n: 3400 > > time : 195.550715 or 24.586152 > > Mflops : 31981.873273 > > n: 3500 > > time : 213.403379 or 26.825058 > > Mflops : 31975.513363 > > n: 3600 > > ... > > above output is on Core i7 920 (2.66GHz; TurboBoost on) > > My results: > $ ./dgemm > n: 3000 > time : 54.151302 or 28.189781 > Mflops : 19162.263125 > n: 3100 > time : 60.157449 or 32.214141 > Mflops : 18501.570537 > n: 3200 > time : 65.753191 or 34.114872 > Mflops : 19216.393378 > > CPU: > CPU: Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz (2653.35-MHz K8-class > CPU) > Origin = "GenuineIntel" Id = 0x10676 Stepping = 6 > > > Features=0xbfebfbff > > Features2=0x8e39d > AMD Features=0x20100800 > AMD Features2=0x1 > TSC: P-state invariant > ⋮ > FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs > FreeBSD/SMP: 1 package(s) x 2 core(s) > > FreeBSD: > FreeBSD 8.0-STABLE r205070 amd64 > > Please note that the system was not dedicated to the test, I had > Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons > running. > That probably explains irregularities in the results. > > I am not sure how exactly theoretical maximum should be calculated, I used > 2 * > 2.66G * 4 ≈ 21.3G. > And so 19.2G / 21.3G ≈ 90%. > > Not as bad as what you get. > Although not as good as what you report for Linux. > But given the impurity and imprecision of my test… > > P.S. the machine is two-core obviously :-) > Don't have anything with more cpus/cores handy. > > P.P.S. 
Having _only glimpsed_ at the source I think that there are some > things > that GotoBLAS doesn't try to do on FreeBSD that it tries to do on Linux. > Like setting CPU-affinity for the threads, or avoiding HTT pseudo-cores. > Those things are possible on FreeBSD. > Perhaps, there are more things like that. > > Mine is also a live desktop enviro, kde4+ n: 3000 time : 116.377609 or 16.696066 Mflops : 32353.729042 n: 3100 time : 127.230336 or 17.274867 Mflops : 34501.695325 n: 3200 time : 139.018175 or 18.342056 Mflops : 35741.074976 n: 3300 time : 152.519365 or 20.154714 Mflops : 35671.942364 n: 3400 time : 166.248145 or 21.952426 Mflops : 35818.874941 n: 3500 time : 182.565385 or 24.492597 Mflops : 35020.581786 n: 3600 time : 198.551018 or 26.906992 Mflops : 34689.094992 n: 3700 time : 215.428919 or 28.574964 Mflops : 35462.294838 n: 3800 ^C CPU: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (3313.71-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x106e5 Family = 6 Model = 1e Stepping = 5 Features=0xbfebfbff Features2=0x98e3fd AMD Features=0x28100800 AMD Features2=0x1 TSC: P-state invariant That's about 67% utilization, turning off HTT drops it more. HTT on the newer cores is good, not bad. -- Adam Vande More ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 14/04/2010 02:21 Maho NAKATA said the following: > 4. run dgemm. > % ./dgemm > n: 3000 > time : 134.648208 or 16.910525 > Mflops : 31943.419695 > n: 3100 > time : 148.122279 or 18.615284 > Mflops : 32017.357408 > n: 3200 > time : 162.45 or 20.430651 > Mflops : 32087.318295 > n: 3300 > time : 178.497079 or 22.446093 > Mflops : 32030.420499 > n: 3400 > time : 195.550715 or 24.586152 > Mflops : 31981.873273 > n: 3500 > time : 213.403379 or 26.825058 > Mflops : 31975.513363 > n: 3600 > ... > above output is on Core i7 920 (2.66GHz; TurboBoost on) My results: $ ./dgemm n: 3000 time : 54.151302 or 28.189781 Mflops : 19162.263125 n: 3100 time : 60.157449 or 32.214141 Mflops : 18501.570537 n: 3200 time : 65.753191 or 34.114872 Mflops : 19216.393378 CPU: CPU: Intel(R) Core(TM)2 Duo CPU E7300 @ 2.66GHz (2653.35-MHz K8-class CPU) Origin = "GenuineIntel" Id = 0x10676 Stepping = 6 Features=0xbfebfbff Features2=0x8e39d AMD Features=0x20100800 AMD Features2=0x1 TSC: P-state invariant ⋮ FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs FreeBSD/SMP: 1 package(s) x 2 core(s) FreeBSD: FreeBSD 8.0-STABLE r205070 amd64 Please note that the system was not dedicated to the test, I had Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons running. That probably explains irregularities in the results. I am not sure how exactly theoretical maximum should be calculated, I used 2 * 2.66G * 4 ≈ 21.3G. And so 19.2G / 21.3G ≈ 90%. Not as bad as what you get. Although not as good as what you report for Linux. But given the impurity and imprecision of my test… P.S. the machine is two-core obviously :-) Don't have anything with more cpus/cores handy. P.P.S. Having _only glimpsed_ at the source I think that there are some things that GotoBLAS doesn't try to do on FreeBSD that it tries to do on Linux. Like setting CPU-affinity for the threads, or avoiding HTT pseudo-cores. Those things are possible on FreeBSD. Perhaps, there are more things like that. 
-- Andriy Gapon ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Wednesday 14 April 2010 15:19:13 Andriy Gapon wrote: > on 14/04/2010 02:21 Maho NAKATA said the following: > > 2. install ports/math/gotoblas (manual download required) > > make install > > Do you know how gotoblas on Linux was obtained? > Was it built from source? > Has it come pre-packaged? > If so, can you find out details of its build configuration? > > Thanks! I think the best test would be to run a statically compiled linux binary on FreeBSD. That way the compiler settings are exactly the same. - Pieter ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 14/04/2010 02:21 Maho NAKATA said the following: > 2. install ports/math/gotoblas (manual download required) > make install Do you know how gotoblas on Linux was obtained? Was it built from source? Has it come pre-packaged? If so, can you find out details of its build configuration? Thanks! -- Andriy Gapon ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi all,

Thanks for showing interest in this issue. I uploaded my test code so that you can test it on your PC. The instructions follow.

1. Download my source code.
http://people.freebsd.org/~maho/dgemm/Makefile
http://people.freebsd.org/~maho/dgemm/dgemm.cpp
Check the MD5 checksums.
% md5 Makefile dgemm.cpp
MD5 (Makefile) = b408ab1e1f5bf8b923cae5ec9f9f0f07
MD5 (dgemm.cpp) = 0d774a456a665429c67c2b07fd24c64c

2. Install ports/math/gotoblas (manual download required).
make install

3. Compile dgemm.cpp; just type make.
% make
g++44 -pthread -static -O2 -o dgemm dgemm.cpp -L/usr/local/lib -lgoto2p
g++44 -pthread -static -O2 -o dgemm_ref dgemm.cpp -L/usr/local/lib -lblas -lgfortran

4. Run dgemm.
% ./dgemm
n: 3000
time : 134.648208 or 16.910525
Mflops : 31943.419695
n: 3100
time : 148.122279 or 18.615284
Mflops : 32017.357408
n: 3200
time : 162.450000 or 20.430651
Mflops : 32087.318295
n: 3300
time : 178.497079 or 22.446093
Mflops : 32030.420499
n: 3400
time : 195.550715 or 24.586152
Mflops : 31981.873273
n: 3500
time : 213.403379 or 26.825058
Mflops : 31975.513363
n: 3600
...

The above output is from a Core i7 920 (2.66GHz; TurboBoost on).

Thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 13/04/2010 02:33 Maho NAKATA said the following: > From: Andriy Gapon >> Another question is what compilers (what versions of GCC) were used on both >> system to compile the program? > > Hi > > on Ubuntu $ gcc -v Using built-in specs. Target: x86_64-linux-gnu Configured > with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.1-4ubuntu9' > --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs > --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared > --enable-multiarch --enable-linker-build-id --with-system-zlib > --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix > --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 > --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc > --disable-werror --with-arch-32=i486 --with-tune=generic > --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu > --target=x86_64-linux-gnu Thread model: posix gcc version 4.4.1 (Ubuntu > 4.4.1-4ubuntu9) > > on FreeBSD % gcc44 -v Using built-in specs. Target: x86_64-portbld-freebsd8.0 > Configured with: ./../gcc-4.4-20100330/configure --disable-nls > --libdir=/usr/local/lib/gcc44 --libexecdir=/usr/local/libexec/gcc44 > --program-suffix=44 --with-as=/usr/local/bin/as --with-gmp=/usr/local > --with-gxx-include-dir=/usr/local/lib/gcc44/include/c++/ > --with-ld=/usr/local/bin/ld --with-libiconv-prefix=/usr/local > --with-system-zlib --disable-libgcj --prefix=/usr/local > --mandir=/usr/local/man --infodir=/usr/local/info/gcc44 > --build=x86_64-portbld-freebsd8.0 Thread model: posix gcc version 4.4.4 > 20100330 (prerelease) (GCC) Is this what was used to compile the code in hot path (the code that performs all the actual calculations)? The answer is not obvious. GCC 4.4 is known to produce better code for modern CPUs, partially because it has knowledge of recently introduced instructions. 
-- Andriy Gapon ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Tue, Apr 13, 2010 at 12:35 AM, Andrew Snow wrote: > > The statements about the scheduler flipping between cores is also somewhat > false, ULE does the right thing now for long-running computational threads. > > Furthermore, I can't see how a Gflops benchmark which fits in the CPU cache > has anything to do with the memory architecture of the operating system. > > It can. Search the web for descriptions of page coloring. Roughly speaking, if your cache is physically indexed, the way in which the virtual memory system allocates physical pages to virtual addresses can affect whether or not the cache is fully utilized. In a pathological case, those physical pages that your application touches reside in the same part of the cache and consequently you suffer frequent conflict misses. Meanwhile, the other parts of the cache go unused. Page coloring creates a predictable mapping between virtual and physical addresses so that a carefully written application can avoid the pathological case. Our support for superpages has the same effect. Alan ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
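To make the page-coloring argument concrete, here is a small sketch of the usual textbook model of a physically indexed, set-associative cache. The function names are invented for illustration, and the cache parameters used in the example afterwards (8 MiB, 16-way, 4 KiB pages) are assumptions chosen for round numbers, not measurements of the machines in this thread. Sizes are assumed to be powers of two and nonzero.

```cpp
#include <cstdint>

// Number of page colors: how many page-sized slices one way of the cache holds.
std::uint64_t num_colors(std::uint64_t cache_bytes, std::uint64_t ways,
                         std::uint64_t page_bytes) {
    return cache_bytes / (ways * page_bytes);
}

// Color of the physical page containing paddr. Pages with the same color
// map to the same cache sets and therefore compete with each other.
std::uint64_t page_color(std::uint64_t paddr, std::uint64_t cache_bytes,
                         std::uint64_t ways, std::uint64_t page_bytes) {
    return (paddr / page_bytes) % num_colors(cache_bytes, ways, page_bytes);
}
```

Under these assumed parameters there are 128 colors, so physical pages 0 and 128 land in the same cache sets. If every page an application touches happened to have the same color, only 1/128 of the cache would be used, which is the pathological case described above.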
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
The statements about the scheduler flipping between cores are also somewhat false; ULE does the right thing now for long-running computational threads.

Furthermore, I can't see how a Gflops benchmark which fits in the CPU cache has anything to do with the memory architecture of the operating system.

I assume that to reach these results the benchmark was multi-threaded, and so I think I'd start by looking at the scheduler. Before that I'd probably look at the libraries, how they were compiled, differences in the compiler, etc.

- Andrew
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Sun, Apr 11, 2010 at 11:12 PM, Maho NAKATA wrote:

> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm
> (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10 /amd64. I'm really disappointed.
>
> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developers of GotoBLAS. He
> told me that
> FreeBSD is not suitable OS for scientific computing or high performance
> computing. He says
> (in Japanese and my translation):
>
> > I guess FreeBSD does page coloring, but I don't think FreeBSD considers
> > very large cache
> > size which recent CPU has. Support of a very large cache on Linux is
> > still not very will
> > sophisticated, but on *BSDs, its worst; they uses too fine memory
> > allocation method,
> > so we cannot expect large continuous physical memory allocation.

These statements about FreeBSD's memory management are wrong, or at least outdated. FreeBSD is very likely to allocate physical memory in contiguous chunks to your memory-hungry application even if automatic superpage promotion does not occur. You should refer your friend to my paper at http://www.usenix.org/events/osdi02/tech/full_papers/navarro/navarro_html/ and tell him that FreeBSD >= 7.2 implements a variation on what that paper describes.

Regards,
Alan
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Maho NAKATA writes: > From: Michael Poole > Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, > Corei7 920 > Date: Mon, 12 Apr 2010 10:06:55 -0400 > >> Nakata-san's theoretical performance numbers assume 4 to 4.2 operations >> per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate. >> (DGEMM is double precision, but I am not familiar enough with scientific >> computing or with the Nehalem implementation of SSE to know why it is >> four operations per cycle rather than two -- is it because double >> precision counts as two FLOPs or is it because of multiple issue?) >> TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the >> theoretical peak performance or the performance discrepancy very well. > > Hi Michael, > I read http://www.intel.com/support/processors/sb/cs-023143.htm > and TurboBoost on 920 is 2.80GHz. Ah. I was looking at http://ark.intel.com/Product.aspx?id=37147 . Given a 2.80 GHz TurboBoost, the 44.8 GFLOPS theoretical performance number makes sense. I think the more important point is that TurboBoost on this CPU gives at most a 10% speedup, so it cannot explain the 25% performance difference. Michael ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi Bruce,

Many thanks. I like FreeBSD, especially ports, since I have been a ports committer for 8 years, so I'll do what I can. The first step might be to produce reproducible results and provide a better analysis for ports/math/gotoblas.

Thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi,

Many thanks for your interest. I used the following program to measure the FLOPS. I'll provide more details later. You may need to change dgemm_f77 to something else to link against GotoBLAS (ports/math/gotoblas). I think you can use math/atlas, but it takes too long to compile...

---
// The header names were lost in the archive; these are the ones the code needs.
#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

#define F77_FUNC(name,NAME) name ## _
// The BLAS header included at this point was also lost, so dgemm_f77 is
// declared by hand here (assumption: the standard Fortran dgemm_ symbol).
#define dgemm_f77 F77_FUNC(dgemm,DGEMM)
extern "C" void dgemm_f77(const char *transa, const char *transb,
                          int *m, int *n, int *k, double *alpha,
                          double *A, int *lda, double *B, int *ldb,
                          double *beta, double *C, int *ldc);

#define MAXLOOP 10

// User CPU time of the whole process (all threads), in microseconds.
unsigned long long microseconds()
{
    rusage t;
    timeval tv;
    getrusage( RUSAGE_SELF, &t );
    tv = t.ru_utime;
    return ((unsigned long long)tv.tv_sec)*1000000 + tv.tv_usec;
}

// Wall-clock time in seconds.
double gettimeofday_sec()
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + (double)tv.tv_usec*1e-6;
}

int main()
{
    int n;
    double alpha = 3.14, beta = 2.717;
    double dgemmtime, t1, t2, t_1, t_2;

    for (n = 3000 ; n < 10000; n=n+100) {
        printf("n: %d\n", (int)n);
        double *A = new double[n*n];
        double *B = new double[n*n];
        double *C = new double[n*n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                A[i*n+j] = i * j + 1;
                B[i*n+j] = (i+1) * (j+1) + 1;
                C[i*n+j] = (i+1) - (j+1) + 1;
            }
        }
        t1 = (double)microseconds();
        t_1 = gettimeofday_sec();
        for (int p = 0 ; p < MAXLOOP; p++ ){
            dgemm_f77("n", "n", &n, &n, &n, &alpha, A, &n, B, &n, &beta, C, &n);
        }
        t2 = (double)microseconds();
        t_2 = gettimeofday_sec();
        // The Mflops figure is based on wall-clock time, not CPU time.
        dgemmtime = (t_2 - t_1);
        printf("time : %lf or %lf \n", (t2 - t1) * 1e-6, t_2 - t_1);
        printf("Mflops : %lf\n", ( 2.0 * (double)n * (double)n * (double)n + 2.0 * (double)n * (double)n ) * MAXLOOP / dgemmtime / (1000*1000) );
        delete[]C;
        delete[]B;
        delete[]A;
    }
}

-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi Andriy and Jeremy In my case, % sysctl vm.pmap.pg_ps_enabled vm.pmap.pg_ps_enabled: 1 thanks a lot! From: Jeremy Chadwick Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920 Date: Mon, 12 Apr 2010 08:00:23 -0700 > On Mon, Apr 12, 2010 at 05:41:35PM +0300, Andriy Gapon wrote: >> Perhaps, he talks about support of large pages (2M) and related improvements >> in >> TLB performance. If so, he (and you) may read about 'superpages' feature of >> FreeBSD. >> I am not sure if it is enabled by default in 8.0, you can check >> vm.pmap.pg_ps_enabled. > > On 8.0-RELEASE and later, they are. Line 183: > > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c?annotate=1.667.2.12 > > Commit where they got enabled by default (approx. 16 months ago): > > http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c#rev1.646 > > -- > | Jeremy Chadwick j...@parodius.com | > | Parodius Networking http://www.parodius.com/ | > | UNIX Systems Administrator Mountain View, CA, USA | > | Making life hard for others since 1977. PGP: 4BD6C0CB | > > ___ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org" > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
From: Andriy Gapon Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920 Date: Mon, 12 Apr 2010 17:49:24 +0300 > on 12/04/2010 17:41 Andriy Gapon said the following: >> It would also be get good to learn more about your program. >> How much memory does it typically use, how does it allocate it? >> Is it single-threaded or not? If not, how many threads does it have and >> what do >> they do, how do they communicate? > > Another question is what compilers (what versions of GCC) were used on both > system > to compile the program? Hi on Ubuntu $ gcc -v Using built-in specs. Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Ubuntu 4.4.1-4ubuntu9' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --program-suffix=-4.4 --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --disable-werror --with-arch-32=i486 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9) on FreeBSD % gcc44 -v Using built-in specs. 
Target: x86_64-portbld-freebsd8.0 Configured with: ./../gcc-4.4-20100330/configure --disable-nls --libdir=/usr/local/lib/gcc44 --libexecdir=/usr/local/libexec/gcc44 --program-suffix=44 --with-as=/usr/local/bin/as --with-gmp=/usr/local --with-gxx-include-dir=/usr/local/lib/gcc44/include/c++/ --with-ld=/usr/local/bin/ld --with-libiconv-prefix=/usr/local --with-system-zlib --disable-libgcj --prefix=/usr/local --mandir=/usr/local/man --infodir=/usr/local/info/gcc44 --build=x86_64-portbld-freebsd8.0 Thread model: posix gcc version 4.4.4 20100330 (prerelease) (GCC) thanks -- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
From: Michael Poole
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Mon, 12 Apr 2010 10:06:55 -0400

> Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
> per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
> (DGEMM is double precision, but I am not familiar enough with scientific
> computing or with the Nehalem implementation of SSE to know why it is
> four operations per cycle rather than two -- is it because double
> precision counts as two FLOPs or is it because of multiple issue?)
> TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
> theoretical peak performance or the performance discrepancy very well.

Hi Michael,

I read http://www.intel.com/support/processors/sb/cs-023143.htm and TurboBoost on the 920 is 2.80GHz.

> why it is four operations per cycle rather than two

It's a bit strange to me as well. However, I ran dgemm with m = k = n, and in this case the flop count is 2n^3 + 2n^2 (counting only 2n^3 would also be fine).

Thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi Antony,

I think this is not the case. I tested with TurboBoost on and off on Ubuntu, and GotoBLAS achieved 95% of theoretical performance in both cases.
cf. http://www.intel.com/support/processors/sb/cs-023143.htm
and http://blog.goo.ne.jp/nakatamaho/e/86c0f4ac529fd5b530454ed795e6b466 (written in Japanese, though)

Thanks

From: Antony Mawer
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Mon, 12 Apr 2010 23:58:17 +1000

> This may well be the same sort of issue that was discussed in this thread
> here:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
>
> In short, the Core i7 CPUs have a feature called "TurboBoost" where
> the clock speed of one or more cores is boosted when other cores are
> idle and in a C2 or C3 sleep status ... if the appropriate power
> saving mode isn't active on the system (which I don't think FreeBSD
> does by default?), the idle cores are never put into the appropriate
> power saving state, and as a result TurboBoost never kicks in...
>
> It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
> not (out of the box)?
>
> Of course it may be something else entirely, but worth checking out...
>
> --Antony
>
> On Mon, Apr 12, 2010 at 7:31 PM, Adrian Chadd wrote:
>> Of course, what would be helpful is actually figuring out what is
>> going on rather than some conjecture. :)
>>
>> With what he said, tweaking memory allocation under FreeBSD and/or
>> linux would change the performance characteristics and either validate
>> or disprove his assumptions?
>>
>>
>> Adrian
>>
>> On 12 April 2010 12:12, Maho NAKATA wrote:
>>> Hi FreeBSD developers,
>>> [the original article in Japanese can be found at
>>> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>>>
>>> *Abstract*
>>> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
>>> using dgemm
>>> (a linear algebra routine, matrix-matrix multiplication).
>>> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and >>> almost 95% on Ubuntu 9.10 /amd64. I'm really disappointed. >>> >>> *Introduction* >>> I'm a friend of Gotoh Kazushige, the principal developers of GotoBLAS. He >>> told me that >>> FreeBSD is not suitable OS for scientific computing or high performance >>> computing. He says >>> (in Japanese and my translation): >>> >>>> I guess FreeBSD does page coloring, but I don't think FreeBSD considers >>>> very large cache >>>> size which recent CPU has. Support of a very large cache on Linux is still >>>> not very will >>>> sophisticated, but on *BSDs, its worst; they uses too fine memory >>>> allocation method, >>>> so we cannot expect large continuous physical memory allocation. >>>> Moreover, process scheduling is not so nice as *BSD employs an algorithm >>>> that >>>> changes physical CPUs in turn instead of allocating one core for such kind >>>> of jobs. >>>> Take your own benchmark, and you'll see.. >>> >>> *Result* >>> Machine: Core i7 920 (42.56-44.8Gflops) / DDR3 1066 >>> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10 >>> GotoBLAS2: 1.13 >>> >>> dgemm result >>> OS : FLOPS : percent in peak >>> FreeBSD : 32.0 GFlops : 71% >>> Ubuntu : 42.0-42.7GFlops : 93.8%-95.3% >>> >>> Thanks, >>> -- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ >>> Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt >>> >>> >>> >>> >>> ___ >>> freebsd-stable@freebsd.org mailing list >>> http://lists.freebsd.org/mailman/listinfo/freebsd-stable >>> To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org" >>> >> ___ >> freebsd-stable@freebsd.org mailing list >> http://lists.freebsd.org/mailman/listinfo/freebsd-stable >> To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org" >> > ___ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to 
"freebsd-stable-unsubscr...@freebsd.org" > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi Bruce,

From: Bruce Simpson
Subject: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Date: Mon, 12 Apr 2010 10:49:14 +0100

> So, where's the profiling to discover why this is the case?

OK, I'll provide better documentation so that everyone can test it very
clearly (it may take some time).

> Also I'm not clear on what constitutes 'theoretical peak performance'
> here or how it is being calculated. So figures like these come across
> as unscientific.

The Core i7 920 (2.66 GHz) has four cores, and each core can execute four
floating-point operations per cycle. Thus:

  2.66 GHz x 4 x 4 = 42.56 Gflops

cf. http://www.intel.com/support/processors/sb/cs-023143.htm

> I'm sure this is something which can be resolved if someone sits down,
> profiles the app, and makes the necessary adjustments
> (e.g. pthread_setaffinity_np()) to configure CPU affinity, if the lack
> of it is pessimizing your friend's app.

Might be. We ran the tests on the same machine.

> The PMC framework is rapidly maturing, and you can use KCacheGrind
> with it to visualize context switch overhead.
>
> But I think it's expecting a bit much to post informal results to
> -stable, in an expectation of something other than informal suggestions
> of what may help someone's maths-intensive application.

BLAS is a basic linear algebra package which is used by many applications.
It is also used for the TOP500 list, http://www.top500.org/
(cf. http://www.top500.org/project/introduction), via LINPACK.
dgemm is a Level 3 BLAS routine, which makes a very good test for common
PCs because the calculation is CPU-intensive.

> If there are performance issues, then reproducible results are needed,
> as well as some basic profiling effort of the system elements
> involved, before people could say anything either way, or offer
> further help.

Again, I'll provide better documentation so that everyone can test it very
clearly (it may take some time).

thanks,
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
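The peak-performance arithmetic in this message can be checked with a few lines of Python. This is only a sketch of the calculation, not part of the benchmark itself: the function names are hypothetical, and 44.8 Gflops is simply the upper figure from the "Machine:" line of the original result table.

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak in Gflops: clock (GHz) x cores x flops per core per cycle."""
    return clock_ghz * cores * flops_per_cycle


def percent_of_peak(measured_gflops, peak):
    """Measured throughput as a percentage of a given peak figure."""
    return 100.0 * measured_gflops / peak


# Nominal peak for a 2.66 GHz part with 4 cores and 4 flops per cycle per core.
nominal = peak_gflops(2.66, 4, 4)
print(f"nominal peak: {nominal:.2f} Gflops")

# Measured dgemm figures from the result table, against both peak figures.
for name, measured in [("FreeBSD", 32.0), ("Ubuntu", 42.0), ("Ubuntu", 42.7)]:
    print(f"{name} {measured} Gflops: "
          f"{percent_of_peak(measured, 44.8):.1f}% of 44.8, "
          f"{percent_of_peak(measured, nominal):.1f}% of {nominal:.2f}")
```

Run this way, the percentages reported in the table (71%, 93.8%-95.3%) line up with the 44.8 Gflops figure rather than with the nominal 42.56 Gflops; against 42.56, the upper Ubuntu number would come out slightly above 100%.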
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Hi all,

There's a port, archivers/pbzip2, and I am inclined to believe this is a
good benchmark for multi-core performance in real-world usage (with an
appropriate input data set).

BZIP2 is a compression algorithm which is readily applicable to
multicore, because of the way in which its workload may be partitioned
amongst multiple CPU cores. It block-sorts, and it can compress long runs
of input data independently of other CPU threads.

When I used PBZIP2 informally back in January, before advising on
FreeBSD/Xen, I saw largely the results I'd expect to see from such a
workload, and didn't encounter pessimization of benchmark figures.
Informal tests were performed on 8-STABLE at that time.

The OP may well be looking for Newton-Raphson approximations to the
derivatives involved in his friend's linear algebra system. The point is
that PBZIP2 would also exercise context switches in a real-life workload.

I'd be concerned, as anyone else would be, about benchmarks which
apparently challenge FreeBSD's ability to tackle significant mathematical
workloads.

But from what little I understand, from speaking to David Schultz and
others who have been involved with FreeBSD's floating point performance,
on a scientific basis -- without a scientifically reproducible
experiment, I don't see a problem.

Obviously, I am concerned that Nakata-san observes what he regards to be
a problem, and would like to help any way I can.

cheers,
BMS
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Mon, Apr 12, 2010 at 05:41:35PM +0300, Andriy Gapon wrote:
> Perhaps, he talks about support of large pages (2M) and related
> improvements in TLB performance. If so, he (and you) may read about
> 'superpages' feature of FreeBSD.
> I am not sure if it is enabled by default in 8.0, you can check
> vm.pmap.pg_ps_enabled.

On 8.0-RELEASE and later, they are. Line 183:

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c?annotate=1.667.2.12

Commit where they got enabled by default (approx. 16 months ago):

http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/amd64/amd64/pmap.c#rev1.646

--
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 12/04/2010 17:41 Andriy Gapon said the following:
> It would also be good to learn more about your program.
> How much memory does it typically use, how does it allocate it?
> Is it single-threaded or not? If not, how many threads does it have and
> what do they do, how do they communicate?

Another question: which compilers (which versions of GCC) were used on
both systems to compile the program?

--
Andriy Gapon
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
on 12/04/2010 07:12 Maho NAKATA said the following:
> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.

Sorry about that, but a more important question (for us) is: are you
willing to help us improve in addition to reporting your results?

> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
> told me that FreeBSD is not a suitable OS for scientific computing or
> high performance computing. He says (in Japanese; my translation):
>
>> I guess FreeBSD does page coloring, but I don't think FreeBSD considers
>> the very large cache sizes that recent CPUs have.

AFAIK, recent FreeBSD doesn't use page coloring anymore.

>> Support for very large caches on Linux is still not very sophisticated,
>> but on the *BSDs it's the worst; they use too fine-grained a memory
>> allocation method, so we cannot expect large contiguous physical memory
>> allocations.

Can your friend provide more explanation about these points in technical
terms? E.g. what kind of support, in his opinion, is needed for very large
caches? Why, in his opinion, does the memory need to be physically
contiguous?

Perhaps he is talking about support for large pages (2M) and the related
improvements in TLB performance. If so, he (and you) may want to read about
the 'superpages' feature of FreeBSD.
I am not sure if it is enabled by default in 8.0; you can check
vm.pmap.pg_ps_enabled.

>> Moreover, process scheduling is not so nice, as *BSD employs an algorithm
>> that changes physical CPUs in turn instead of allocating one core for
>> this kind of job.
>> Take your own benchmark, and you'll see.

Here I can only add an anecdotal 'me too'.
Sometimes I run single-threaded high-CPU programs like ffmpeg transcoding
on an otherwise idle system (a bunch of system daemons in the background).
And I see that the CPU-consuming process frequently goes back and forth
between my two cores. CPU user loads on the cores are something like 60%
vs 40%. My expectation was that the process would mostly run on one core
while the rest of the threads would mostly run on the other.
I am not sure if that core switching really hurts performance and if there
is something wrong about it. But somehow it seems 'counter-intuitive'.

> *Result*
> Machine: Core i7 920 (42.56-44.8 Gflops) / DDR3 1066
> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
> GotoBLAS2: 1.13
>
> dgemm result
> OS      : FLOPS            : percent of peak
> FreeBSD : 32.0 Gflops      : 71%
> Ubuntu  : 42.0-42.7 Gflops : 93.8%-95.3%

It would also be good to learn more about your program.
How much memory does it typically use, how does it allocate it?
Is it single-threaded or not? If not, how many threads does it have and
what do they do, how do they communicate?

--
Andriy Gapon
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Antony Mawer writes:

> This may well be the same sort of issue that was discussed in this thread
> here:
>
> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html
>
> In short, the Core i7 CPUs have a feature called "TurboBoost" where
> the clock speed of one or more cores is boosted when other cores are
> idle and in a C2 or C3 sleep status ... if the appropriate power
> saving mode isn't active on the system (which I don't think FreeBSD
> does by default?), the idle cores are never put into the appropriate
> power saving state, and as a result TurboBoost never kicks in...
>
> It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
> not (out of the box)?
>
> Of course it may be something else entirely, but worth checking out...

Nakata-san's theoretical performance numbers assume 4 to 4.2 operations
per core per cycle at the nominal (2.66 GHz, non-TurboBoost) clock rate.
(DGEMM is double precision, but I am not familiar enough with scientific
computing or with the Nehalem implementation of SSE to know why it is
four operations per cycle rather than two -- is it because double
precision counts as two FLOPs or is it because of multiple issue?)

TurboBoost runs up to 2.93 GHz on this CPU, so it doesn't fit either the
theoretical peak performance or the performance discrepancy very well.

Michael Poole
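The "4 to 4.2 operations per core per cycle" reading can be reproduced from the two peak figures in the original post. The following Python sketch only restates that arithmetic; the helper names are hypothetical, and 2.93 GHz is the maximum TurboBoost clock quoted in the message above.

```python
CORES = 4
NOMINAL_GHZ = 2.66
TURBO_GHZ = 2.93  # maximum TurboBoost clock quoted in the message above


def flops_per_cycle(peak_gflops, clock_ghz, cores=CORES):
    """Operations per core per cycle implied by a peak figure at a given clock."""
    return peak_gflops / (clock_ghz * cores)


def implied_clock_ghz(peak_gflops, flops_per_core_cycle=4, cores=CORES):
    """Clock rate implied by a peak figure if each core does 4 flops per cycle."""
    return peak_gflops / (flops_per_core_cycle * cores)


for peak in (42.56, 44.8):
    print(f"{peak} Gflops: "
          f"{flops_per_cycle(peak, NOMINAL_GHZ):.2f} flops/cycle/core at {NOMINAL_GHZ} GHz, "
          f"or {implied_clock_ghz(peak):.2f} GHz at 4 flops/cycle/core")

# Peak if all four cores ran at the maximum TurboBoost clock.
print(f"peak at {TURBO_GHZ} GHz: {TURBO_GHZ * CORES * 4:.2f} Gflops")
```

Read the other way round, the 44.8 Gflops figure corresponds to a clock of about 2.8 GHz at four operations per cycle, which sits between the nominal clock and the 2.93 GHz maximum.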
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
This may well be the same sort of issue that was discussed in this thread
here:

http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html

In short, the Core i7 CPUs have a feature called "TurboBoost" where
the clock speed of one or more cores is boosted when other cores are
idle and in a C2 or C3 sleep state ... if the appropriate power
saving mode isn't active on the system (which I don't think FreeBSD
does by default?), the idle cores are never put into the appropriate
power saving state, and as a result TurboBoost never kicks in...

It _may_ be that Ubuntu configures this correctly whereas FreeBSD does
not (out of the box)?

Of course it may be something else entirely, but worth checking out...

--Antony

On Mon, Apr 12, 2010 at 7:31 PM, Adrian Chadd wrote:
> Of course, what would be helpful is actually figuring out what is
> going on rather than some conjecture. :)
>
> With what he said, tweaking memory allocation under FreeBSD and/or
> Linux would change the performance characteristics and either validate
> or disprove his assumptions?
>
>
> Adrian
>
> On 12 April 2010 12:12, Maho NAKATA wrote:
>> Hi FreeBSD developers,
>> [the original article in Japanese can be found at
>> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>>
>> *Abstract*
>> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
>> using dgemm (a linear algebra routine, matrix-matrix multiplication).
>> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
>> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.
>>
>> *Introduction*
>> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
>> told me that FreeBSD is not a suitable OS for scientific computing or
>> high performance computing. He says (in Japanese; my translation):
>>
>>> I guess FreeBSD does page coloring, but I don't think FreeBSD considers
>>> the very large cache sizes that recent CPUs have. Support for very large
>>> caches on Linux is still not very sophisticated, but on the *BSDs it's
>>> the worst; they use too fine-grained a memory allocation method, so we
>>> cannot expect large contiguous physical memory allocations.
>>> Moreover, process scheduling is not so nice, as *BSD employs an
>>> algorithm that changes physical CPUs in turn instead of allocating one
>>> core for this kind of job.
>>> Take your own benchmark, and you'll see.
>>
>> *Result*
>> Machine: Core i7 920 (42.56-44.8 Gflops) / DDR3 1066
>> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
>> GotoBLAS2: 1.13
>>
>> dgemm result
>> OS      : FLOPS            : percent of peak
>> FreeBSD : 32.0 Gflops      : 71%
>> Ubuntu  : 42.0-42.7 Gflops : 93.8%-95.3%
>>
>> Thanks,
>> -- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
>> Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
Of course, what would be helpful is actually figuring out what is
going on rather than some conjecture. :)

With what he said, tweaking memory allocation under FreeBSD and/or
Linux would change the performance characteristics and either validate
or disprove his assumptions?


Adrian

On 12 April 2010 12:12, Maho NAKATA wrote:
> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.
>
> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
> told me that FreeBSD is not a suitable OS for scientific computing or
> high performance computing. He says (in Japanese; my translation):
>
>> I guess FreeBSD does page coloring, but I don't think FreeBSD considers
>> the very large cache sizes that recent CPUs have. Support for very large
>> caches on Linux is still not very sophisticated, but on the *BSDs it's
>> the worst; they use too fine-grained a memory allocation method, so we
>> cannot expect large contiguous physical memory allocations.
>> Moreover, process scheduling is not so nice, as *BSD employs an
>> algorithm that changes physical CPUs in turn instead of allocating one
>> core for this kind of job.
>> Take your own benchmark, and you'll see.
>
> *Result*
> Machine: Core i7 920 (42.56-44.8 Gflops) / DDR3 1066
> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
> GotoBLAS2: 1.13
>
> dgemm result
> OS      : FLOPS            : percent of peak
> FreeBSD : 32.0 Gflops      : 71%
> Ubuntu  : 42.0-42.7 Gflops : 93.8%-95.3%
>
> Thanks,
> -- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/
> Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On 04/12/10 05:12, Maho NAKATA wrote:
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.

So, where's the profiling to discover why this is the case?

Also I'm not clear on what constitutes 'theoretical peak performance'
here or how it is being calculated. So figures like these come across
as unscientific.

I'm sure this is something which can be resolved if someone sits down,
profiles the app, and makes the necessary adjustments
(e.g. pthread_setaffinity_np()) to configure CPU affinity, if the lack
of it is pessimizing your friend's app.

The PMC framework is rapidly maturing, and you can use KCacheGrind
with it to visualize context switch overhead.

But I think it's expecting a bit much to post informal results to
-stable, in an expectation of something other than informal suggestions
of what may help someone's maths-intensive application.

If there are performance issues, then reproducible results are needed,
as well as some basic profiling effort of the system elements
involved, before people could say anything either way, or offer
further help.

cheers,
BMS
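The CPU-affinity idea mentioned above can be illustrated with a short sketch. Python is used here instead of C, which is an assumption of this example and not how GotoBLAS or the benchmark is written. `os.sched_setaffinity` and `os.sched_getaffinity` are real standard-library calls, but Python only exposes them on platforms that provide the underlying system call (Linux, for example), so the sketch returns `None` elsewhere. In a C program on FreeBSD the corresponding route would be the pthread_setaffinity_np() call named in the message, or the cpuset(1) utility from the command line. The function name `pin_to_first_allowed_cpu` is hypothetical.

```python
import os


def pin_to_first_allowed_cpu():
    """Try to restrict the calling process to a single CPU.

    Returns the resulting affinity set, or None if this platform does
    not expose sched_setaffinity through Python's os module.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Pick the lowest-numbered CPU we are currently allowed to run on,
    # so the call does not fail inside a restricted CPU set.
    cpu = min(os.sched_getaffinity(0))
    os.sched_setaffinity(0, {cpu})  # pid 0 means the calling process
    return os.sched_getaffinity(0)


result = pin_to_first_allowed_cpu()
if result is None:
    print("CPU affinity is not available through os on this platform")
else:
    print(f"now restricted to CPU(s): {sorted(result)}")
```

Pinning a whole process to one CPU suits a single-threaded job; a multi-threaded dgemm run would instead need one thread pinned per core, which is what the per-thread pthread call is for.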
Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920
On Sun, Apr 11, 2010 at 9:12 PM, Maho NAKATA wrote:
> Hi FreeBSD developers,
> [the original article in Japanese can be found at
> http://blog.goo.ne.jp/nakatamaho/e/b5f6fbc3cc6e1ac4947463eb1ca4eb0a ]
>
> *Abstract*
> I compared the peak performance of FreeBSD 8.0/amd64 and Ubuntu 9.10 amd64
> using dgemm (a linear algebra routine, matrix-matrix multiplication).
> I obtained only 70% of theoretical peak performance on FreeBSD 8/amd64 and
> almost 95% on Ubuntu 9.10/amd64. I'm really disappointed.
>
> *Introduction*
> I'm a friend of Gotoh Kazushige, the principal developer of GotoBLAS. He
> told me that FreeBSD is not a suitable OS for scientific computing or
> high performance computing. He says (in Japanese; my translation):
>
>> I guess FreeBSD does page coloring, but I don't think FreeBSD considers
>> the very large cache sizes that recent CPUs have. Support for very large
>> caches on Linux is still not very sophisticated, but on the *BSDs it's
>> the worst; they use too fine-grained a memory allocation method, so we
>> cannot expect large contiguous physical memory allocations.
>> Moreover, process scheduling is not so nice, as *BSD employs an
>> algorithm that changes physical CPUs in turn instead of allocating one
>> core for this kind of job.
>> Take your own benchmark, and you'll see.
>
> *Result*
> Machine: Core i7 920 (42.56-44.8 Gflops) / DDR3 1066
> OS: FreeBSD 8.0/amd64 and Ubuntu 9.10
> GotoBLAS2: 1.13
>
> dgemm result
> OS      : FLOPS            : percent of peak
> FreeBSD : 32.0 Gflops      : 71%
> Ubuntu  : 42.0-42.7 Gflops : 93.8%-95.3%

I'm not sure if this is the exact issue, but it might be a point of
reference worth investigating:
http://lists.freebsd.org/pipermail/freebsd-hackers/2010-March/031004.html

Thanks,
-Garrett