Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-15 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 9:21 PM, Ian Smith smi...@nimnet.asn.au wrote:
 On Wed, 14 Apr 2010, Garrett Cooper wrote:
   On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper yanef...@gmail.com wrote:
    On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA cha...@mac.com wrote:
    Hi Andriy and Adam
   
    Here are my tests again, with no desktop running; I ran only dgemm.
    Contrary to Adam's result, Hyper Threading makes the performance worse.
    All tests were done on a Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
   
    Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
    Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

 Er, shouldn't one of those say HTT on?  and/or Turbo boost on?  Else
 they're both the same test as [4] but with different results?

There's a problem with how the kernel reports cores on 8.x and later. For
some odd reason, more recent Intel processors aren't reporting themselves as
HT-enabled when they have HT cores (see: kern/145385).

I didn't look into the issue too hard, but since it does seem to be a
major performance loss perhaps I should; besides, it would be good
experience to put under my belt :].

    Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
    Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

 Clarification of all four possible test configs - 8 if you add pinning
 CPUs or not - might make this a bit clearer?

    Doesn't this make sense? Hyperthreaded cores in Intel procs still
    provide an incomplete set of registers as they're logical processors,
    so I would expect for things to be slower if they're automatically run
    on the SMT cores instead of the physical ones.

 Since we're talking FP, do HTT 'cores' share an FPU, or have their own?
 If contended, you'd have to expect worse (at least FP) performance, no?

   Ah, that's another excellent point. Which instructions does dgemm
use: pure integer arithmetic, floating-point arithmetic, specialized
operations that would benefit from SIMD, etc.?

    Is there a weighting scheme in SCHED_ULE where logical processors
    (like the SMT variety) get a lower score than real processors do, and
    thus get scheduled for less intensive interrupting tasks, or maybe
    just don't get scheduled in high-use scenarios as they would if they
    were physical processors?
  
   Err... wait. I didn't see until just a second ago that the turbo boost
   results didn't scale linearly or align with one another. Never mind my
   previous comment.

 Waiting for the fog to lift ..

As am I. I don't know enough in this area, but I'm definitely open
to learning.

Thanks,
-Garrett
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: Linux static linked ver doesn't work on FBSD (Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-15 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 10:26 PM, Maho NAKATA cha...@mac.com wrote:
 From: Pieter de Goeje pie...@degoeje.nl
 Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance 
 on FreeBSD 8/amd64, Corei7 920
 Date: Wed, 14 Apr 2010 16:05:18 +0200

 I think the best test would be to run a statically compiled linux binary on
 FreeBSD. That way the compiler settings are exactly the same.

 A Linux amd64 binary cannot run on FreeBSD amd64, and the i386 version
 does not work either. GotoBLAS uses special system calls.

 % ./dgemm
 linux_sys_futex: unknown op 265
 linux: pid 1264 (dgemm): syscall mbind not implemented
 n: 3000
 ^C
 It just hangs.

Yes, and while this isn't directly tied to NUMA, mbind(2),
mempolicy(2), and a few others use the same facilities that are
available via plain NUMA. I know because of messes I've tried to clean
up in these areas. To be honest, though, I'm really not sure why this
is using NUMA...
Thanks,
-Garrett


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-15 Thread Adrian Chadd
May I make a suggestion?

Would you mind creating a shared google spreadsheet with your testing
results and a shared google document with the test setup?

I think having the data in an easily represented, easily shared medium
would be beneficial to everyone.


Adrian

On 15 April 2010 08:46, Maho NAKATA cha...@mac.com wrote:
 Hi Andriy and Adam

 Here are my tests again, with no desktop running; I ran only dgemm.
 Contrary to Adam's result, Hyper Threading makes the performance worse.
 All tests were done on a Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)

 Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
 Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

 Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
 Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

 ---my system---
 CPU: Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (2683.44-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x106a5  Stepping = 5
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x98e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
 real memory  = 12884901888 (12288 MB)
 avail memory = 12387717120 (11813 MB)
 ACPI APIC Table: <110909 APIC1026>
 FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 FreeBSD/SMP: 1 package(s) x 4 core(s)
 ---my system---

 ---DETAILS---
 [1]
 % ./dgemm
 n: 3000
 time : 57.666717 or 16.339074
 Mflops : 33060.624827
 n: 3100
 time : 61.502677 or 16.597376
 Mflops : 35910.025544
 n: 3200
 time : 69.075401 or 19.199833
 Mflops : 34144.297133
 n: 3300
 time : 73.699540 or 19.633594
 Mflops : 36618.756539
 n: 3400
 time : 82.256194 or 22.373651
 Mflops : 35144.518837
 n: 3500
 time : 88.975662 or 24.118761
 Mflops : 35563.394249
 n: 3600
 time : 96.436652 or 26.027588
 Mflops : 35861.148385
 n: 3700
 [2]
 % ./dgemm
 n: 3000
 time : 139.622739 or 17.693806
 Mflops : 30529.327312
 n: 3100
 time : 154.344971 or 19.566886
 Mflops : 30460.247702
 n: 3200
 time : 169.507739 or 21.467100
 Mflops : 30538.116602
 n: 3300
 time : 186.363773 or 23.615281
 Mflops : 30444.600545
 n: 3400
 time : 203.798979 or 25.817667
 Mflops : 30456.322788
 n: 3500
 ...
 [3]
 % ./dgemm
 n: 3000
 time : 134.673079 or 16.958682
 Mflops : 31852.711082
 n: 3100
 time : 148.410085 or 18.663248
 Mflops : 31935.073574
 n: 3200
 time : 162.835473 or 20.468825
 Mflops : 32027.475770
 n: 3300
 time : 179.025370 or 22.479189
 Mflops : 31983.262501
 n: 3400
 time : 195.859710 or 24.663009
 Mflops : 31882.208788
 n: 3500
 [4]
 % ./dgemm
 n: 3000
 time : 54.259647 or 14.684309
 Mflops : 36786.204907
 n: 3100
 time : 60.899147 or 17.124599
 Mflops : 34804.447141
 n: 3200
 time : 64.295342 or 17.490787
 Mflops : 37480.577569
 n: 3300
 time : 69.781247 or 18.288840
 Mflops : 39311.284796
 n: 3400
 time : 79.234397 or 21.829736
 Mflops : 36020.187858
 n: 3500
 time : 83.905419 or 22.381237
 Mflops : 38324.289174
 n: 3600
 time : 92.195022 or 25.105942
 Mflops : 37177.621122
 n: 3700
 time : 97.718841 or 25.434243
 Mflops : 39841.319494
 n: 3800
 time : 105.740463 or 27.414029
 Mflops : 40042.592613
 n: 3900
 time : 113.980157 or 29.678505
 Mflops : 39984.635420
 n: 4000
 time : 122.941569 or 31.946174
 Mflops : 40077.412531
 n: 4100
 ---DETAILS---


 From: Adam Vande More amvandem...@gmail.com
 Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance 
 on FreeBSD 8/amd64, Corei7 920
 Date: Wed, 14 Apr 2010 11:34:45 -0500

  time : 162.45 or 20.430651
  Mflops : 32087.318295
  n: 3300
  time : 178.497079 or 22.446093
  Mflops : 32030.420499
  n: 3400
  time : 195.550715 or 24.586152
  Mflops : 31981.873273
  n: 3500
  time : 213.403379 or 26.825058
  Mflops : 31975.513363
  n: 3600
  ...
  above output is on Core i7 920 (2.66GHz; TurboBoost on)

 My results:
 $ ./dgemm
 n: 3000
 time : 54.151302 or 28.189781
 Mflops : 19162.263125
 n: 3100
 time : 60.157449 or 32.214141
 Mflops : 18501.570537
 n: 3200
 time : 65.753191 or 34.114872
 Mflops : 19216.393378

 CPU:
 CPU: Intel(R) Core(TM)2 Duo CPU     E7300  @ 2.66GHz (2653.35-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x10676  Stepping = 6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e39d<SSE3,DTES64,MON,DS_CPL,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
 ⋮
 FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 FreeBSD/SMP: 1 package(s) x 2 core(s)

 FreeBSD:
 FreeBSD 8.0-STABLE r205070 amd64

 Please note that the system was not dedicated to the test, I had
 Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons
 running.
 That probably explains irregularities in the results.

 I am not sure how exactly theoretical maximum should

Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-15 Thread Andriy Gapon
on 14/04/2010 20:47 Adam Vande More said the following:
 I'm no expert Andriy, but it seems like if gotoblas
 implemented some of the FreeBSD optimizations then we'd be in the same
 ballpark.

This is a good point.
But on the other hand, it means that our scheduler doesn't do a perfect job
here.  BTW, I use ULE.
My observation is that when the number of CPU-intensive, long-running processes
is less than or equal to the number of cores, the processes tend to stay on the
same cores for a long time.
But if the number of processes is greater, they seem to jump from core to core
a lot.
But I am not sure what the optimal strategy would be for that case.  If we try
to keep some lucky processes on the same core, then CPU time might be shared
unfairly.  Shuffling cores provides more fairness, but can hurt total
performance.

-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-15 Thread Adam Vande More
On Thu, Apr 15, 2010 at 3:54 AM, Andriy Gapon a...@freebsd.org wrote:

 This is a good point.
 But on the other hand, it means that our scheduler doesn't do a perfect job
 here.  BTW, I use ULE.
 My observation is that when the number of CPU-intensive, long-running
 processes is less than or equal to the number of cores, the processes tend
 to stay on the same cores for a long time.
 But if the number of processes is greater, they seem to jump from core to
 core a lot.
 But I am not sure what the optimal strategy would be for that case.  If we
 try to keep some lucky processes on the same core, then CPU time might be
 shared unfairly.  Shuffling cores provides more fairness, but can hurt total
 performance.


Is it possible to add a tunable to the scheduler for its aggressiveness in
switching cores?

-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-15 Thread Andriy Gapon
on 15/04/2010 16:23 Adam Vande More said the following:
 Is it possible to add a tunable to the scheduler for its aggressiveness
 in switching cores?

No idea; not a scheduler person.

-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 14/04/2010 02:21 Maho NAKATA said the following:
 2. install ports/math/gotoblas (manual download required)
  make install 


Do you know how gotoblas on Linux was obtained?
Was it built from source?
Has it come pre-packaged?
If so, can you find out details of its build configuration?

Thanks!
-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Pieter de Goeje
On Wednesday 14 April 2010 15:19:13 Andriy Gapon wrote:
 on 14/04/2010 02:21 Maho NAKATA said the following:
  2. install ports/math/gotoblas (manual download required)
   make install
 
 Do you know how gotoblas on Linux was obtained?
 Was it built from source?
 Has it come pre-packaged?
 If so, can you find out details of its build configuration?
 
 Thanks!

I think the best test would be to run a statically compiled linux binary on 
FreeBSD. That way the compiler settings are exactly the same.

- Pieter


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 14/04/2010 02:21 Maho NAKATA said the following:
 4. run dgemm. 
 % ./dgemm
 n: 3000
 time : 134.648208 or 16.910525 
 Mflops : 31943.419695
 n: 3100
 time : 148.122279 or 18.615284 
 Mflops : 32017.357408
 n: 3200
 time : 162.45 or 20.430651 
 Mflops : 32087.318295
 n: 3300
 time : 178.497079 or 22.446093 
 Mflops : 32030.420499
 n: 3400
 time : 195.550715 or 24.586152 
 Mflops : 31981.873273
 n: 3500
 time : 213.403379 or 26.825058 
 Mflops : 31975.513363
 n: 3600
 ...
 above output is on Core i7 920 (2.66GHz; TurboBoost on)

My results:
$ ./dgemm
n: 3000
time : 54.151302 or 28.189781
Mflops : 19162.263125
n: 3100
time : 60.157449 or 32.214141
Mflops : 18501.570537
n: 3200
time : 65.753191 or 34.114872
Mflops : 19216.393378

CPU:
CPU: Intel(R) Core(TM)2 Duo CPU E7300  @ 2.66GHz (2653.35-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x10676  Stepping = 6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e39d<SSE3,DTES64,MON,DS_CPL,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
⋮
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
FreeBSD/SMP: 1 package(s) x 2 core(s)

FreeBSD:
FreeBSD 8.0-STABLE r205070 amd64

Please note that the system was not dedicated to the test, I had
Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons running.
That probably explains irregularities in the results.

I am not sure exactly how the theoretical maximum should be calculated; I used
2 cores * 2.66G * 4 ≈ 21.3G.
And so 19.2G / 21.3G ≈ 90%.
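
The estimate above can be reproduced with a short Python sketch. It assumes, as the message does, a factor of 4 floating-point operations per cycle per core; the function name `peak_gflops` is made up for illustration and is not part of any benchmark.

```python
def peak_gflops(cores, ghz, flops_per_cycle=4):
    """Theoretical peak in GFlops: cores * clock in GHz * flops per cycle."""
    return cores * ghz * flops_per_cycle


peak = peak_gflops(cores=2, ghz=2.66)   # Core 2 Duo E7300 from the message
measured = 19.2                         # GFlops, best dgemm run quoted above
efficiency = 100 * measured / peak

print(f"peak {peak:.1f} GFlops, efficiency {efficiency:.0f}%")
```

The factor of 4 is the assumption that matters most here; a different per-cycle figure changes every percentage in the thread proportionally.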

Not as bad as what you get.
Although not as good as what you report for Linux.
But given the impurity and imprecision of my test…

P.S. the machine is two-core obviously :-)
Don't have anything with more cpus/cores handy.

P.P.S. Having _only glanced_ at the source, I think that there are some things
that GotoBLAS tries to do on Linux but doesn't try to do on FreeBSD,
like setting CPU affinity for the threads or avoiding HTT pseudo-cores.
Those things are possible on FreeBSD.
Perhaps there are more things like that.

-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Adam Vande More
On Wed, Apr 14, 2010 at 10:26 AM, Andriy Gapon a...@freebsd.org wrote:

 on 14/04/2010 02:21 Maho NAKATA said the following:
  4. run dgemm.
  % ./dgemm
  n: 3000
  time : 134.648208 or 16.910525
  Mflops : 31943.419695
  n: 3100
  time : 148.122279 or 18.615284
  Mflops : 32017.357408
  n: 3200
  time : 162.45 or 20.430651
  Mflops : 32087.318295
  n: 3300
  time : 178.497079 or 22.446093
  Mflops : 32030.420499
  n: 3400
  time : 195.550715 or 24.586152
  Mflops : 31981.873273
  n: 3500
  time : 213.403379 or 26.825058
  Mflops : 31975.513363
  n: 3600
  ...
  above output is on Core i7 920 (2.66GHz; TurboBoost on)

 My results:
 $ ./dgemm
 n: 3000
 time : 54.151302 or 28.189781
 Mflops : 19162.263125
 n: 3100
 time : 60.157449 or 32.214141
 Mflops : 18501.570537
 n: 3200
 time : 65.753191 or 34.114872
 Mflops : 19216.393378

 CPU:
 CPU: Intel(R) Core(TM)2 Duo CPU E7300  @ 2.66GHz (2653.35-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x10676  Stepping = 6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e39d<SSE3,DTES64,MON,DS_CPL,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
 ⋮
 FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 FreeBSD/SMP: 1 package(s) x 2 core(s)

 FreeBSD:
 FreeBSD 8.0-STABLE r205070 amd64

 Please note that the system was not dedicated to the test, I had
 Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons
 running.
 That probably explains irregularities in the results.

 I am not sure exactly how the theoretical maximum should be calculated; I used
 2 cores * 2.66G * 4 ≈ 21.3G.
 And so 19.2G / 21.3G ≈ 90%.

 Not as bad as what you get.
 Although not as good as what you report for Linux.
 But given the impurity and imprecision of my test…

 P.S. the machine is two-core obviously :-)
 Don't have anything with more cpus/cores handy.

 P.P.S. Having _only glanced_ at the source, I think that there are some things
 that GotoBLAS tries to do on Linux but doesn't try to do on FreeBSD,
 like setting CPU affinity for the threads or avoiding HTT pseudo-cores.
 Those things are possible on FreeBSD.
 Perhaps there are more things like that.


Mine is also a live desktop environment, KDE4+

n: 3000
time : 116.377609 or 16.696066
Mflops : 32353.729042
n: 3100
time : 127.230336 or 17.274867
Mflops : 34501.695325
n: 3200
time : 139.018175 or 18.342056
Mflops : 35741.074976
n: 3300
time : 152.519365 or 20.154714
Mflops : 35671.942364
n: 3400
time : 166.248145 or 21.952426
Mflops : 35818.874941
n: 3500
time : 182.565385 or 24.492597
Mflops : 35020.581786
n: 3600
time : 198.551018 or 26.906992
Mflops : 34689.094992
n: 3700
time : 215.428919 or 28.574964
Mflops : 35462.294838
n: 3800
^C

CPU: Intel(R) Core(TM) i7 CPU 870  @ 2.93GHz (3313.71-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x106e5  Family = 6  Model = 1e  Stepping = 5
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x98e3fd<SSE3,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant

That's about 67% utilization; turning off HTT drops it further.  HTT on the
newer cores is good, not bad.





-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Adam Vande More
On Wed, Apr 14, 2010 at 11:34 AM, Adam Vande More amvandem...@gmail.com wrote:




 That's about 67% utilization; turning off HTT drops it further.  HTT on the
 newer cores is good, not bad.


Well, that was completely contrary to some tests I'd run when I first got
the CPU.

With HTT off:

n: 3000
time : 44.705516 or 11.760183
Mflops : 45932.959253
n: 3100
time : 50.598581 or 14.270123
Mflops : 41766.437458
n: 3200
time : 55.748192 or 15.780977
Mflops : 41541.458400
n: 3300
time : 62.072217 or 17.441431
Mflops : 41221.262070
n: 3400

So that's about 79% right there.

Also, if I run dgemm under cpuset, the utilization is basically at the
theoretical max for one core, so at least that part is working.
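
The message does not say which clock the 67% and 79% figures are based on. A minimal Python sketch can check both candidates, the nominal 2.93 GHz and the 3313.71 MHz reported in the boot message, assuming 4 cores and the same factor of 4 flops per cycle used earlier in the thread. The helper name `efficiency_pct` is hypothetical.

```python
CORES = 4
FLOPS_PER_CYCLE = 4   # assumption, same factor as the earlier peak estimate


def efficiency_pct(mflops, ghz):
    """Measured Mflops as a percentage of the theoretical peak at `ghz`."""
    peak_mflops = CORES * ghz * FLOPS_PER_CYCLE * 1000
    return 100 * mflops / peak_mflops


htt_on = 35462.294838    # Mflops, n = 3700 run with HTT on
htt_off = 41766.437458   # Mflops, n = 3100 run with HTT off

for label, ghz in (("nominal 2.93 GHz", 2.93), ("reported 3313.71 MHz", 3.31371)):
    print(label, round(efficiency_pct(htt_on, ghz)), round(efficiency_pct(htt_off, ghz)))
```

Under these assumptions only the reported 3313.71 MHz clock reproduces both quoted percentages, which suggests (but does not prove) that the figures were computed against the turbo clock.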

-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 14/04/2010 19:45 Adam Vande More said the following:
 
 also if I run cpuset on the dgemm then the utilization is basically at
 the theoretical max for one core so at least that part is working.

You can also try procstat -t pid to find out thread IDs and cpuset -t to pin
the threads to the cores.

-- 
Andriy Gapon


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Adam Vande More
On Wed, Apr 14, 2010 at 11:51 AM, Andriy Gapon a...@freebsd.org wrote:

 on 14/04/2010 19:45 Adam Vande More said the following:
 
  also if I run cpuset on the dgemm then the utilization is basically at
  the theoretical max for one core so at least that part is working.

 You can also try procstat -t pid to find out thread IDs and cpuset -t to
 pin the threads to the cores.


It gets to around 90% doing that.

time : 103.617271 or 27.140992
Mflops : 47172.925449
n: 4100
time : 113.910669 or 30.520677
Mflops : 45174.496186
n: 4200
time : 121.880695 or 32.068070
Mflops : 46217.711013
n: 4300

I tried a couple of different thread orders, but it didn't seem to make a
difference.

galacticdominator% procstat -t 1922
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1922 100092 dgemm            initial thread     0  190 run     -
 1922 100268 dgemm            -                  1  190 run     -
 1922 100270 dgemm            -                  1  191 run     -
 1922 100272 dgemm            -                  3  190 run     -
 1922 100273 dgemm            -                  2  191 run     -
 1922 100274 dgemm            -                  2  191 run     -
 1922 100282 dgemm            -                  0  190 run     -
 1922 100283 dgemm            -                  3  190 run     -

galacticdominator% cpuset -t 100092 -l 0
galacticdominator% cpuset -t 100268 -l 1
galacticdominator% cpuset -t 100270 -l 2
galacticdominator% cpuset -t 100272 -l 3
galacticdominator% cpuset -t 100273 -l 0
galacticdominator% cpuset -t 100274 -l 1
galacticdominator% cpuset -t 100282 -l 2
galacticdominator% cpuset -t 100283 -l 3


galacticdominator% cpuset -t 100092 -l 0
galacticdominator% cpuset -t 100268 -l 0
galacticdominator% cpuset -t 100270 -l 1
galacticdominator% cpuset -t 100272 -l 1
galacticdominator% cpuset -t 100273 -l 2
galacticdominator% cpuset -t 100274 -l 2
galacticdominator% cpuset -t 100282 -l 3
galacticdominator% cpuset -t 100283 -l 3
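
The two pinning orders above follow simple patterns: round-robin across CPUs in the first set, consecutive threads sharing a CPU in the second. As a sketch, the command lines could be generated like this in Python; `pin_commands` is a hypothetical helper that only builds the strings and does not run `cpuset`.

```python
def pin_commands(tids, ncpus, paired=False):
    """Build one cpuset command line per thread ID.

    paired=False: round-robin (CPU 0, 1, 2, 3, 0, 1, ...), like the first set.
    paired=True:  consecutive threads share a CPU (0, 0, 1, 1, ...), like the second.
    """
    per_cpu = -(-len(tids) // ncpus)   # ceiling division: threads per CPU
    commands = []
    for i, tid in enumerate(tids):
        cpu = i // per_cpu if paired else i % ncpus
        commands.append(f"cpuset -t {tid} -l {cpu}")
    return commands


# Thread IDs copied from the procstat output above.
tids = [100092, 100268, 100270, 100272, 100273, 100274, 100282, 100283]

for line in pin_commands(tids, ncpus=4, paired=True):
    print(line)
```

Which logical CPU numbers share a physical core depends on the machine, so the mapping used here is only the one shown in the message.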


This is from the second set:

time : 150.348850 or 40.488350
Mflops : 45022.951141
n: 4600
time : 161.968982 or 43.589618
Mflops : 44669.884500
n: 4700

Since this is a full-fledged desktop environment, 90% utilization seems
pretty good.  I'm no expert, Andriy, but it seems like if gotoblas
implemented some of the FreeBSD optimizations then we'd be in the same
ballpark.


-- 
Adam Vande More


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
From: Andriy Gapon a...@freebsd.org
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 16:19:13 +0300

 on 14/04/2010 02:21 Maho NAKATA said the following:
 2. install ports/math/gotoblas (manual download required)
  make install 
 
 
 Do you know how gotoblas on Linux was obtained?
Yes. Just download the archive.

 Was it built from source?
Yes.
 Has it come pre-packaged?
No.

 If so, can you find out details of its build configuration?

I'm not sure.
I built it as follows on Ubuntu 9.10 amd64.

$ tar xvfz GotoBLAS2-1.13.tar.gz
$ cd GotoBLAS2
$ ./quickbuild.64bit
ln -fs libgoto2_nehalemp-r1.13.a libgoto2.a
for d in interface driver/level2 driver/level3 driver/others kernel  lapack ; \
do if test -d $d; then \
  make -j 8 -C $d libs || exit 1 ; \
fi; \
done
make[1]: Entering directory `/home/maho/a/GotoBLAS2/interface'
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=saxpy -DASMFNAME=saxpy_ -DNAME=saxpy_ -DCNAME=saxpy -DCHAR_NAME=\"saxpy_\" -DCHAR_CNAME=\"saxpy\" -I.. -I. -UDOUBLE  -UCOMPLEX -c axpy.c -o saxpy.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=sswap -DASMFNAME=sswap_ -DNAME=sswap_ -DCNAME=sswap -DCHAR_NAME=\"sswap_\" -DCHAR_CNAME=\"sswap\" -I.. -I. -UDOUBLE  -UCOMPLEX -c swap.c -o sswap.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=scopy -DASMFNAME=scopy_ -DNAME=scopy_ -DCNAME=scopy -DCHAR_NAME=\"scopy_\" -DCHAR_CNAME=\"scopy\" -I.. -I. -UDOUBLE  -UCOMPLEX -c copy.c -o scopy.o
gcc -O2 -DEXPRECISION -m128bit-long-double -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DMAX_CPU_NUMBER=8 -DASMNAME=sscal -DASMFNAME=sscal_ -DNAME=sscal_ -DCNAME=sscal -DCHAR_NAME=\"sscal_\" -DCHAR_CNAME=\"sscal\" -I.. -I. -UDOUBLE  -UCOMPLEX -c scal.c -o sscal.o


-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Maho NAKATA
Hi Andriy and Adam

Here are my tests again, with no desktop running; I ran only dgemm.
Contrary to Adam's result, Hyper Threading makes the performance worse.
All tests were done on a Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)

Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]
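
Three of the four summary lines carry the same label, so the numbers themselves are the only way to cross-check which configuration each run used. The Python sketch below rests on two assumptions that the message does not state: that the two values on each `time` line are total CPU time and wall-clock time (so their ratio approximates the number of busy threads), and that peak is 4 cores times clock times 4 flops per cycle. The name `infer` is hypothetical, and for run [4] the lower ends of the quoted ranges are used.

```python
BASE_PEAK = 4 * 2.67 * 4    # GFlops at 2.67 GHz, Turbo Boost off
TURBO_PEAK = 4 * 2.8 * 4    # GFlops at the quoted 2.8 GHz Turbo Boost clock

# label: (GFlops, percent of peak, first time value, second time value at n = 3000)
runs = {
    "[1]": (35.0, 82, 57.666717, 16.339074),
    "[2]": (30.5, 72, 139.622739, 17.693806),
    "[3]": (32.0, 71, 134.673079, 16.958682),
    "[4]": (38.0, 84, 54.259647, 14.684309),
}


def infer(gflops, pct, cpu_time, wall_time):
    """Return (turbo_boost_likely_on, approximate_busy_threads)."""
    implied_peak = gflops / (pct / 100)
    turbo = abs(implied_peak - TURBO_PEAK) < abs(implied_peak - BASE_PEAK)
    busy_threads = round(cpu_time / wall_time)
    return turbo, busy_threads


for label, run in runs.items():
    turbo, threads = infer(*run)
    print(label, "Turbo Boost", "on" if turbo else "off", "| about", threads, "busy threads")
```

On these assumptions, [1] and [4] look like 4-thread runs and [2] and [3] like 8-thread runs, while only [3] and [4] match the Turbo Boost peak. This is an inference from the printed figures, not a statement from the author.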

---my system---
CPU: Intel(R) Core(TM) i7 CPU 920  @ 2.67GHz (2683.44-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x106a5  Stepping = 5
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x98e3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1,SSE4.2,POPCNT>
  AMD Features=0x28100800<SYSCALL,NX,RDTSCP,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
real memory  = 12884901888 (12288 MB)
avail memory = 12387717120 (11813 MB)
ACPI APIC Table: <110909 APIC1026>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
FreeBSD/SMP: 1 package(s) x 4 core(s)
---my system---

---DETAILS---
[1]
% ./dgemm
n: 3000
time : 57.666717 or 16.339074 
Mflops : 33060.624827
n: 3100
time : 61.502677 or 16.597376 
Mflops : 35910.025544
n: 3200
time : 69.075401 or 19.199833 
Mflops : 34144.297133
n: 3300
time : 73.699540 or 19.633594 
Mflops : 36618.756539
n: 3400
time : 82.256194 or 22.373651 
Mflops : 35144.518837
n: 3500
time : 88.975662 or 24.118761 
Mflops : 35563.394249
n: 3600
time : 96.436652 or 26.027588 
Mflops : 35861.148385
n: 3700
[2]
% ./dgemm
n: 3000
time : 139.622739 or 17.693806 
Mflops : 30529.327312
n: 3100
time : 154.344971 or 19.566886 
Mflops : 30460.247702
n: 3200
time : 169.507739 or 21.467100 
Mflops : 30538.116602
n: 3300
time : 186.363773 or 23.615281 
Mflops : 30444.600545
n: 3400
time : 203.798979 or 25.817667 
Mflops : 30456.322788
n: 3500
...
[3]
% ./dgemm
n: 3000
time : 134.673079 or 16.958682 
Mflops : 31852.711082
n: 3100
time : 148.410085 or 18.663248 
Mflops : 31935.073574
n: 3200
time : 162.835473 or 20.468825 
Mflops : 32027.475770
n: 3300
time : 179.025370 or 22.479189 
Mflops : 31983.262501
n: 3400
time : 195.859710 or 24.663009 
Mflops : 31882.208788
n: 3500
[4]
% ./dgemm
n: 3000
time : 54.259647 or 14.684309 
Mflops : 36786.204907
n: 3100
time : 60.899147 or 17.124599 
Mflops : 34804.447141
n: 3200
time : 64.295342 or 17.490787 
Mflops : 37480.577569
n: 3300
time : 69.781247 or 18.288840 
Mflops : 39311.284796
n: 3400
time : 79.234397 or 21.829736 
Mflops : 36020.187858
n: 3500
time : 83.905419 or 22.381237 
Mflops : 38324.289174
n: 3600
time : 92.195022 or 25.105942 
Mflops : 37177.621122
n: 3700
time : 97.718841 or 25.434243 
Mflops : 39841.319494
n: 3800
time : 105.740463 or 27.414029 
Mflops : 40042.592613
n: 3900
time : 113.980157 or 29.678505 
Mflops : 39984.635420
n: 4000
time : 122.941569 or 31.946174 
Mflops : 40077.412531
n: 4100
---DETAILS---
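
The benchmark source is not shown in this thread, but the printed Mflops figures are numerically consistent with ten n-by-n DGEMM calls per size, counted at 2n^3 + 2n^2 floating-point operations each and divided by the second `time` value. Both the repeat count and the operation count are inferred from the numbers above, not taken from the program. A Python sketch of that reading:

```python
def dgemm_mflops(n, seconds, repeats=10):
    """Mflops for `repeats` n-by-n DGEMM calls at 2n^3 + 2n^2 flops each.

    The repeat count and flop formula are inferred from the printed output,
    not read from the benchmark source.
    """
    flops = repeats * (2 * n**3 + 2 * n**2)
    return flops / seconds / 1e6


# Run [4], n = 3000 and n = 3100, using the second time value on each line.
print(round(dgemm_mflops(3000, 14.684309), 1))
print(round(dgemm_mflops(3100, 17.124599), 1))
```

If this reading is right, the first `time` value plays no part in the Mflops figure.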


From: Adam Vande More amvandem...@gmail.com
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 11:34:45 -0500

  time : 162.45 or 20.430651
  Mflops : 32087.318295
  n: 3300
  time : 178.497079 or 22.446093
  Mflops : 32030.420499
  n: 3400
  time : 195.550715 or 24.586152
  Mflops : 31981.873273
  n: 3500
  time : 213.403379 or 26.825058
  Mflops : 31975.513363
  n: 3600
  ...
  above output is on Core i7 920 (2.66GHz; TurboBoost on)

 My results:
 $ ./dgemm
 n: 3000
 time : 54.151302 or 28.189781
 Mflops : 19162.263125
 n: 3100
 time : 60.157449 or 32.214141
 Mflops : 18501.570537
 n: 3200
 time : 65.753191 or 34.114872
 Mflops : 19216.393378

 CPU:
 CPU: Intel(R) Core(TM)2 Duo CPU E7300  @ 2.66GHz (2653.35-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x10676  Stepping = 6
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x8e39d<SSE3,DTES64,MON,DS_CPL,EST,TM2,SSSE3,CX16,xTPR,PDCM,SSE4.1>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant
 ⋮
 FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 FreeBSD/SMP: 1 package(s) x 2 core(s)

 FreeBSD:
 FreeBSD 8.0-STABLE r205070 amd64

 Please note that the system was not dedicated to the test, I had
 Xorg+KDE3+thunderbird+skype+kopete+konsole(s) plus a bunch of daemons
 running.
 That probably explains irregularities in the results.

 I am not sure exactly how the theoretical maximum should be calculated; I used
 2 cores * 2.66G * 4 ≈ 21.3G.
 And so 19.2G / 21.3G ≈ 90%.

 Not as bad as what you get.
 Although not as good as what you report for Linux.
 But given the impurity and imprecision of my test…

 P.S. the machine is two-core obviously :-)
 Don't have anything with more cpus/cores handy.

Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
Oops, I missed this e-mail...

From: Adam Vande More amvandem...@gmail.com
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 11:45:04 -0500

 On Wed, Apr 14, 2010 at 11:34 AM, Adam Vande More amvandem...@gmail.com wrote:
 



 That's about 67% utilization, turning off HTT drops it more.  HTT on the
 newer cores is good, not bad.

 
 Well, that was completely contrary to some tests I'd run when I first got
 the cpu.
 
 With HTT off:
 
 n: 3000
 time : 44.705516 or 11.760183
 Mflops : 45932.959253
 n: 3100
 time : 50.598581 or 14.270123
 Mflops : 41766.437458
 n: 3200
 time : 55.748192 or 15.780977
 Mflops : 41541.458400
 n: 3300
 time : 62.072217 or 17.441431
 Mflops : 41221.262070
 n: 3400
 
 so that's about 79% right there.
 
 also if I run cpuset on the dgemm then the utilization is basically at the
 theoretical max for one core so at least that part is working.
 
 -- 
 Adam Vande More
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
Hi Andriy and Adam,

I also did the same thing as suggested.

My conclusion: on Core i7 920, 2.66GHz, TurboBoost on, HyperThreading off,
my GotoBLAS dgemm results were as follows.

* Summary of results
36-39GFlops, 81-87% of peak performance, without pinning
35-40GFlops, 78-89% of peak performance, with pinning

My observations
* Performance is somewhat unstable: one calculation gives 35GFlops, the
  next gives 40GFlops, and it keeps flipping. Jitter is observed.
* Pinning makes performance somewhat more stable, but we gain nothing more.

Details.
First I ran
%./dgemm
n: 3500
time : 84.431008 or 22.428125 
Mflops : 38244.168629
n: 3600
time : 90.162220 or 23.440381 
Mflops : 39819.284422
n: 3700
time : 101.427504 or 27.404345 
Mflops : 36977.121646

Note: 36-39GFlops 81-87% of peak performance

Then I pinned the threads to the cores as follows:

% procstat -t 1408
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  3  190 run     -
 1408 100161 dgemm            -                  2  190 run     -
 1408 100162 dgemm            -                  2  190 run     -
 1408 100163 dgemm            -                  1  189 run     -
 1408 100164 dgemm            -                  0  190 run     -
 1408 100165 dgemm            -                  3  189 run     -
 1408 100166 dgemm            -                  1  190 run     -
 1408 100167 dgemm            initial thread     0  190 run     -

% cpuset -t 100160 -l 0
% cpuset -t 100161 -l 0
% cpuset -t 100162 -l 1
% cpuset -t 100163 -l 1
% cpuset -t 100164 -l 2
% cpuset -t 100165 -l 2
% cpuset -t 100166 -l 3
% cpuset -t 100167 -l 3
then,
% procstat -t 1408
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  0  191 run     -
 1408 100161 dgemm            -                  0  191 run     -
 1408 100162 dgemm            -                  1  190 run     -
 1408 100163 dgemm            -                  1  190 run     -
 1408 100164 dgemm            -                  2  190 run     -
 1408 100165 dgemm            -                  2  190 run     -
 1408 100166 dgemm            -                  3  190 run     -
 1408 100167 dgemm            initial thread     3  190 run     -
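
The manual pinning above could be scripted roughly like this. It is only a sketch: it takes the text of a `procstat -t` listing, pulls out the thread IDs, and prints `cpuset -t TID -l CPU` lines in the same form as the commands used above, without running anything. The helper names and the `paired` switch are hypothetical; `paired=True` puts consecutive threads on the same CPU (0,0,1,1,...), `paired=False` deals them out round-robin (0,1,2,3,0,...).

```python
SAMPLE = """\
  PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
 1408 100160 dgemm            -                  3  190 run     -
 1408 100161 dgemm            -                  2  190 run     -
 1408 100162 dgemm            -                  2  190 run     -
 1408 100163 dgemm            -                  1  189 run     -
 1408 100164 dgemm            -                  0  190 run     -
 1408 100165 dgemm            -                  3  189 run     -
 1408 100166 dgemm            -                  1  190 run     -
 1408 100167 dgemm            initial thread     0  190 run     -
"""


def thread_ids(procstat_output):
    """Return the TID column of a `procstat -t` listing, in listing order."""
    tids = []
    for line in procstat_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric PID and TID; the header row does not.
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            tids.append(int(fields[1]))
    return tids


def pin_commands(tids, ncpus, paired=True):
    """Build one cpuset command line per thread (nothing is executed)."""
    per_cpu = -(-len(tids) // ncpus)  # ceiling division: threads per CPU
    commands = []
    for i, tid in enumerate(tids):
        cpu = i // per_cpu if paired else i % ncpus
        commands.append(f"cpuset -t {tid} -l {cpu}")
    return commands


for command in pin_commands(thread_ids(SAMPLE), ncpus=4):
    print(command)
```

In practice the listing would come from running `procstat -t` on the dgemm PID, and the printed lines could be pasted into a shell once they look right.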

n: 4000
time : 121.907696 or 31.475052 
Mflops : 40677.295630
n: 4100
time : 139.842701 or 38.702532 
Mflops : 35624.444587
n: 4200
time : 143.622179 or 36.725949 
Mflops : 40356.011158
n: 4300
time : 153.742976 or 39.465752 
Mflops : 40301.013511
n: 4400
time : 164.919566 or 42.380653 
Mflops : 40208.611317
n: 4500
time : 175.930335 or 45.422572 
Mflops : 40132.139469
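
To put a number on the jitter mentioned in the observations, a small sketch like the following can summarize the Mflops lines of a dgemm run. The sample text is the pinned run above; the function names and the choice of "spread" (max minus min, relative to the mean) are just one way of looking at it, not an established metric.

```python
import re
import statistics

DGEMM_OUTPUT = """\
n: 4000
time : 121.907696 or 31.475052
Mflops : 40677.295630
n: 4100
time : 139.842701 or 38.702532
Mflops : 35624.444587
n: 4200
time : 143.622179 or 36.725949
Mflops : 40356.011158
n: 4300
time : 153.742976 or 39.465752
Mflops : 40301.013511
n: 4400
time : 164.919566 or 42.380653
Mflops : 40208.611317
n: 4500
time : 175.930335 or 45.422572
Mflops : 40132.139469
"""


def mflops_values(text):
    """Extract every 'Mflops : <number>' value from the benchmark output."""
    pattern = r"^Mflops\s*:\s*([0-9.]+)"
    return [float(v) for v in re.findall(pattern, text, flags=re.MULTILINE)]


def summarize(values):
    """Min, max and mean in GFlops, plus (max - min) as a percentage of the mean."""
    lo, hi = min(values), max(values)
    mean = statistics.mean(values)
    return {
        "min_gflops": lo / 1000.0,
        "max_gflops": hi / 1000.0,
        "mean_gflops": mean / 1000.0,
        "spread_pct": 100.0 * (hi - lo) / mean,
    }


stats = summarize(mflops_values(DGEMM_OUTPUT))
for name, value in stats.items():
    print(f"{name:12s}: {value:.2f}")
```

Running the same summary over an unpinned run and a pinned run would show directly whether pinning narrows the spread.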

Thanks

From: Adam Vande More amvandem...@gmail.com
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 12:47:31 -0500

 On Wed, Apr 14, 2010 at 11:51 AM, Andriy Gapon a...@freebsd.org wrote:
 
 on 14/04/2010 19:45 Adam Vande More said the following:
 
  also if I run cpuset on the dgemm then the utilization is basically at
  the theoretical max for one core so at least that part is working.

 You can also try procstat -t pid to find out thread IDs and cpuset -t to
 pin the threads to the cores.

 
 it gets to around 90% doing that.
 
 time : 103.617271 or 27.140992
 Mflops : 47172.925449
 n: 4100
 time : 113.910669 or 30.520677
 Mflops : 45174.496186
 n: 4200
 time : 121.880695 or 32.068070
 Mflops : 46217.711013
 n: 4300
 
 tried a couple of different thread orders but didn't seem to make a
 difference.
 
 galacticdominator% procstat -t 1922
   PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
  1922 100092 dgemm            initial thread     0  190 run     -
  1922 100268 dgemm            -                  1  190 run     -
  1922 100270 dgemm            -                  1  191 run     -
  1922 100272 dgemm            -                  3  190 run     -
  1922 100273 dgemm            -                  2  191 run     -
  1922 100274 dgemm            -                  2  191 run     -
  1922 100282 dgemm            -                  0  190 run     -
  1922 100283 dgemm            -                  3  190 run     -
 
 galacticdominator% cpuset -t 100092 -l 0
 galacticdominator% cpuset -t 100268 -l 1
 galacticdominator% cpuset -t 100270 -l 2
 galacticdominator% cpuset -t 100272 -l 3
 galacticdominator% cpuset -t 100273 -l 0
 galacticdominator% cpuset -t 100274 -l 1
 galacticdominator% cpuset -t 100282 -l 2
 galacticdominator% cpuset -t 100283 -l 3
 
 
 galacticdominator% cpuset -t 100092 -l 0
 galacticdominator% cpuset -t 100268 -l 0
 galacticdominator% cpuset -t 100270 -l 1
 galacticdominator% cpuset -t 100272 -l 1
 galacticdominator% cpuset -t 100273 -l 2
 galacticdominator% cpuset -t 100274 -l 2
 galacticdominator% cpuset -t 100282 -l 3
 galacticdominator% cpuset -t

Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Maho NAKATA
Hi Adam,

From: Adam Vande More amvandem...@gmail.com
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 12:47:31 -0500

 Since this is a full fledged desktop environment, 90% utilization seems
 pretty good.  

No, I don't think so. On Ubuntu, which on my machine also runs a full
desktop environment, GotoBLAS reaches about 95% with dgemm. dgemm on Linux
is a lot more stable than on FreeBSD, and clearly faster.

on Ubuntu
$ ./dgemm
n: 3000
time : 51.18 or 12.795519 
Mflops : 42216.341930
n: 3100
time : 56.28 or 14.261719 
Mflops : 41791.049205
n: 3200
time : 61.35 or 15.631380 
Mflops : 41939.023080
n: 3300
time : 67.79 or 17.247202 
Mflops : 41685.474166
n: 3400
time : 73.80 or 18.471321 
Mflops : 42569.300032
n: 3500
time : 81.48 or 20.781936 
Mflops : 41273.585044
n: 3600
time : 88.17 or 22.816965 
Mflops : 40907.246233
n: 3700
time : 95.21 or 23.864101 
Mflops : 42462.684969
n: 3800

thanks
-- Nakata Maho http://accc.riken.jp/maho/ , http://ja.openoffice.org/ 
   Nakata Maho's PGP public keys: http://accc.riken.jp/maho/maho.pgp.txt


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA cha...@mac.com wrote:
 Hi Andriy and Adam

 My test again. No desktop, etc. I just run dgemm.
 Contrary to Adam's result, Hyper Threading makes the performance worse.
 all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)

 Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
 Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

 Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
 Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

Doesn't this make sense? Hyperthreaded cores in Intel procs still
provide an incomplete set of registers as they're logical processors,
so I would expect for things to be slower if they're automatically run
on the SMT cores instead of the physical ones.

Is there a weighting scheme to SCHED_ULE where logical processors
(like the SMT variety) get a lower score than real processors do, and
thus get scheduled for less intensive interrupting tasks, or maybe
just don't get scheduled in high use scenarios like it would if it was
a physical processor?

Thanks,
-Garrett


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Garrett Cooper
On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper yanef...@gmail.com wrote:
 On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA cha...@mac.com wrote:
 Hi Andriy and Adam

 My test again. No desktop, etc. I just run dgemm.
 Contrary to Adam's result, Hyper Threading makes the performance worse.
 all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)

 Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
 Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

 Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
 Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

 Doesn't this make sense? Hyperthreaded cores in Intel procs still
 provide an incomplete set of registers as they're logical processors,
 so I would expect for things to be slower if they're automatically run
 on the SMT cores instead of the physical ones.

 Is there a weighting scheme to SCHED_ULE where logical processors
 (like the SMT variety) get a lower score than real processors do, and
 thus get scheduled for less intensive interrupting tasks, or maybe
 just don't get scheduled in high use scenarios like it would if it was
 a physical processor?

Err... wait. Didn't see that the turbo boost results didn't scale
linearly or align with one another until just a sec ago. Nevermind my
previous comment.
-Garrett


Re: HyperThreading makes worse to me (was Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Ian Smith
On Wed, 14 Apr 2010, Garrett Cooper wrote:
  On Wed, Apr 14, 2010 at 7:49 PM, Garrett Cooper yanef...@gmail.com wrote:
   On Wed, Apr 14, 2010 at 5:46 PM, Maho NAKATA cha...@mac.com wrote:
   Hi Andriy and Adam
  
   My test again. No desktop, etc. I just run dgemm.
   Contrary to Adam's result, Hyper Threading makes the performance worse.
   all tests are done on Core i7 920 @ 2.67GHz. (TurboBoost @2.8GHz)
  
   Turbo Boost off, Hyper threading off: 82% (35GFlops)    [1]
   Turbo Boost off, Hyper threading off: 72% (30.5GFlops)  [2]

Er, shouldn't one of those say HTT on?  and/or Turbo boost on?  Else 
they're both the same test as [4] but with different results?

   Turbo Boost on,  Hyper threading on: 71% (32GFlops)    [3]
   Turbo Boost off, Hyper threading off: 84-89% (38-40GFlops) [4]

Clarification of all four possible test configs - 8 if you add pinning 
CPUs or not - might make this a bit clearer?

   Doesn't this make sense? Hyperthreaded cores in Intel procs still
   provide an incomplete set of registers as they're logical processors,
   so I would expect for things to be slower if they're automatically run
   on the SMT cores instead of the physical ones.

Since we're talking FP, do HTT 'cores' share an FPU, or have their own?  
If contended, you'd have to expect worse (at least FP) performance, no?

   Is there a weighting scheme to SCHED_ULE where logical processors
   (like the SMT variety) get a lower score than real processors do, and
   thus get scheduled for less intensive interrupting tasks, or maybe
   just don't get scheduled in high use scenarios like it would if it was
   a physical processor?
  
  Err... wait. Didn't see that the turbo boost results didn't scale
  linearly or align with one another until just a sec ago. Nevermind my
  previous comment.

Waiting for the fog to lift ..

cheers, Ian

Linux static linked ver doesn't work on FBSD (Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920)

2010-04-14 Thread Maho NAKATA
From: Pieter de Goeje pie...@degoeje.nl
Subject: Re: How to reproduce: Re: Only 70% of theoretical peak performance on 
FreeBSD 8/amd64, Corei7 920
Date: Wed, 14 Apr 2010 16:05:18 +0200

 I think the best test would be to run a statically compiled linux binary on 
 FreeBSD. That way the compiler settings are exactly the same.

It is not possible to run the Linux amd64 binary on FreeBSD amd64,
and the i386 version does not work either. GotoBLAS uses special system calls.

% ./dgemm
linux_sys_futex: unknown op 265
linux: pid 1264 (dgemm): syscall mbind not implemented
n: 3000
^C
It just halts.
-- Nakata Maho http://accc.riken.jp/maho/ , JA OOO http://ja.openoffice.org/
   Blog: http://blog.goo.ne.jp/nakatamaho/ , GPG key: 
http://accc.riken.jp/maho/maho.pgp.txt


Re: How to reproduce: Re: Only 70% of theoretical peak performance on FreeBSD 8/amd64, Corei7 920

2010-04-14 Thread Andriy Gapon
on 15/04/2010 04:20 Maho NAKATA said the following:
 Hi Andriy and Adam,
 
 I also did the same thing as suggested.
 
 My conclusion: on Core i7 920, 2.66GHz, TurboBoost on, HyperThreading off,

So HyperThreading is off.

 Then I pinned the threads to the cores as follows:
 
 % procstat -t 1408
   PID    TID COMM             TDNAME           CPU  PRI STATE   WCHAN
  1408 100160 dgemm            -                  3  190 run     -
  1408 100161 dgemm            -                  2  190 run     -
  1408 100162 dgemm            -                  2  190 run     -
  1408 100163 dgemm            -                  1  189 run     -
  1408 100164 dgemm            -                  0  190 run     -
  1408 100165 dgemm            -                  3  189 run     -
  1408 100166 dgemm            -                  1  190 run     -
  1408 100167 dgemm            initial thread     0  190 run     -

But there are still 8 threads.

Can you check how many threads you have on Linux with the same configuration?
Is it possible to tell GotoBLAS to use 4 threads?
If yes, can you also test that scenario?

Also, would it be possible for you to test recent 8-STABLE?
Just for the sake of experiment.
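
One possible way to try the 4-thread scenario, as a sketch only: as far as I know GotoBLAS picks up its thread count from the GOTO_NUM_THREADS or OMP_NUM_THREADS environment variables, but please treat those variable names as an assumption to check against the GotoBLAS documentation for the version in use. The helper names and the `./dgemm` path below are illustrative.

```python
import os
import subprocess


def env_with_threads(nthreads, base_env=None):
    """Copy an environment and set the (assumed) GotoBLAS thread-count variables."""
    env = dict(os.environ if base_env is None else base_env)
    env["GOTO_NUM_THREADS"] = str(nthreads)
    env["OMP_NUM_THREADS"] = str(nthreads)
    return env


def run_dgemm(nthreads, binary="./dgemm"):
    """Run the benchmark binary with the requested thread count."""
    return subprocess.run([binary], env=env_with_threads(nthreads), check=True)


# Example (not executed here): run_dgemm(4)
```

Checking `procstat -t` on the resulting process would confirm whether the thread count really dropped to 4.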
-- 
Andriy Gapon