That is very interesting. Any idea why they are doing this?
-viral
On Friday, December 5, 2014 3:47:19 AM UTC+5:30, Johan Sigfrids wrote:
>
> The new AMD architectures are weird in that two integer cores share the
> same FP hardware, so you have half as many FP cores as integer cores. The
> reported number of cores is based on integer cores.
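As a quick sketch of the arithmetic that sharing implies for the machine discussed below (the 32-core figure is the CPU_CORES value reported later in this thread; the per-module layout is as described above):

```julia
# Two integer cores per module share one FP unit on these AMD parts, so the
# effective FP parallelism is half the reported (integer) core count.
reported_cores = 32              # CPU_CORES on the Opteron machine below
fp_units = reported_cores ÷ 2    # integer division
println(fp_units)                # prints 16
```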
>
> On Thursday, December 4, 2014 11:13:38 PM UTC+2, Douglas Bates wrote:
>>
>> On Thursday, December 4, 2014 2:32:01 PM UTC-6, Stefan Karpinski wrote:
>>>
>>> Hyperthreading? If the threshold is 16 but you're really only getting 8
>>> cores, you might only get scaling up to 8.
>>>
>>
>> This machine has AMD Opteron processors. I know Intel uses
>> hyperthreading; does AMD also use it?
>>
>> I recompiled OpenBLAS with NUM_THREADS set to 32 but still get the
>> same result: essentially no difference between 8 and 16 threads.
>> julia> blas_set_num_threads(4)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>> 8.66448e10
>> 8.67398e10
>> 8.67465e10
>> 8.68957e10
>> 8.69717e10
>> 8.70661e10
>>
>> julia> blas_set_num_threads(8)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>> 1.67257e11
>> 1.66041e11
>> 1.65284e11
>> 1.65565e11
>> 1.65867e11
>> 1.65596e11
>>
>> julia> blas_set_num_threads(16)
>>
>> julia> [peakflops(8000)::Float64 for i in 1:6]
>> 6-element Array{Float64,1}:
>> 1.65354e11
>> 1.7099e11
>> 1.70911e11
>> 1.71407e11
>> 1.71238e11
>> 1.70983e11
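For what it's worth, the plateau is easy to quantify from the medians of the runs above (values copied from the output, rounded):

```julia
# Median GFLOPS from the peakflops(8000) runs above, rounded.
flops4  = 8.68e10    # 4 BLAS threads
flops8  = 1.66e11    # 8 BLAS threads
flops16 = 1.71e11    # 16 BLAS threads

speedup_4_to_8  = flops8 / flops4    # ~1.91: near-linear scaling
speedup_8_to_16 = flops16 / flops8   # ~1.03: essentially flat
```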
>>
>>> > On Dec 4, 2014, at 3:24 PM, Viral Shah <[email protected]> wrote:
>>> >
>>> >
>>> >> On 05-Dec-2014, at 1:32 am, Douglas Bates <[email protected]> wrote:
>>> >>
>>> >> On Thursday, December 4, 2014 1:50:06 PM UTC-6, Viral Shah wrote:
>>> >>> On 05-Dec-2014, at 1:16 am, Douglas Bates <[email protected]> wrote:
>>> >>>
>>> >>> Thanks, I'll try that. I'm still curious as to why there is so
>>> little difference between 8 and 16 threads.
>>> >>
>>> >> peakflops() just performs a matrix multiplication to estimate the
>>> flops. It uses a 2000x2000 matrix by default, which is good for most
>>> laptops, but for bigger machines with more cores, one often needs to use a
>>> larger matrix to see the speedup.
>>> >>
>>> >> peakflops(8000) should give a good indication. I am not sure what the
>>> running time will be, so you may want to gradually increase the size.
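Roughly, that measurement amounts to timing one dense multiply; a minimal sketch, assuming peakflops counts 2n^3 floating-point operations for an n-by-n double-precision gemm as described above (approx_peakflops is a made-up name for illustration, not the actual implementation):

```julia
# Time an n-by-n double-precision matrix multiply and divide the nominal
# 2n^3 flop count by the elapsed time, as peakflops(n) is described above.
function approx_peakflops(n::Integer=2000)
    a = rand(n, n)
    b = rand(n, n)
    a * b                        # warm up: compilation and BLAS thread init
    t = @elapsed a * b
    return 2 * n^3 / t
end
```

A larger n amortizes per-thread startup and synchronization overhead, which is why a bigger matrix is needed to see scaling on many-core machines.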
>>> >>
>>> >>
>>> >> 8000 is reasonable on this machine and it does stabilize the results
>>> from repeated timings. But I still see essentially no difference between
>>> 8 and 16 threads. I wonder if NUM_THREADS is somehow being set to 8,
>>> although looking in deps/Makefile it does seem that it should be 16.
>>> >
>>> >
>>> > I tried on julia.mit.edu, and I do see a scale-up from 1->16
>>> processors with peakflops(4000). That suggests the build is
>>> OK and that OpenBLAS can scale. I think it would be best to check with
>>> Xianyi about this - perhaps file an issue against OpenBLAS?
>>> >
>>> > Perhaps someone here may have some other ideas too.
>>> >
>>> > -viral
>>> >
>>> >
>>> >>
>>> >> julia> blas_set_num_threads(4)
>>> >>
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6]
>>> >> 6-element Array{Float64,1}:
>>> >> 8.66823e10
>>> >> 8.65584e10
>>> >> 8.65692e10
>>> >> 8.64753e10
>>> >> 8.64083e10
>>> >> 8.63359e10
>>> >>
>>> >> julia> blas_set_num_threads(8)
>>> >>
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6]
>>> >> 6-element Array{Float64,1}:
>>> >> 1.68008e11
>>> >> 1.67772e11
>>> >> 1.67378e11
>>> >> 1.67397e11
>>> >> 1.6746e11
>>> >> 1.67623e11
>>> >>
>>> >> julia> blas_set_num_threads(16)
>>> >>
>>> >> julia> [peakflops(8000)::Float64 for i in 1:6]
>>> >> 6-element Array{Float64,1}:
>>> >> 1.66779e11
>>> >> 1.70068e11
>>> >> 1.698e11
>>> >> 1.70419e11
>>> >> 1.70601e11
>>> >> 1.67226e11
>>> >>
>>> >>
>>> >>
>>> >> -viral
>>> >>
>>> >>
>>> >>
>>> >>>
>>> >>> -viral
>>> >>>
>>> >>> On Friday, December 5, 2014 1:00:39 AM UTC+5:30, Douglas Bates
>>> wrote:
>>> >>> I have been working on a package
>>> https://github.com/dmbates/ParalllelGLM.jl and noticed some
>>> peculiarities in the timings on a couple of shared-memory servers, each
>>> with 32 cores. In particular changing from 16 workers to 32 workers
>>> actually slowed down the fitting process. So I decided to check how
>>> changing the number of OpenBLAS threads affected the peakflops() result. I
>>> end up with essentially the same results for 8, 16 and 32 threads on this
>>> machine with 32 cores. Is that to be expected?
>>> >>>
>>> >>>                _
>>> >>>    _       _ _(_)_     |  A fresh approach to technical computing
>>> >>>   (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
>>> >>>    _ _   _| |_  __ _   |  Type "help()" for help.
>>> >>>   | | | | | | |/ _` |  |
>>> >>>   | | |_| | | | (_| |  |  Version 0.4.0-dev+1944 (2014-12-04 15:06 UTC)
>>> >>>  _/ |\__'_|_|_|\__'_|  |  Commit 87e9ee1* (0 days old master)
>>> >>> |__/                   |  x86_64-unknown-linux-gnu
>>> >>>
>>> >>> julia> [peakflops()::Float64 for i in 1:6]
>>> >>> 6-element Array{Float64,1}:
>>> >>> 1.41151e11
>>> >>> 1.1676e11
>>> >>> 1.27597e11
>>> >>> 1.27607e11
>>> >>> 1.27518e11
>>> >>> 1.27478e11
>>> >>>
>>> >>> julia> CPU_CORES
>>> >>> 32
>>> >>>
>>> >>> julia> blas_set_num_threads(16)
>>> >>>
>>> >>> julia> [peakflops()::Float64 for i in 1:6]
>>> >>> 6-element Array{Float64,1}:
>>> >>> 1.23523e11
>>> >>> 1.27119e11
>>> >>> 1.11381e11
>>> >>> 1.17847e11
>>> >>> 1.28415e11
>>> >>> 1.17998e11
>>> >>>
>>> >>> julia> blas_set_num_threads(8)
>>> >>>
>>> >>> julia> [peakflops()::Float64 for i in 1:6]
>>> >>> 6-element Array{Float64,1}:
>>> >>> 1.25194e11
>>> >>> 1.20969e11
>>> >>> 1.25777e11
>>> >>> 1.20757e11
>>> >>> 1.26086e11
>>> >>> 1.20958e11
>>> >>>
>>> >>> julia> versioninfo(true)
>>> >>> Julia Version 0.4.0-dev+1944
>>> >>> Commit 87e9ee1* (2014-12-04 15:06 UTC)
>>> >>> Platform Info:
>>> >>> System: Linux (x86_64-unknown-linux-gnu)
>>> >>> CPU: AMD Opteron(tm) Processor 6328
>>> >>> WORD_SIZE: 64
>>> >>> "Red Hat Enterprise Linux Server release 6.5 (Santiago)"
>>> >>> uname: Linux 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Dec 13 06:58:20 EST 2013 x86_64 x86_64
>>> >>> Memory: 504.78467178344727 GB (508598.8125 MB free)
>>> >>> Uptime: 261586.0 sec
>>> >>> Load Avg: 0.08740234375 0.19384765625 0.8330078125
>>> >>> AMD Opteron(tm) Processor 6328 :
>>> >>>        speed       user      nice       sys         idle  irq
>>> >>> #1-32  3199 MHz  1855973 s  23392 s  670932 s  834073187 s  21 s
>>> >>>
>>> >>> BLAS: libopenblas (USE64BITINT NO_AFFINITY PILEDRIVER)
>>> >>> LAPACK: libopenblas
>>> >>> LIBM: libopenlibm
>>> >>> LLVM: libLLVM-3.5.0
>>> >>> Environment:
>>> >>> TERM = screen
>>> >>> PATH = /s/cmake-3.0.2/bin:/s/gcc-4.9.2/bin:./u/b/a/bates/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/s/std/bin:/usr/afsws/bin:
>>> >>> WWW_HOME = http://www.stat.wisc.edu/
>>> >>> JULIA_PKGDIR = /scratch/bates/.julia
>>> >>> HOME = /u/b/a/bates
>>> >>>
>>> >>> Package Directory: /scratch/bates/.julia/v0.4
>>> >>> 2 required packages:
>>> >>> - Distributions 0.6.1
>>> >>> - Docile 0.3.2
>>> >>> 5 additional packages:
>>> >>> - ArrayViews 0.4.8
>>> >>> - Compat 0.2.5
>>> >>> - PDMats 0.3.1
>>> >>> - ParallelGLM 0.0.0- master (unregistered)
>>> >>> - StatsBase 0.6.10