On looking more carefully, I believe I was mistaken about thread assignment to cores - that seems to be handled well by OpenBLAS (and perhaps by Linux in general nowadays). Perhaps the erratic benchmarks under hyperthreading - even after heap management is tamed - arise when the operating system detects idle virtual cores and schedules disruptive processes on them.
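For reference, the allocation-free timing Ralph suggests below can be sketched as follows. This is a hedged sketch for current Julia, where the in-place multiply `A_mul_B!(c, a, b)` is spelled `mul!(c, a, b)` from the `LinearAlgebra` stdlib; the function name `gemm_gflops` and its parameters are illustrative, not from the thread:

```julia
using LinearAlgebra

# Time an in-place GEMM with preallocated buffers, so neither allocation
# nor garbage collection lands inside the timed region.
function gemm_gflops(n; nthreads = 2, reps = 10)
    BLAS.set_num_threads(nthreads)
    a = rand(n, n); b = rand(n, n); c = similar(a)
    mul!(c, a, b)                   # warm-up: compilation and first touch
    best = Inf
    for _ in 1:reps
        t = @elapsed mul!(c, a, b)  # no allocation inside the timed call
        best = min(best, t)
    end
    return 2 * n^3 / best / 1e9     # dense GEMM does ~2n^3 flops
end
```

Comparing, say, `gemm_gflops(4000, nthreads = 1)` against `nthreads = 4` then isolates the threading question from the allocation effects discussed below.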
On Friday, October 21, 2016 at 12:09:07 AM UTC-4, Ralph Smith wrote:
>
> That's interesting, I see the code in OpenBLAS. However, on the Linux
> systems I use, when I had hyperthreading enabled the allocations looked
> random, and I generally got less consistent benchmarks. I'll have to
> check that again.
>
> You can also avoid the memory allocation effects by something like
>
> using BenchmarkTools
> a = rand(n,n); b = rand(n,n); c = similar(a);
> @benchmark A_mul_B!($c, $a, $b)
>
> Of course this is only directly relevant to your real workload if that
> is dominated by sections where you can optimize away allocations and
> memory latency.
>
> On Thursday, October 20, 2016 at 11:00:41 PM UTC-4, Thomas Covert wrote:
>>
>> Thanks - I will try to figure out how to do that. I will note, however,
>> that the OpenBLAS FAQ suggests that OpenBLAS tries to avoid allocating
>> threads to the same physical core on machines with hyperthreading, so
>> perhaps this is not the cause:
>>
>> https://github.com/xianyi/OpenBLAS/blob/master/GotoBLAS_03FAQ.txt
>>
>> On Thursday, October 20, 2016 at 4:45:51 PM UTC-5, Stefan Karpinski wrote:
>>>
>>> I think Ralph is suggesting that you disable the CPU's hyperthreading
>>> if you run this kind of code often. We've done that on our benchmarking
>>> machines, for example.
>>>
>>> On Wed, Oct 19, 2016 at 11:47 PM, Thomas Covert <thom....@gmail.com> wrote:
>>>
>>>> So are you suggesting that real numerical workloads under
>>>> BLAS.set_num_threads(4) will indeed be faster than under
>>>> BLAS.set_num_threads(2)? That hasn't been my experience, and I figured
>>>> the peakflops() example would constitute an MWE. Is there another
>>>> workload you would suggest I try, to figure out whether this is just a
>>>> peakflops() aberration or a real issue?
>>>>
>>>> On Wednesday, October 19, 2016 at 8:28:16 PM UTC-5, Ralph Smith wrote:
>>>>>
>>>>> At least 2 things contribute to erratic results from peakflops().
>>>>> With hyperthreading, the threads are not always put on separate
>>>>> cores. Secondly, the measured time includes the allocation of the
>>>>> result matrix, so garbage collection affects some of the results.
>>>>> Most available advice says to disable hyperthreading on dedicated
>>>>> number crunchers (most full loads work slightly more efficiently
>>>>> without the extra context switching). The GC issue seems to be a
>>>>> mistake, if "peak" is to be taken seriously.
>>>>>
>>>>> On Wednesday, October 19, 2016 at 12:04:00 PM UTC-4, Thomas Covert wrote:
>>>>>>
>>>>>> I have a recent iMac with 4 physical cores (and 8 hyperthreads). I
>>>>>> would have thought that peakflops(N) for a large enough N should be
>>>>>> increasing in the number of threads I allow BLAS to use. I do find
>>>>>> that peakflops(N) with 1 thread is about half as high as
>>>>>> peakflops(N) with 2 threads, but there is no gain from 4 threads.
>>>>>> Are my expectations wrong here, or is it possible that BLAS is
>>>>>> somehow configured incorrectly on my machine? In the example below,
>>>>>> N = 6755, a number relevant for my work, but the results are similar
>>>>>> with 5000 or 10000.
>>>>>>
>>>>>> Here is my versioninfo():
>>>>>>
>>>>>> julia> versioninfo()
>>>>>> Julia Version 0.5.0
>>>>>> Commit 3c9d753* (2016-09-19 18:14 UTC)
>>>>>> Platform Info:
>>>>>>   System: Darwin (x86_64-apple-darwin15.6.0)
>>>>>>   CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>>>>>>   WORD_SIZE: 64
>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>>>   LAPACK: libopenblas
>>>>>>   LIBM: libopenlibm
>>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>>>
>>>>>> Here is an example peakflops() exercise:
>>>>>>
>>>>>> julia> BLAS.set_num_threads(1)
>>>>>>
>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>> 5.225580459387056e10
>>>>>>
>>>>>> julia> BLAS.set_num_threads(2)
>>>>>>
>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>> 1.004317640281997e11
>>>>>>
>>>>>> julia> BLAS.set_num_threads(4)
>>>>>>
>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>> 9.838116463900085e10
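The experiment quoted above can be repeated on current Julia roughly as follows. This is a hedged sketch, assuming Julia >= 1.0, where `peakflops` and `BLAS` live in the `LinearAlgebra` stdlib and `mean` in `Statistics`; the function name `sweep_peakflops` is illustrative. Forcing a collection before each sweep is a partial mitigation of Ralph's point that the timed region includes allocating the result matrix:

```julia
using LinearAlgebra
using Statistics: mean

# Average peakflops over several runs at each BLAS thread count.
function sweep_peakflops(n = 2000; counts = (1, 2, 4), reps = 5)
    results = Dict{Int,Float64}()
    for t in counts
        BLAS.set_num_threads(t)
        GC.gc()  # collect up front so GC is less likely to fire mid-timing
        results[t] = mean(LinearAlgebra.peakflops(n) for _ in 1:reps)
    end
    return results
end
```

On a 4-core/8-thread part like the i7-4790K above, one would expect the numbers to roughly double from 1 to 2 threads and then plateau near the physical core count, as in the quoted results.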