On looking more carefully, I believe I was mistaken about thread assignment to cores - that seems to be handled well by OpenBLAS (and perhaps by Linux in general nowadays). Perhaps the erratic benchmarks under hyperthreading - even after heap management is tamed - arise when the operating system detects idle virtual cores and schedules disruptive processes on them.
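For reference, the allocation-free timing Ralph suggests below can be sketched as follows. This is a hedged sketch for current Julia, where the in-place multiply `A_mul_B!(c, a, b)` is spelled `mul!(c, a, b)` from the `LinearAlgebra` stdlib; the function name `gemm_gflops` and its parameters are illustrative, not from the thread:

```julia
using LinearAlgebra

# Time an in-place GEMM with preallocated buffers, so neither allocation
# nor garbage collection lands inside the timed region.
function gemm_gflops(n; nthreads = 2, reps = 10)
    BLAS.set_num_threads(nthreads)
    a = rand(n, n); b = rand(n, n); c = similar(a)
    mul!(c, a, b)                   # warm-up: compilation and first touch
    best = Inf
    for _ in 1:reps
        t = @elapsed mul!(c, a, b)  # no allocation inside the timed call
        best = min(best, t)
    end
    return 2 * n^3 / best / 1e9     # dense GEMM does ~2n^3 flops
end
```

Comparing, say, `gemm_gflops(4000, nthreads = 1)` against `nthreads = 4` then isolates the threading question from the allocation effects discussed below.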
On Friday, October 21, 2016 at 12:09:07 AM UTC-4, Ralph Smith wrote:
>
> That's interesting, I see the code in OpenBLAS. However, on the Linux
> systems I use, when I had hyperthreading enabled the allocations looked
> random, and I generally got less consistent benchmarks. I'll have to
> check that again.
>
> You can also avoid the memory allocation effects by something like
>
> using BenchmarkTools
> a = rand(n,n); b = rand(n,n); c = similar(a);
> @benchmark A_mul_B!($c, $a, $b)
>
> Of course this is only directly relevant to your real workload if that
> is dominated by sections where you can optimize away allocations and
> memory latency.
>
> On Thursday, October 20, 2016 at 11:00:41 PM UTC-4, Thomas Covert wrote:
>>
>> Thanks - I will try to figure out how to do that. I will note, however,
>> that the OpenBLAS FAQ suggests that OpenBLAS tries to avoid allocating
>> threads to the same physical core on machines with hyperthreading, so
>> perhaps this is not the cause:
>>
>> https://github.com/xianyi/OpenBLAS/blob/master/GotoBLAS_03FAQ.txt
>>
>> On Thursday, October 20, 2016 at 4:45:51 PM UTC-5, Stefan Karpinski wrote:
>>>
>>> I think Ralph is suggesting that you disable the CPU's hyperthreading
>>> if you run this kind of code often. We've done that on our benchmarking
>>> machines, for example.
>>>
>>> On Wed, Oct 19, 2016 at 11:47 PM, Thomas Covert <thom....@gmail.com> wrote:
>>>
>>>> So are you suggesting that real numerical workloads under
>>>> BLAS.set_num_threads(4) will indeed be faster than under
>>>> BLAS.set_num_threads(2)? That hasn't been my experience, and I figured
>>>> the peakflops() example would constitute an MWE. Is there another
>>>> workload you would suggest I try, to figure out whether this is just a
>>>> peakflops() aberration or a real issue?
>>>>
>>>> On Wednesday, October 19, 2016 at 8:28:16 PM UTC-5, Ralph Smith wrote:
>>>>>
>>>>> At least 2 things contribute to erratic results from peakflops().
>>>>> With hyperthreading, the threads are not always put on separate
>>>>> cores. Secondly, the measured time includes the allocation of the
>>>>> result matrix, so garbage collection affects some of the results.
>>>>> Most available advice says to disable hyperthreading on dedicated
>>>>> number crunchers (most full loads work slightly more efficiently
>>>>> without the extra context switching). The GC issue seems to be a
>>>>> mistake, if "peak" is to be taken seriously.
>>>>>
>>>>> On Wednesday, October 19, 2016 at 12:04:00 PM UTC-4, Thomas Covert wrote:
>>>>>>
>>>>>> I have a recent iMac with 4 physical cores (and 8 hyperthreads). I
>>>>>> would have thought that peakflops(N) for a large enough N should be
>>>>>> increasing in the number of threads I allow BLAS to use. I do find
>>>>>> that peakflops(N) with 1 thread is about half as high as
>>>>>> peakflops(N) with 2 threads, but there is no gain from 4 threads.
>>>>>> Are my expectations wrong here, or is it possible that BLAS is
>>>>>> somehow configured incorrectly on my machine? In the example below,
>>>>>> N = 6755, a number relevant for my work, but the results are similar
>>>>>> with 5000 or 10000.
>>>>>>
>>>>>> Here is my versioninfo():
>>>>>>
>>>>>> julia> versioninfo()
>>>>>> Julia Version 0.5.0
>>>>>> Commit 3c9d753* (2016-09-19 18:14 UTC)
>>>>>> Platform Info:
>>>>>>   System: Darwin (x86_64-apple-darwin15.6.0)
>>>>>>   CPU: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
>>>>>>   WORD_SIZE: 64
>>>>>>   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
>>>>>>   LAPACK: libopenblas
>>>>>>   LIBM: libopenlibm
>>>>>>   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
>>>>>>
>>>>>> Here is an example peakflops() exercise:
>>>>>>
>>>>>> julia> BLAS.set_num_threads(1)
>>>>>>
>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>> 5.225580459387056e10
>>>>>>
>>>>>> julia> BLAS.set_num_threads(2)
>>>>>>
>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>> 1.004317640281997e11
>>>>>>
>>>>>> julia> BLAS.set_num_threads(4)
>>>>>>
>>>>>> julia> mean(peakflops(6755) for i=1:10)
>>>>>> 9.838116463900085e10
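The experiment quoted above can be repeated on current Julia roughly as follows. This is a hedged sketch, assuming Julia >= 1.0, where `peakflops` and `BLAS` live in the `LinearAlgebra` stdlib and `mean` in `Statistics`; the function name `sweep_peakflops` is illustrative. Forcing a collection before each sweep is a partial mitigation of Ralph's point that the timed region includes allocating the result matrix:

```julia
using LinearAlgebra
using Statistics: mean

# Average peakflops over several runs at each BLAS thread count.
function sweep_peakflops(n = 2000; counts = (1, 2, 4), reps = 5)
    results = Dict{Int,Float64}()
    for t in counts
        BLAS.set_num_threads(t)
        GC.gc()  # collect up front so GC is less likely to fire mid-timing
        results[t] = mean(LinearAlgebra.peakflops(n) for _ in 1:reps)
    end
    return results
end
```

On a 4-core/8-thread part like the i7-4790K above, one would expect the numbers to roughly double from 1 to 2 threads and then plateau near the physical core count, as in the quoted results.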