On Wednesday, April 6, 2016 at 07:25 -0700, Johannes Wagner wrote:
> And one last update: I disabled Hyperthreading on the i7, and now it
> performs as expected.
> 
> i7 no HT:
> 
> 1000000 loops, best of 3: 879.57 ns per loop
> 100000 loops, best of 3: 9.88 µs per loop
> 100 loops, best of 3: 4.46 ms per loop
> 10000 loops, best of 3: 69.89 µs per loop
> 10000 loops, best of 3: 26.67 µs per loop
> 10 loops, best of 3: 95.08 ms per loop
> 
> i7 with HT:
> 
> 1000000 loops, best of 3: 871.68 ns per loop
> 10000 loops, best of 3: 10.84 µs per loop
> 100 loops, best of 3: 5.19 ms per loop
> 10000 loops, best of 3: 71.35 µs per loop
> 10000 loops, best of 3: 26.65 µs per loop
> 1 loops, best of 3: 159.99 ms per loop
> 
> So all calls inside the loop are the same speed, but the whole loop,
> with identical assembly code, is ~60% slower when HT is enabled. Where
> can this problem arise from, then? LLVM? Or thread pinning in the OS?
> Probably not a Julia problem then...
Indeed, in the last assembly output you sent, there are no differences
between i5 and i7 (as expected). So this is neither Julia's nor LLVM's
fault. I have no idea whether there might be an issue with the CPU
itself, but it's quite surprising.
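
One more thing that may be worth ruling out: s(r, x) does a BLAS
matrix-vector product on every iteration, and OpenBLAS runs its own
thread pool. With HT enabled, those threads can be scheduled onto
logical cores that share execution units, which could slow down the
whole loop without affecting the individual call timings. A minimal
check (just a sketch; on 0.4 the same function is also reachable as
blas_set_num_threads):

Base.LinAlg.BLAS.set_num_threads(1)  # force single-threaded OpenBLAS
@timeit s(r, x)                      # re-run the loop benchmark with HT on

If the ~60% gap disappears, that would point at OpenBLAS thread
placement rather than code generation.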


Regards


> > On Tuesday, April 5, 2016 at 10:18 -0700, Johannes Wagner wrote:
> > > hey Milan,
> > > so consider the following code:
> > > 
> > > Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git") 
> > > using TimeIt 
> > > 
> > > v = rand(3) 
> > > r = rand(6000,3) 
> > > x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))' 
> > > 
> > > dotprods = r * x[2,:] 
> > > imexp    = cis(dotprods) 
> > > sumprod  = sum(imexp) * sum(conj(imexp)) 
> > > 
> > > f(r, x) = r * x[2,:]     
> > > g(r, x) = r * x' 
> > > h(imexp)    = sum(imexp) * sum(conj(imexp)) 
> > > 
> > > function s(r, x) 
> > >         result = zeros(size(x,1)) 
> > >         for i = 1:size(x,1) 
> > >                 imexp    = cis(r * x[i,:]) 
> > >                 result[i]= sum(imexp) * sum(conj(imexp)) 
> > >         end 
> > >         return result 
> > > end 
> > > 
> > > @timeit zeros(size(x,1)) 
> > > @timeit f(r,x) 
> > > @timeit g(r,x) 
> > > @timeit cis(dotprods) 
> > > @timeit h(imexp) 
> > > @timeit s(r,x) 
> > > 
> > > @code_native f(r,x) 
> > > @code_native g(r,x) 
> > > @code_native cis(dotprods) 
> > > @code_native h(imexp) 
> > > @code_native s(r,x) 
> > > 
> > > and I attached the output of the last @code_native s(r,x) as text
> > > files, for the binary tarball as well as for the latest nalimilan
> > > update. For the whole function s, the emitted code actually looks
> > > the same everywhere.
> > > But s(r,x) is the one that is considerably slower on the i7 than the
> > > i5, whereas all the other timed calls run at more or less the same
> > > speed on i5 and i7. Here are the timings, in the same order as above
> > > (all run repeatedly, so compile time is not included, even for the
> > > last one):
> > > 
> > > i7: 
> > > 1000000 loops, best of 3: 871.68 ns per loop 
> > > 10000 loops, best of 3: 10.84 µs per loop 
> > > 100 loops, best of 3: 5.19 ms per loop 
> > > 10000 loops, best of 3: 71.35 µs per loop 
> > > 10000 loops, best of 3: 26.65 µs per loop 
> > > 1 loops, best of 3: 159.99 ms per loop 
> > > 
> > > i5: 
> > > 100000 loops, best of 3: 1.01 µs per loop 
> > > 10000 loops, best of 3: 10.93 µs per loop 
> > > 100 loops, best of 3: 5.09 ms per loop 
> > > 10000 loops, best of 3: 75.93 µs per loop 
> > > 10000 loops, best of 3: 29.23 µs per loop 
> > > 1 loops, best of 3: 103.70 ms per loop 
> > > 
> > > So based on the calls inside s(r,x), the i7 should be faster, but
> > > the whole s(r,x) is slower. Still clueless... and I don't know how
> > > to pin this down further...
> > Thanks. I think you got mixed up with the different files, as the
> > versioninfo() output indicates. Anyway, there's enough info to check
> > which file corresponds to which Julia version, so that's OK. Indeed,
> > when comparing the tests with binary tarballs, there's a call
> > to jl_alloc_array_1d with the i7 (julia050_tarball-haswell-i7.txt),
> > which is not present with the i5 (incorrectly named
> > julia050_haswell-i7.txt). This is really unexpected.
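> >
> > If you want to see which source line triggers that extra allocation,
> > the allocation tracker might help (just a sketch; the
> > --track-allocation flag exists on both 0.4 and 0.5):
> >
> > # start with: julia --track-allocation=user
> > s(r, x)                      # run once so compilation is excluded
> > Profile.clear_malloc_data()  # reset the counters after the warm-up
> > s(r, x)
> > # per-line byte counts are written to .mem files next to the sources on exit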
> > 
> > Could you file an issue on GitHub with a summary of what we've found
> > (essentially your message), as well as links to 3 Gists giving the
> > code and the contents of the two .txt files I mentioned above? That
> > would be very helpful. Do not mention the Fedora packages at all, as
> > the binary tarballs are closer to what Julia developers use.
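> >
> > In the meantime, the sampling profiler may show where the extra time
> > goes inside s on the i7 (a minimal sketch using the stock Profile
> > module):
> >
> > s(r, x)          # warm up first so compilation isn't profiled
> > @profile for i = 1:10
> >     s(r, x)
> > end
> > Profile.print()  # see whether samples pile up in gemv, cis or allocation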
> > 
> > 
> > Regards 
> > 
> > 
> > > cheers, Johannes 
> > > 
> > > 
> > > 
> > > 
> > > > On Monday, April 4, 2016 at 10:36 -0700, Johannes Wagner wrote:
> > > > > hey guys,  
> > > > > so attached you'll find text files with @code_native output for
> > > > > the instructions
> > > > > - r * x[1,:]
> > > > > - cis(imexp)
> > > > > - sum(imexp) * sum(conj(imexp))
> > > > >
> > > > > for Julia 0.5.
> > > > >  
> > > > > The hardware I run on is a Haswell i5 machine, a Haswell i7
> > > > > machine, and an IvyBridge i5 machine. It turned out that on the
> > > > > Haswell i5 machine the code also runs fast. Only the Haswell i7
> > > > > machine is the slow one. This really drove me nuts: first I
> > > > > thought it was the OS, then the architecture, and now it's just
> > > > > i5 vs. i7... Anyway, I don't know anything about x86 assembly,
> > > > > but the Julia 0.4.5 code is the same on all machines. However,
> > > > > for the dot product, the 0.5 code already has 2 different
> > > > > instructions on the i5 vs. the i7 (lines 44 & 47). Same for the
> > > > > cis call (line 149...). And the IvyBridge i5 code is similar to
> > > > > the Haswell i5. I also included versioninfo() at the top of each
> > > > > file, so you can just look at a vimdiff of the julia0.5 files...
> > > > > Can anyone make sense out of this?
> > > > I'm definitely not an expert in assembly, but that additional leaq
> > > > instruction on line 44, and the additional movq instructions on
> > > > lines 111, 151 and 152, really look weird.
> > > > 
> > > > Could you do the same test with the binary tarballs? If the
> > > > difference persists, you should open an issue on GitHub to track
> > > > this.
> > > > 
> > > > BTW, please wrap the first call in a function to ensure it is
> > > > specialized for the argument types, i.e.:
> > > > 
> > > > f(r, x) = r * x[1,:]
> > > > @code_native f(r, x)
> > > > 
> > > > Also, please check whether you still see the difference with this
> > > > code:
> > > > 
> > > > g(r, x) = r * x
> > > > @code_native g(r, x[1,:])
> > > > 
> > > > What are the types of r and x? Could you provide a simple
> > > > reproducible example with dummy values?
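> > > >
> > > > For instance, something as small as this should do (made-up
> > > > sizes, just for illustration):
> > > >
> > > > r = rand(6000, 3)   # dense Float64 matrix
> > > > x = rand(500, 3)    # dense Float64 matrix
> > > > f(r, x) = r * x[1,:]
> > > > @code_native f(r, x)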
> > > > 
> > > > > I will still test the binary tarballs. If I remove the cis()
> > > > > call, the difference is hard to tell: the loop is ~10 times
> > > > > faster and more or less all around 5ms. For the whole loop with
> > > > > the cis() call, the difference is ~50ms on the i5 vs. ~90ms on
> > > > > the i7.
> > > > >  
> > > > > Shall I also post the Julia 0.4 code?
> > > > If it's identical for all machines, I don't think it's needed.
> > > > 
> > > > 
> > > > Regards  
> > > > 
> > > > 
> > > > > cheers, Johannes  
> > > > >  
> > > > >  
> > > > >  
> > > > > > On Wednesday, March 30, 2016 at 15:16 -0700, Johannes Wagner wrote:
> > > > > > >   
> > > > > > >   
> > > > > > > > On Wednesday, March 30, 2016 at 04:43 -0700, Johannes Wagner wrote:
> > > > > > > > > Sorry for not having expressed myself clearly: I meant
> > > > > > > > > that the latest version of Fedora works fine (24,
> > > > > > > > > development). I always used the latest Julia nightly
> > > > > > > > > available on the copr nalimilan repo. Right now that is
> > > > > > > > > 0.5.0-dev+3292, Commit 9d527c5*; all use
> > > > > > > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
> > > > > > > > >    
> > > > > > > > > peakflops on all machines (hardware identical) is
> > > > > > > > > ~1.2-1.5e11.
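> > > > > > > > >
> > > > > > > > > (That figure presumably comes from the built-in helper,
> > > > > > > > > i.e. something like
> > > > > > > > > peakflops(2000)  # times a 2000x2000 double-precision matmul
> > > > > > > > > which mostly measures OpenBLAS rather than Julia itself.)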
> > > > > > > > >    
> > > > > > > > > Fedora 22 & 23 with Julia 0.5 are ~50% slower than 0.4;
> > > > > > > > > only on Fedora 24 is Julia 0.5 faster compared to Julia
> > > > > > > > > 0.4.
> > > > > > > > Could you try to find a simple piece of code that
> > > > > > > > reproduces the problem? In particular, it would be useful
> > > > > > > > to check whether this comes from OpenBLAS differences or
> > > > > > > > whether it also happens with pure Julia code (typical
> > > > > > > > operations which depend on BLAS are matrix multiplication,
> > > > > > > > as well as most of linear algebra). Normally, 0.4 and 0.5
> > > > > > > > should use the same BLAS, but who knows...
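> > > > > > > >
> > > > > > > > For instance, something like this would separate the two
> > > > > > > > (a sketch with arbitrary sizes):
> > > > > > > >
> > > > > > > > A = rand(500, 500); B = rand(500, 500)
> > > > > > > > @time A * B        # matrix multiplication goes through OpenBLAS
> > > > > > > > function mysum(A)  # pure Julia, no BLAS involved
> > > > > > > >     s = 0.0
> > > > > > > >     for x in A; s += x; end
> > > > > > > >     return s
> > > > > > > > end
> > > > > > > > @time mysum(A)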
> > > > > > > Well, that's what I did, and the 3 simple calls inside the
> > > > > > > loop are more or less the same speed. Only the whole loop
> > > > > > > seems slower. See my code sample from my answer of March 8th
> > > > > > > (the code gets faster in the same proportions when
> > > > > > > exp(im .* dotprods) is replaced by cis(dotprods)).
> > > > > > > So I don't know what I can do then...
> > > > > > Sorry, somehow I had missed that message. This indeed looks
> > > > > > like a code generation issue in Julia/LLVM.
> > > > > >  
> > > > > > > > Can you also confirm that all versioninfo() fields are the
> > > > > > > > same for all three machines, both for 0.4 and 0.5? We must
> > > > > > > > consider the possibility that the differences actually come
> > > > > > > > from 0.4.
> > > > > > > Ooh, right! I just noticed that my Fedora 24 machine was an
> > > > > > > Ivy Bridge, which works fast:
> > > > > > >   
> > > > > > > Julia Version 0.5.0-dev+3292   
> > > > > > > Commit 9d527c5* (2016-03-28 06:55 UTC)   
> > > > > > > Platform Info:   
> > > > > > >   System: Linux (x86_64-redhat-linux)   
> > > > > > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz   
> > > > > > >   WORD_SIZE: 64   
> > > > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > > > > > >   LAPACK: libopenblasp.so.0   
> > > > > > >   LIBM: libopenlibm   
> > > > > > >   LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)   
> > > > > > >   
> > > > > > > and the other ones with Fedora 22/23 are Haswell, which are
> > > > > > > slow:
> > > > > > >   
> > > > > > > Julia Version 0.5.0-dev+3292   
> > > > > > > Commit 9d527c5* (2016-03-28 06:55 UTC)   
> > > > > > > Platform Info:   
> > > > > > >   System: Linux (x86_64-redhat-linux)   
> > > > > > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz   
> > > > > > >   WORD_SIZE: 64   
> > > > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)   
> > > > > > >   LAPACK: libopenblasp.so.0   
> > > > > > >   LIBM: libopenlibm   
> > > > > > >   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)   
> > > > > > >   
> > > > > > > I just booted a Fedora 23 on the Ivy Bridge machine and it's
> > > > > > > also fast.
> > > > > > >     
> > > > > > > Now if I use Julia 0.4.5 on both architectures:
> > > > > > >   
> > > > > > > Julia Version 0.4.5   
> > > > > > > Commit 2ac304d* (2016-03-18 00:58 UTC)   
> > > > > > > Platform Info:   
> > > > > > >   System: Linux (x86_64-redhat-linux)   
> > > > > > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz   
> > > > > > >   WORD_SIZE: 64   
> > > > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)   
> > > > > > >   LAPACK: libopenblasp.so.0   
> > > > > > >   LIBM: libopenlibm   
> > > > > > >   LLVM: libLLVM-3.3   
> > > > > > >   
> > > > > > > and:   
> > > > > > >   
> > > > > > > Julia Version 0.4.5   
> > > > > > > Commit 2ac304d* (2016-03-18 00:58 UTC)   
> > > > > > > Platform Info:   
> > > > > > >   System: Linux (x86_64-redhat-linux)   
> > > > > > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz   
> > > > > > >   WORD_SIZE: 64   
> > > > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
> > > > > > >   LAPACK: libopenblasp.so.0   
> > > > > > >   LIBM: libopenlibm   
> > > > > > >   LLVM: libLLVM-3.3   
> > > > > > >   
> > > > > > > there is no speed difference, apart from the ~10% or so from
> > > > > > > the faster Haswell machine. So could this perhaps be specific
> > > > > > > to the Haswell hardware target, tied to the change from LLVM
> > > > > > > 3.3 to 3.7.1? Is there anything else I could provide?
> > > > > > This is certainly an interesting finding. Could you paste
> > > > > > somewhere the output of @code_native for your function on
> > > > > > Sandybridge vs. Haswell, for both 0.4 and 0.5?
> > > > > >  
> > > > > > It would also be useful to check whether the same difference
> > > > > > appears if you use the generic binary tarballs from
> > > > > > http://julialang.org/downloads .
> > > > > >  
> > > > > > Finally, do you get the same result if you remove the call to
> > > > > > exp() from the loop? (This is the only external function, so it
> > > > > > shouldn't be affected by changes in Julia.)
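> > > > > >
> > > > > > For that last check, a variant along these lines should do (a
> > > > > > hypothetical loop_noexp, keeping everything from your March 8th
> > > > > > sample except the complex exponential):
> > > > > >
> > > > > > function loop_noexp(r, x)
> > > > > >     result = zeros(size(x,1))
> > > > > >     for i = 1:size(x,1)
> > > > > >         dp = r * x[i,:]                # same BLAS product as before
> > > > > >         result[i] = sum(dp) * sum(dp)  # reductions kept, exp dropped
> > > > > >     end
> > > > > >     return result
> > > > > > end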
> > > > > >  
> > > > > >  
> > > > > > Regards   
> > > > > >  
> > > > > >  
> > > > > > > Best, Johannes   
> > > > > > >   
> > > > > > > >  Regards    
> > > > > > >   
> > > > > > >   
> > > > > > > > > On Wednesday, March 16, 2016 at 09:25 -0700, Johannes Wagner wrote:
> > > > > > > > > > Just a little update. I tested some other Fedoras:
> > > > > > > > > > Fedora 22 with LLVM 3.8 is also slow with Julia 0.5,
> > > > > > > > > > whereas a Fedora 24 branch with LLVM 3.7 is faster on
> > > > > > > > > > Julia 0.5 compared to Julia 0.4, as it should be (the
> > > > > > > > > > speedup in the inner loop parts translates into a
> > > > > > > > > > speedup of the whole function).
> > > > > > > > > >     
> > > > > > > > > > I don't know if anyone cares about that... At least the
> > > > > > > > > > latest version seems to work fine; I hope it stays like
> > > > > > > > > > this through to the final Fedora 24.
> > > > > > > > > What's the "latest version"? Git built from source or
> > > > > > > > > RPM nightlies? With which LLVM version for each?
> > > > > > > > >    
> > > > > > > > > If from the RPMs: I had switched them to LLVM 3.8 for a
> > > > > > > > > few days, and went back to 3.7 because of a build
> > > > > > > > > failure. So that might explain the difference. You can
> > > > > > > > > manually install the last version which built with LLVM
> > > > > > > > > 3.8 from here:
> > > > > > > > > https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
> > > > > > > > >    
> > > > > > > > > It would be interesting to compare it with the latest
> > > > > > > > > nightly with 3.7.
> > > > > > > > >    
> > > > > > > > >    
> > > > > > > > > Regards     
> > > > > > > > >    
> > > > > > > > >    
> > > > > > > > >    
> > > > > > > > > > > hey guys,
> > > > > > > > > > > I just experienced something weird. I have some code
> > > > > > > > > > > that runs fine on 0.4.3; then I updated to 0.5-dev to
> > > > > > > > > > > test the new Arrays, ran the same code, and noticed
> > > > > > > > > > > it got about ~50% slower. Then I downgraded back to
> > > > > > > > > > > 0.4.3 and ran the old code, but the speed remained
> > > > > > > > > > > slow. I noticed that while reinstalling 0.4.3,
> > > > > > > > > > > openblas-threads didn't get installed along with it.
> > > > > > > > > > > So I manually installed it, but no change.
> > > > > > > > > > > Does anyone have an idea what could be going on? LLVM
> > > > > > > > > > > on Fedora 23 is 3.7.
> > > > > > > > > > >
> > > > > > > > > > > Cheers, Johannes
> > > > > > > > > > >
