On Tuesday, April 5, 2016 at 10:18 -0700, Johannes Wagner wrote:
> hey Milan,
> so consider the following code:
> 
> Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git")
> using TimeIt
> 
> v = rand(3)        # random direction vector
> r = rand(6000,3)   # 6000 random 3-vectors
> x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))'   # 500x3: steps 0..10 along the unit vector of v
> 
> dotprods = r * x[2,:]                     # matrix-vector product (BLAS gemv)
> imexp    = cis(dotprods)                  # exp(im .* dotprods), elementwise
> sumprod  = sum(imexp) * sum(conj(imexp))  # |sum(imexp)|^2 as a complex product
> 
> f(r, x) = r * x[2,:]                         # slice + matrix-vector product
> g(r, x) = r * x'                             # full matrix-matrix product
> h(imexp)    = sum(imexp) * sum(conj(imexp))  # the two sums and their product
> 
> function s(r, x)
>         result = zeros(size(x,1))
>         for i = 1:size(x,1)
>                 imexp    = cis(r * x[i,:])
>                 result[i]= sum(imexp) * sum(conj(imexp))
>         end
>         return result
> end
> 
> @timeit zeros(size(x,1))
> @timeit f(r,x)
> @timeit g(r,x)
> @timeit cis(dotprods)
> @timeit h(imexp)
> @timeit s(r,x)
> 
> @code_native f(r,x)
> @code_native g(r,x)
> @code_native cis(dotprods)
> @code_native h(imexp)
> @code_native s(r,x)
> 
> and I attached the output of the last @code_native s(r,x) as text
> files, both for the binary tarball and for the latest nalimilan
> update. For the whole function s, the generated code actually looks
> the same everywhere. But s(r,x) is the one that is considerably slower
> on the i7 than on the i5, whereas all the other timed calls run at
> more or less the same speed on the i5 and the i7. Here are the
> timings, in the same order as above (all run repeatedly so that no
> compile time is included):
> 
> i7:
> 1000000 loops, best of 3: 871.68 ns per loop
> 10000 loops, best of 3: 10.84 µs per loop
> 100 loops, best of 3: 5.19 ms per loop
> 10000 loops, best of 3: 71.35 µs per loop
> 10000 loops, best of 3: 26.65 µs per loop
> 1 loops, best of 3: 159.99 ms per loop
> 
> i5:
> 100000 loops, best of 3: 1.01 µs per loop
> 10000 loops, best of 3: 10.93 µs per loop
> 100 loops, best of 3: 5.09 ms per loop
> 10000 loops, best of 3: 75.93 µs per loop
> 10000 loops, best of 3: 29.23 µs per loop
> 1 loops, best of 3: 103.70 ms per loop
> 
> So based on the individual calls inside s(r,x), the i7 should be
> faster, yet the whole s(r,x) is slower. Still clueless, and I don't
> know how to pin this down further...
Thanks. I think you got mixed up with the different files, as the
versioninfo() output indicates. Anyway, there's enough info to check
which file corresponds to which Julia version, so that's OK. Indeed,
when comparing the tests with the binary tarballs, there's a call
to jl_alloc_array_1d on the i7 (julia050_tarball-haswell-i7.txt) which
is not present on the i5 (incorrectly named julia050_haswell-i7.txt).
This is really unexpected.
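
If you want to confirm that extra allocation at runtime as well, a
quick check along these lines should do (just a sketch; run each call
once beforehand so that compilation isn't counted):

f(r, x); s(r, x)      # warm up so compilation isn't measured
@allocated f(r, x)    # bytes allocated by the slice + matvec
@allocated s(r, x)    # if the extra jl_alloc_array_1d is real, this
                      # should come out larger on the i7 build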

Could you file an issue on GitHub with a summary of what we've found
(essentially your message), as well as links to 3 Gists giving the code
and the contents of the two .txt files I mentioned above? That would be
very helpful. Do not mention the Fedora packages at all, as the binary
tarballs are closer to what Julia developers use.


Regards


> cheers, Johannes
> 
> 
> 
> 
> > On Monday, April 4, 2016 at 10:36 -0700, Johannes Wagner wrote:
> > > hey guys,
> > > so attached you find text files with the @code_native output for
> > > the instructions
> > > - r * x[1,:]
> > > - cis(dotprods)
> > > - sum(imexp) * sum(conj(imexp))
> > >
> > > for julia 0.5.
> > >
> > > The hardware I run on is a Haswell i5 machine, a Haswell i7
> > > machine, and an Ivy Bridge i5 machine. It turned out that on a
> > > Haswell i5 machine the code also runs fast; only the Haswell i7
> > > machine is the slow one. This really drove me nuts: first I thought
> > > it was the OS, then the architecture, and now it's just i5 vs.
> > > i7... Anyway, I don't know anything about x86 assembly, but the
> > > julia 0.4.5 code is the same on all machines. However, for the dot
> > > product, the 0.5 code already has 2 different instructions on the
> > > i5 vs. the i7 (lines 44 & 47), and likewise for the cis call (line
> > > 149...). And the Ivy Bridge i5 code is similar to the Haswell i5.
> > > I also included versioninfo() at the top of each file, so you
> > > could just look at a vimdiff of the julia0.5 files... Can anyone
> > > make sense of this?
> > I'm definitely not an expert in assembly, but that additional leaq
> > instruction on line 44, and the additional movq instructions on
> > lines 111, 151 and 152, really look weird.
> >
> > Could you do the same test with the binary tarballs? If the
> > difference persists, you should open an issue on GitHub to track
> > this.
> >
> > BTW, please wrap the first call in a function to ensure it is
> > specialized for the argument types, i.e.:
> >
> > f(r, x) = r * x[1,:]
> > @code_native f(r, x)
> >
> > Also, please check whether you still see the difference with this
> > code:
> >
> > g(r, x) = r * x
> > @code_native g(r, x[1,:])
> >
> > What are the types of r and x? Could you provide a simple
> > reproducible example with dummy values?
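> >
> > For instance, something like this would do (just a sketch with
> > made-up sizes, so that we can see the concrete types):
> >
> > r = rand(6000, 3)    # Matrix{Float64}? (sizes guessed)
> > x = rand(500, 3)     # Matrix{Float64}?
> > f(r, x) = r * x[1,:]
> > @code_native f(r, x)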
> > 
> > > The binary tarballs I will still test. If I remove the cis() call,
> > > the difference is hard to tell: the loop is ~10 times faster, and
> > > more or less everything is around 5 ms. For the whole loop with
> > > the cis() call, the difference is ~50 ms on the i5 vs. ~90 ms on
> > > the i7.
> > > 
> > > Shall I also post the julia 0.4 code? 
> > If it's identical for all machines, I don't think it's needed. 
> > 
> > 
> > Regards 
> > 
> > 
> > > cheers, Johannes 
> > > 
> > > 
> > > 
> > > > > On Wednesday, March 30, 2016 at 15:16 -0700, Johannes Wagner wrote:
> > > > >  
> > > > >  
> > > > > > On Wednesday, March 30, 2016 at 04:43 -0700, Johannes Wagner wrote:
> > > > > > > Sorry for not having expressed myself clearly: I meant that
> > > > > > > the latest version of Fedora (24, in development) works
> > > > > > > fine. I always used the latest julia nightly available in
> > > > > > > the copr nalimilan repo. Right now that is: 0.5.0-dev+3292,
> > > > > > > Commit 9d527c5*; all machines use LLVM: libLLVM-3.7.1
> > > > > > > (ORCJIT, haswell)
> > > > > > >   
> > > > > > > peakflops on all machines (hardware identical) is ~1.2..1.5e11.
> > > > > > >   
> > > > > > > Fedora 22 & 23 with julia 0.5 are ~50% slower than 0.4;
> > > > > > > only on Fedora 24 is julia 0.5 faster compared to julia 0.4.
> > > > > > Could you try to find a simple code to reproduce the problem?
> > > > > > In particular, it would be useful to check whether this comes
> > > > > > from OpenBLAS differences or whether it also happens with pure
> > > > > > Julia code (typical operations which depend on BLAS are matrix
> > > > > > multiplication, as well as most of linear algebra). Normally,
> > > > > > 0.4 and 0.5 should use the same BLAS, but who knows...
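> > > > > >
> > > > > > For example (a rough sketch): a plain matrix multiplication
> > > > > > goes through OpenBLAS, while a hand-written reduction stays in
> > > > > > pure Julia, so timing both on the two machines should tell the
> > > > > > two apart:
> > > > > >
> > > > > > a = rand(1000, 1000); b = rand(1000, 1000)
> > > > > > @time a * b        # BLAS path (dgemm)
> > > > > > function mysum(v)  # pure-Julia path, no BLAS involved
> > > > > >     s = 0.0
> > > > > >     for x in v
> > > > > >         s += x
> > > > > >     end
> > > > > >     return s
> > > > > > end
> > > > > > v = rand(10^7)
> > > > > > @time mysum(v)     # run twice so compilation isn't timed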
> > > > > Well, that's what I did, and the 3 simple calls inside the loop
> > > > > are more or less the same speed; only the whole loop seems
> > > > > slower. See my code sample from my answer of March 8th (the code
> > > > > gets faster in the same proportions when exp(im .* dotprods) is
> > > > > replaced by cis(dotprods)). So I don't know what I can do then...
> > > > Sorry, somehow I had missed that message. This indeed looks like
> > > > a code generation issue in Julia/LLVM.
> > > > 
> > > > > > Can you also confirm that all versioninfo() fields are the
> > > > > > same for all three machines, both for 0.4 and 0.5? We must
> > > > > > envision the possibility that the differences actually come
> > > > > > from 0.4.
> > > > > Ohoh, right! I just noticed that my Fedora 24 machine was an
> > > > > Ivy Bridge, which works fast:
> > > > >  
> > > > > Julia Version 0.5.0-dev+3292  
> > > > > Commit 9d527c5* (2016-03-28 06:55 UTC)  
> > > > > Platform Info:  
> > > > >   System: Linux (x86_64-redhat-linux)  
> > > > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz  
> > > > >   WORD_SIZE: 64  
> > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)  
> > > > >   LAPACK: libopenblasp.so.0  
> > > > >   LIBM: libopenlibm  
> > > > >   LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)  
> > > > >  
> > > > > and the other ones with Fedora 22/23 are Haswell, which run
> > > > > slow:
> > > > >  
> > > > > Julia Version 0.5.0-dev+3292  
> > > > > Commit 9d527c5* (2016-03-28 06:55 UTC)  
> > > > > Platform Info:  
> > > > >   System: Linux (x86_64-redhat-linux)  
> > > > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  
> > > > >   WORD_SIZE: 64  
> > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)  
> > > > >   LAPACK: libopenblasp.so.0  
> > > > >   LIBM: libopenlibm  
> > > > >   LLVM: libLLVM-3.7.1 (ORCJIT, haswell)  
> > > > >  
> > > > > I just booted a Fedora 23 on the Ivy Bridge machine and it's
> > > > > also fast.
> > > > >    
> > > > > Now if I use julia 0.4.5 on both architectures:
> > > > >  
> > > > > Julia Version 0.4.5  
> > > > > Commit 2ac304d* (2016-03-18 00:58 UTC)  
> > > > > Platform Info:  
> > > > >   System: Linux (x86_64-redhat-linux)  
> > > > >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz  
> > > > >   WORD_SIZE: 64  
> > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)  
> > > > >   LAPACK: libopenblasp.so.0  
> > > > >   LIBM: libopenlibm  
> > > > >   LLVM: libLLVM-3.3  
> > > > >  
> > > > > and:  
> > > > >  
> > > > > Julia Version 0.4.5  
> > > > > Commit 2ac304d* (2016-03-18 00:58 UTC)  
> > > > > Platform Info:  
> > > > >   System: Linux (x86_64-redhat-linux)  
> > > > >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz  
> > > > >   WORD_SIZE: 64  
> > > > >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)  
> > > > >   LAPACK: libopenblasp.so.0  
> > > > >   LIBM: libopenlibm  
> > > > >   LLVM: libLLVM-3.3  
> > > > >  
> > > > > there is no speed difference apart from the ~10% or so from
> > > > > the faster Haswell machine. So could it perhaps be specific to
> > > > > the Haswell hardware target, with the change from LLVM 3.3 to
> > > > > 3.7.1? Is there anything else I could provide?
> > > > This is certainly an interesting finding. Could you paste
> > > > somewhere the output of @code_native for your function on
> > > > Sandybridge vs. Haswell, for both 0.4 and 0.5?
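> > > >
> > > > To capture that to a file rather than scrolling the REPL,
> > > > something like this should work (a sketch; as far as I remember,
> > > > code_native accepts an IO as its first argument):
> > > >
> > > > open("codenative_haswell_05.txt", "w") do io
> > > >     code_native(io, s, Tuple{Matrix{Float64}, Matrix{Float64}})
> > > > end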
> > > > 
> > > > It would also be useful to check whether the same difference
> > > > appears if you use the generic binary tarballs from
> > > > http://julialang.org/downloads .
> > > > 
> > > > Finally, do you get the same result if you remove the call to
> > > > exp() from the loop? (This is the only external function, so it
> > > > shouldn't be affected by changes in Julia.)
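> > > >
> > > > Concretely, something along these lines (a sketch based on the
> > > > s() shown at the top of this thread, with the exp()/cis() call
> > > > dropped so that only the BLAS call and a reduction remain):
> > > >
> > > > function s_noexp(r, x)
> > > >     result = zeros(size(x, 1))
> > > >     for i = 1:size(x, 1)
> > > >         dotprods = r * x[i,:]      # same BLAS call as before
> > > >         result[i] = sum(dotprods)  # keep a reduction, skip exp()
> > > >     end
> > > >     return result
> > > > end
> > > > @time s_noexp(r, x)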
> > > > 
> > > > 
> > > > Regards  
> > > > 
> > > > 
> > > > > Best, Johannes  
> > > > >  
> > > > > >  Regards   
> > > > >  
> > > > >  
> > > > > > > On Wednesday, March 16, 2016 at 09:25 -0700, Johannes Wagner wrote:
> > > > > > > > Just a little update. I tested some other Fedoras: Fedora
> > > > > > > > 22 with LLVM 3.8 is also slow with julia 0.5, whereas a
> > > > > > > > Fedora 24 branch with LLVM 3.7 is faster on julia 0.5
> > > > > > > > compared to julia 0.4, as it should be (the speedup of the
> > > > > > > > inner loop parts translates into a speedup of the whole
> > > > > > > > function).
> > > > > > > >
> > > > > > > > Don't know if anyone cares about that... At least the
> > > > > > > > latest version seems to work fine; I hope it stays like
> > > > > > > > this into the final Fedora 24.
> > > > > > > What's the "latest version"? git built from source, or the
> > > > > > > RPM nightlies? With which LLVM version for each?
> > > > > > >
> > > > > > > If from the RPMs: I had switched them to LLVM 3.8 for a few
> > > > > > > days, and went back to 3.7 because of a build failure. So
> > > > > > > that might explain the difference. You can install the last
> > > > > > > version which built with LLVM 3.8 manually from here:
> > > > > > > https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
> > > > > > >
> > > > > > > It would be interesting to compare it with the latest
> > > > > > > nightly built with 3.7.
> > > > > > >
> > > > > > >
> > > > > > > Regards
> > > > > > >   
> > > > > > >   
> > > > > > >   
> > > > > > > > > hey guys,
> > > > > > > > > I just experienced something weird. I have some code
> > > > > > > > > that runs fine on 0.4.3; then I updated to 0.5-dev to
> > > > > > > > > test the new Arrays, ran the same code, and noticed it
> > > > > > > > > got about ~50% slower. Then I downgraded back to 0.4.3,
> > > > > > > > > ran the old code, but the speed remained slow. I noticed
> > > > > > > > > that while reinstalling 0.4.3, openblas-threads didn't
> > > > > > > > > get installed along with it. So I manually installed it,
> > > > > > > > > but there was no change.
> > > > > > > > > Does anyone have an idea what could be going on? LLVM on
> > > > > > > > > Fedora 23 is 3.7.
> > > > > > > > >
> > > > > > > > > Cheers, Johannes
> > > > > > > > >
