Re: [julia-users] regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-04-11 Thread Johannes Wagner
OK, but thanks a lot for the pointers! Now at least I can disable 
Hyperthreading and things run fast...

cheers, Johannes

On Wednesday, April 6, 2016 at 5:48:07 PM UTC+2, Milan Bouchet-Valat wrote:
>
> Le mercredi 06 avril 2016 à 17:02 +0200, Johannes Wagner a écrit : 
> > 
> > > 
> > > On 6 Apr 2016, at 4:46 PM, Milan Bouchet-Valat wrote: 
> > > 
> > > Le mercredi 06 avril 2016 à 07:25 -0700, Johannes Wagner a écrit : 
> > > > 
> > > > And one last update: I disabled Hyperthreading on the i7, and now it 
> > > > performs as expected. 
> > > > 
> > > > i7 no HT: 
> > > > 
> > > > 100 loops, best of 3: 879.57 ns per loop 
> > > > 10 loops, best of 3: 9.88 µs per loop 
> > > > 100 loops, best of 3: 4.46 ms per loop 
> > > > 1 loops, best of 3: 69.89 µs per loop 
> > > > 1 loops, best of 3: 26.67 µs per loop 
> > > > 10 loops, best of 3: 95.08 ms per loop 
> > > > 
> > > > i7 with HT: 
> > > > 
> > > > 100 loops, best of 3: 871.68 ns per loop 
> > > > 1 loops, best of 3: 10.84 µs per loop 
> > > > 100 loops, best of 3: 5.19 ms per loop 
> > > > 1 loops, best of 3: 71.35 µs per loop 
> > > > 1 loops, best of 3: 26.65 µs per loop 
> > > > 1 loops, best of 3: 159.99 ms per loop 
> > > > 
> > > > So all calls inside the loop are the same speed, but the whole loop, 
> > > > with identical assembly code, is ~60% slower if HT is enabled. Where 
> > > > can this problem arise from, then? LLVM, or thread pinning in the OS? 
> > > > Probably not a Julia problem then... 
> > > Indeed, in the last assembly output you sent, there are no differences 
> > > between i5 and i7 (as expected). So this isn't Julia's nor LLVM's 
> > > fault. No idea whether there might be an issue with the CPU itself, 
> but 
> > > it's quite surprising. 
> > Ran it on a 2nd i7 machine. Same behavior, so definitely not a faulty 
> > CPU. Do you have any other idea what to do? Just leaving it as is and 
> > using Julia with Hyperthreading disabled is not really 
> > satisfactory... 
> Sorry, I'm clueless. You could ask on forums dedicated to CPUs or on 
> Intel forums. You could also try with a different OS, just in case. 
>
>
>
> Regards 
>
> > > 
> > > Regards 
> > > 
> > > 
> > > > 
> > > > > 
> > > > > Le mardi 05 avril 2016 à 10:18 -0700, Johannes Wagner a écrit :  
> > > > > > 
> > > > > > hey Milan,  
> > > > > > so consider following code:  
> > > > > >   
> > > > > > Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git")  
> > > > > > using TimeIt  
> > > > > >   
> > > > > > v = rand(3)  
> > > > > > r = rand(6000,3)  
> > > > > > x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))'  
> > > > > >   
> > > > > > dotprods = r * x[2,:]  
> > > > > > imexp= cis(dotprods)  
> > > > > > sumprod  = sum(imexp) * sum(conj(imexp))  
> > > > > >   
> > > > > > f(r, x) = r * x[2,:]  
> > > > > > g(r, x) = r * x'  
> > > > > > h(imexp)= sum(imexp) * sum(conj(imexp))  
> > > > > >   
> > > > > > function s(r, x)  
> > > > > > result = zeros(size(x,1))  
> > > > > > for i = 1:size(x,1)  
> > > > > > imexp= cis(r * x[i,:])  
> > > > > > result[i]= sum(imexp) * sum(conj(imexp))  
> > > > > > end  
> > > > > > return result  
> > > > > > end  
> > > > > >   
> > > > > > @timeit zeros(size(x,1))  
> > > > > > @timeit f(r,x)  
> > > > > > @timeit g(r,x)  
> > > > > > @timeit cis(dotprods)  
> > > > > > @timeit h(imexp)  
> > > > > > @timeit s(r,x)  
> > > > > >   
> > > > > > @code_native f(r,x)  
> > > > > > @code_native g(r,x)  
> > > > > > @code_native cis(dotprods)  
> > > > > > @code_native h(imexp)  
> > > > > > @code_native s(r,x)  
> > > > >

Re: [julia-users] regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-04-06 Thread Johannes Wagner


> On 6 Apr 2016, at 4:46 PM, Milan Bouchet-Valat wrote:
> 
> Le mercredi 06 avril 2016 à 07:25 -0700, Johannes Wagner a écrit :
>> And one last update: I disabled Hyperthreading on the i7, and now it
>> performs as expected.
>> 
>> i7 no HT:
>> 
>> 100 loops, best of 3: 879.57 ns per loop
>> 10 loops, best of 3: 9.88 µs per loop
>> 100 loops, best of 3: 4.46 ms per loop
>> 1 loops, best of 3: 69.89 µs per loop
>> 1 loops, best of 3: 26.67 µs per loop
>> 10 loops, best of 3: 95.08 ms per loop
>> 
>> i7 with HT:
>> 
>> 100 loops, best of 3: 871.68 ns per loop
>> 1 loops, best of 3: 10.84 µs per loop
>> 100 loops, best of 3: 5.19 ms per loop
>> 1 loops, best of 3: 71.35 µs per loop
>> 1 loops, best of 3: 26.65 µs per loop
>> 1 loops, best of 3: 159.99 ms per loop
>> 
>> So all calls inside the loop are the same speed, but the whole loop,
>> with identical assembly code, is ~60% slower if HT is enabled. Where
>> can this problem arise from, then? LLVM, or thread pinning in the OS?
>> Probably not a Julia problem then...
> Indeed, in the last assembly output you sent, there are no differences
> between i5 and i7 (as expected). So this isn't Julia's nor LLVM's
> fault. No idea whether there might be an issue with the CPU itself, but
> it's quite surprising.

Ran it on a 2nd i7 machine. Same behavior, so definitely not a faulty CPU. Do 
you have any other idea what to do? Just leaving it as is and using Julia with 
Hyperthreading disabled is not really satisfactory...


> Regards
> 
> 
>>> Le mardi 05 avril 2016 à 10:18 -0700, Johannes Wagner a écrit : 
>>>> hey Milan, 
>>>> so consider following code: 
>>>>  
>>>> Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git") 
>>>> using TimeIt 
>>>>  
>>>> v = rand(3) 
>>>> r = rand(6000,3) 
>>>> x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))' 
>>>>  
>>>> dotprods = r * x[2,:] 
>>>> imexp= cis(dotprods) 
>>>> sumprod  = sum(imexp) * sum(conj(imexp)) 
>>>>  
>>>> f(r, x) = r * x[2,:] 
>>>> g(r, x) = r * x' 
>>>> h(imexp)= sum(imexp) * sum(conj(imexp)) 
>>>>  
>>>> function s(r, x) 
>>>> result = zeros(size(x,1)) 
>>>> for i = 1:size(x,1) 
>>>> imexp= cis(r * x[i,:]) 
>>>> result[i]= sum(imexp) * sum(conj(imexp)) 
>>>> end 
>>>> return result 
>>>> end 
>>>>  
>>>> @timeit zeros(size(x,1)) 
>>>> @timeit f(r,x) 
>>>> @timeit g(r,x) 
>>>> @timeit cis(dotprods) 
>>>> @timeit h(imexp) 
>>>> @timeit s(r,x) 
>>>>  
>>>> @code_native f(r,x) 
>>>> @code_native g(r,x) 
>>>> @code_native cis(dotprods) 
>>>> @code_native h(imexp) 
>>>> @code_native s(r,x) 
>>>>  
>>>> and I attached the output of the last @code_native s(r,x) as
>>> text 
>>>> files for the binary tarball, as well as the latest nalimilan
>>> update. 
>>>> For the whole function s, the exported code looks actually the
>>> same 
>>>> everywhere. 
>>>> But s(r,x) is the one that is considerably slower on the i7 than
>>> the 
>>>> i5, whereas all the other timed calls are more or less same speed
>>> on 
>>>> i5 and i7. Here are the timings in the same order as above (all
>>> run 
>>>> repeatedly to not have compile time in it for last one): 
>>>>  
>>>> i7: 
>>>> 100 loops, best of 3: 871.68 ns per loop 
>>>> 1 loops, best of 3: 10.84 µs per loop 
>>>> 100 loops, best of 3: 5.19 ms per loop 
>>>> 1 loops, best of 3: 71.35 µs per loop 
>>>> 1 loops, best of 3: 26.65 µs per loop 
>>>> 1 loops, best of 3: 159.99 ms per loop 
>>>>  
>>>> i5: 
>>>> 10 loops, best of 3: 1.01 µs per loop 
>>>> 1 loops, best of 3: 10.93 µs per loop 
>>>> 100 loops, best of 3: 5.09 ms per loop 
>>>> 1 loops, best of 3: 75.93 µs per loop 
>>>> 1 loops, best of 3: 29.23 µs per loop 
>>>> 1 loops, best of 3: 103.70 ms per loop 
>>>>  
>>>> So based on inside s(r,x) calls, the i7 should be faster, but
>>> the 
>>>> whole s(r,x) is slower.

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-04-06 Thread Johannes Wagner
And one last update: I disabled Hyperthreading on the i7, and now it performs 
as expected.

i7 no HT:

100 loops, best of 3: 879.57 ns per loop
10 loops, best of 3: 9.88 µs per loop
100 loops, best of 3: 4.46 ms per loop
1 loops, best of 3: 69.89 µs per loop
1 loops, best of 3: 26.67 µs per loop
10 loops, best of 3: 95.08 ms per loop

i7 with HT:

100 loops, best of 3: 871.68 ns per loop
1 loops, best of 3: 10.84 µs per loop
100 loops, best of 3: 5.19 ms per loop
1 loops, best of 3: 71.35 µs per loop
1 loops, best of 3: 26.65 µs per loop
1 loops, best of 3: 159.99 ms per loop

So all calls inside the loop are the same speed, but the whole loop, with 
identical assembly code, is ~60% slower if HT is enabled. Where can this 
problem arise from, then? LLVM, or thread pinning in the OS? Probably not a 
Julia problem then...



On Tuesday, April 5, 2016 at 7:54:16 PM UTC+2, Milan Bouchet-Valat wrote:
>
> Le mardi 05 avril 2016 à 10:18 -0700, Johannes Wagner a écrit : 
> > hey Milan, 
> > so consider following code: 
> > 
> > Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git") 
> > using TimeIt 
> > 
> > v = rand(3) 
> > r = rand(6000,3) 
> > x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))' 
> > 
> > dotprods = r * x[2,:] 
> > imexp= cis(dotprods) 
> > sumprod  = sum(imexp) * sum(conj(imexp)) 
> > 
> > f(r, x) = r * x[2,:] 
> > g(r, x) = r * x' 
> > h(imexp)= sum(imexp) * sum(conj(imexp)) 
> > 
> > function s(r, x) 
> > result = zeros(size(x,1)) 
> > for i = 1:size(x,1) 
> > imexp= cis(r * x[i,:]) 
> > result[i]= sum(imexp) * sum(conj(imexp)) 
> > end 
> > return result 
> > end 
> > 
> > @timeit zeros(size(x,1)) 
> > @timeit f(r,x) 
> > @timeit g(r,x) 
> > @timeit cis(dotprods) 
> > @timeit h(imexp) 
> > @timeit s(r,x) 
> > 
> > @code_native f(r,x) 
> > @code_native g(r,x) 
> > @code_native cis(dotprods) 
> > @code_native h(imexp) 
> > @code_native s(r,x) 
> > 
> > and I attached the output of the last @code_native s(r,x) as text 
> > files for the binary tarball, as well as the latest nalimilan update. 
> > For the whole function s, the exported code looks actually the same 
> > everywhere. 
> > But s(r,x) is the one that is considerably slower on the i7 than the 
> > i5, whereas all the other timed calls are more or less same speed on 
> > i5 and i7. Here are the timings in the same order as above (all run 
> > repeatedly to not have compile time in it for last one): 
> > 
> > i7: 
> > 100 loops, best of 3: 871.68 ns per loop 
> > 1 loops, best of 3: 10.84 µs per loop 
> > 100 loops, best of 3: 5.19 ms per loop 
> > 1 loops, best of 3: 71.35 µs per loop 
> > 1 loops, best of 3: 26.65 µs per loop 
> > 1 loops, best of 3: 159.99 ms per loop 
> > 
> > i5: 
> > 10 loops, best of 3: 1.01 µs per loop 
> > 1 loops, best of 3: 10.93 µs per loop 
> > 100 loops, best of 3: 5.09 ms per loop 
> > 1 loops, best of 3: 75.93 µs per loop 
> > 1 loops, best of 3: 29.23 µs per loop 
> > 1 loops, best of 3: 103.70 ms per loop 
> > 
> > So based on inside s(r,x) calls, the i7 should be faster, but the 
> > whole s(r,x) is slower. Still clueless... And don't know how to 
> > further pin this down... 
> Thanks. I think you got mixed up with the different files, as the 
> versioninfo() output indicates. Anyway, there's enough info to check 
> which file corresponds to which Julia version, so that's OK. Indeed, 
> when comparing the tests with binary tarballs, there's a call 
> to jl_alloc_array_1d with the i7 (julia050_tarball-haswell-i7.txt), 
> which is not present with the i5 (incorrectly named julia050_haswell- 
> i7.txt). This is really unexpected. 
>
> Could you file an issue on GitHub with a summary of what we've found 
> (essentially your message), as well as links to 3 Gists giving the code 
> and the contents of the two .txt files I mentioned above? That would be 
> very helpful. Do not mention the Fedora packages at all, as the binary 
> tarballs are closer to what Julia developers use. 
>
>
> Regards 
>
>
> > cheers, Johannes 
> > 
> > 
> > 
> > 
> > > Le lundi 04 avril 2016 à 10:36 -0700, Johannes Wagner a écrit :  
> > > > hey guys,  
> > > > so attached you find text files with @code_native output for the  
> > > > instructions   
> > > > - r * x[1,:]  
> > > > - cis(imexp)

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-04-06 Thread Johannes Wagner


On Tuesday, April 5, 2016 at 7:54:16 PM UTC+2, Milan Bouchet-Valat wrote:
>
> Le mardi 05 avril 2016 à 10:18 -0700, Johannes Wagner a écrit : 
> > hey Milan, 
> > so consider following code: 
> > 
> > Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git") 
> > using TimeIt 
> > 
> > v = rand(3) 
> > r = rand(6000,3) 
> > x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))' 
> > 
> > dotprods = r * x[2,:] 
> > imexp= cis(dotprods) 
> > sumprod  = sum(imexp) * sum(conj(imexp)) 
> > 
> > f(r, x) = r * x[2,:] 
> > g(r, x) = r * x' 
> > h(imexp)= sum(imexp) * sum(conj(imexp)) 
> > 
> > function s(r, x) 
> > result = zeros(size(x,1)) 
> > for i = 1:size(x,1) 
> > imexp= cis(r * x[i,:]) 
> > result[i]= sum(imexp) * sum(conj(imexp)) 
> > end 
> > return result 
> > end 
> > 
> > @timeit zeros(size(x,1)) 
> > @timeit f(r,x) 
> > @timeit g(r,x) 
> > @timeit cis(dotprods) 
> > @timeit h(imexp) 
> > @timeit s(r,x) 
> > 
> > @code_native f(r,x) 
> > @code_native g(r,x) 
> > @code_native cis(dotprods) 
> > @code_native h(imexp) 
> > @code_native s(r,x) 
> > 
> > and I attached the output of the last @code_native s(r,x) as text 
> > files for the binary tarball, as well as the latest nalimilan update. 
> > For the whole function s, the exported code looks actually the same 
> > everywhere. 
> > But s(r,x) is the one that is considerably slower on the i7 than the 
> > i5, whereas all the other timed calls are more or less same speed on 
> > i5 and i7. Here are the timings in the same order as above (all run 
> > repeatedly to not have compile time in it for last one): 
> > 
> > i7: 
> > 100 loops, best of 3: 871.68 ns per loop 
> > 1 loops, best of 3: 10.84 µs per loop 
> > 100 loops, best of 3: 5.19 ms per loop 
> > 1 loops, best of 3: 71.35 µs per loop 
> > 1 loops, best of 3: 26.65 µs per loop 
> > 1 loops, best of 3: 159.99 ms per loop 
> > 
> > i5: 
> > 10 loops, best of 3: 1.01 µs per loop 
> > 1 loops, best of 3: 10.93 µs per loop 
> > 100 loops, best of 3: 5.09 ms per loop 
> > 1 loops, best of 3: 75.93 µs per loop 
> > 1 loops, best of 3: 29.23 µs per loop 
> > 1 loops, best of 3: 103.70 ms per loop 
> > 
> > So based on inside s(r,x) calls, the i7 should be faster, but the 
> > whole s(r,x) is slower. Still clueless... And don't know how to 
> > further pin this down... 
> Thanks. I think you got mixed up with the different files, as the 
> versioninfo() output indicates. Anyway, there's enough info to check 
> which file corresponds to which Julia version, so that's OK. Indeed, 
> when comparing the tests with binary tarballs, there's a call 
> to jl_alloc_array_1d with the i7 (julia050_tarball-haswell-i7.txt), 
> which is not present with the i5 (incorrectly named julia050_haswell- 
> i7.txt). This is really unexpected. 
>

I'm afraid not. The filename was correct; the header was wrong. The difference 
in instructions for the whole loop is between the tarball and the nalimilan 
repo. See the attached file (double-checked) again. Despite the assembly 
differences, the tarball and nalimilan Julia behave the same and run at the 
same speed on the i5. Same for the i7: both are slower. The tarball Julia 0.50 
just seems a tad faster (2-5%) on both the i5 and the i7.
 

> Could you file an issue on GitHub with a summary of what we've found 
> (essentially your message), as well as links to 3 Gists giving the code 
> and the contents of the two .txt files I mentioned above? That would be 
> very helpful. Do not mention the Fedora packages at all, as the binary 
> tarballs are closer to what Julia developers use. 
>
>
> Regards 
>
>
> > cheers, Johannes 
> > 
> > 
> > 
> > 
> > > Le lundi 04 avril 2016 à 10:36 -0700, Johannes Wagner a écrit :  
> > > > hey guys,  
> > > > so attached you find text files with @code_native output for the  
> > > > instructions   
> > > > - r * x[1,:]  
> > > > - cis(imexp)  
> > > > - sum(imexp) * sum(conj(imexp))  
> > > >  
> > > > for julia 0.5.   
> > > >  
> > > > Hardware I run on is a Haswell i5 machine, a Haswell i7 machine, 
> > > and  
> > > > a IvyBridge i5 machine. Turned out on an Haswell i5 machine the 
> > > code  
> > > > also runs fast. Only the Haswell i7 machine is the slow one. 

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-04-05 Thread Johannes Wagner
hey Milan,
so consider following code:

Pkg.clone("git://github.com/kbarbary/TimeIt.jl.git")
using TimeIt

v = rand(3)
r = rand(6000,3)
x = linspace(0.0, 10.0, 500) * (v./sqrt(sumabs2(v)))'

dotprods = r * x[2,:]
imexp= cis(dotprods)
sumprod  = sum(imexp) * sum(conj(imexp))

f(r, x) = r * x[2,:]
g(r, x) = r * x'
h(imexp)= sum(imexp) * sum(conj(imexp))

function s(r, x)
result = zeros(size(x,1))
for i = 1:size(x,1)
imexp= cis(r * x[i,:])
result[i]= sum(imexp) * sum(conj(imexp))
end
return result
end

@timeit zeros(size(x,1))
@timeit f(r,x)
@timeit g(r,x)
@timeit cis(dotprods)
@timeit h(imexp)
@timeit s(r,x)

@code_native f(r,x)
@code_native g(r,x)
@code_native cis(dotprods)
@code_native h(imexp)
@code_native s(r,x)

and I attached the output of the last @code_native s(r,x) as text files for 
the binary tarball, as well as the latest nalimilan update. For the whole 
function s, the generated code actually looks the same everywhere.
But s(r,x) is the one that is considerably slower on the i7 than the i5, 
whereas all the other timed calls run at more or less the same speed on i5 and 
i7. Here are the timings in the same order as above (all run repeatedly so 
that compile time is not included in the last one):

i7:
100 loops, best of 3: 871.68 ns per loop
1 loops, best of 3: 10.84 µs per loop
100 loops, best of 3: 5.19 ms per loop
1 loops, best of 3: 71.35 µs per loop
1 loops, best of 3: 26.65 µs per loop
1 loops, best of 3: 159.99 ms per loop

i5:
10 loops, best of 3: 1.01 µs per loop
1 loops, best of 3: 10.93 µs per loop
100 loops, best of 3: 5.09 ms per loop
1 loops, best of 3: 75.93 µs per loop
1 loops, best of 3: 29.23 µs per loop
1 loops, best of 3: 103.70 ms per loop

So based on the calls inside s(r,x), the i7 should be faster, but the whole 
s(r,x) is slower. Still clueless... and I don't know how to pin this down 
further...

cheers, Johannes




On Monday, April 4, 2016 at 10:48:40 PM UTC+2, Milan Bouchet-Valat wrote:
>
> Le lundi 04 avril 2016 à 10:36 -0700, Johannes Wagner a écrit : 
> > hey guys, 
> > so attached you find text files with @code_native output for the 
> > instructions  
> > - r * x[1,:] 
> > - cis(imexp) 
> > - sum(imexp) * sum(conj(imexp)) 
> > 
> > for julia 0.5.  
> > 
> > Hardware I run on is a Haswell i5 machine, a Haswell i7 machine, and 
> > a IvyBridge i5 machine. Turned out on an Haswell i5 machine the code 
> > also runs fast. Only the Haswell i7 machine is the slow one. This 
> > really drove me nuts. First I thought it was the OS, then the 
> > architecture, and now its just from i5 to i7 Anyways, I don't 
> > know anything about x86 assembly, but the julia 0.45 code is the same 
> > on all machines. However, for the dot product, the 0.5 code has 
> > already 2 different instructions on the i5 vs. the i7 (line 44&47). 
> > For the cis call also (line 149...). And the IvyBridge i5 code is 
> > similar to the Haswell i5. I included also versioninfo() at the top 
> > of the file. So you could just look at a vimdiff of the julia0.5 
> > files... Can anyone make sense out of this? 
> I'm definitely not an expert in assembly, but that additional leaq 
> instruction on line 44, and the additional movq instructions on line 
> 111, 151 and 152 really look weird 
>
> Could you do the same test with the binary tarballs? If the difference 
> persists, you should open an issue on GitHub to track this. 
>
> BTW, please wrap the fist call in a function to ensure it is 
> specialized for the arguments types, i.e.: 
>
> f(r, x) = r * x[1,:] 
> @code_native f(r, x) 
>
> Also, please check whether you still see the difference with this code: 
> g(r, x) = r * x 
> @code_native g(r, x[1,:]) 
>
> What are the types of r and x? Could you provide a simple reproducible 
> example with dummy values? 
>
> > The binary tarballs I will still test. If I remove the cis() call, 
> > the difference is hard to tell, the loop is ~10times faster and more 
> > or less all around 5ms. For the whole loop with cis() call, from i5 
> > to i7 the difference is ~ 50ms on i5 to 90ms on i7. 
> > 
> > Shall I also post the julia 0.4 code? 
> If it's identical for all machines, I don't think it's needed. 
>
>
> Regards 
>
>
> > cheers, Johannes 
> > 
> > 
> > 
> > > Le mercredi 30 mars 2016 à 15:16 -0700, Johannes Wagner a écrit :  
> > > >  
> > > >  
> > > > > Le mercredi 30 mars 2016 à 04:43 -0700, Johannes Wagner a 
> écrit :   
> > > > > > Sorry for not having expressed myself clearly, I meant the

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-04-04 Thread Johannes Wagner
hey guys,
so attached you find text files with @code_native output for the 
instructions 
- r * x[1,:]
- cis(imexp)
- sum(imexp) * sum(conj(imexp))

for julia 0.5. 

The hardware I run on is a Haswell i5 machine, a Haswell i7 machine, and an 
Ivy Bridge i5 machine. It turned out the code also runs fast on the Haswell i5 
machine; only the Haswell i7 machine is slow. This really drove me nuts. First 
I thought it was the OS, then the architecture, and now it's just i5 vs. 
i7... Anyway, I don't know anything about x86 assembly, but the Julia 0.4.5 
code is the same on all machines. However, for the dot product, the 0.5 code 
already has 2 different instructions on the i5 vs. the i7 (lines 44 & 47), and 
likewise for the cis call (line 149...). And the Ivy Bridge i5 code is similar 
to the Haswell i5. I also included versioninfo() at the top of each file, so 
you could just look at a vimdiff of the julia0.5 files... Can anyone make 
sense of this?

I will still test the binary tarballs. If I remove the cis() call, the 
difference is hard to tell: the loop is ~10 times faster and more or less 
always around 5 ms. For the whole loop with the cis() call, the difference is 
~50 ms on the i5 vs. ~90 ms on the i7.

Shall I also post the julia 0.4 code?

cheers, Johannes



On Thursday, March 31, 2016 at 10:27:11 AM UTC+2, Milan Bouchet-Valat wrote:
>
> Le mercredi 30 mars 2016 à 15:16 -0700, Johannes Wagner a écrit : 
> > 
> > 
> > > Le mercredi 30 mars 2016 à 04:43 -0700, Johannes Wagner a écrit :  
> > > > Sorry for not having expressed myself clearly, I meant the latest  
> > > > version of fedora to work fine (24 development). I always used the  
> > > > latest julia nightly available on the copr nalimilan repo. Right 
> now  
> > > > that is: 0.5.0-dev+3292, Commit 9d527c5*, all use  
> > > > LLVM: libLLVM-3.7.1 (ORCJIT, haswell)  
> > > >  
> > > > peakflops on all machines (hardware identical) is ~1.2 to 1.5e11. 
> > > >
> > > > Fedora 22&23 with julia 0.5 is ~50% slower then 0.4, only on fedora  
> > > > 24 julia 0.5 is  faster compared to julia 0.4.  
> > > Could you try to find a simple code to reproduce the problem? In  
> > > particular, it would be useful to check whether this comes from  
> > > OpenBLAS differences or whether it also happens with pure Julia code  
> > > (typical operations which depend on BLAS are matrix multiplication, 
> as  
> > > well as most of linear algebra). Normally, 0.4 and 0.5 should use the  
> > > same BLAS, but who knows...  
> > well thats what I did, and the 3 simple calls inside the loop are 
> > more or less same speed. only the whole loop seems slower. See my 
> > code sample fromanswer march 8th (code gets in same proportions 
> > faster when exp(im .* dotprods) is replaced by cis(dotprods) ).  
> > So I don't know what I can do then...   
> Sorry, somehow I had missed that message. This indeed looks like a code 
> generation issue in Julia/LLVM. 
>
> > > Can you also confirm that all versioninfo() fields are the same for 
> all  
> > > three machines, both for 0.4 and 0.5? We must envision the 
> possibility  
> > > that the differences actually come from 0.4.  
> > ohoh, right! just noticed that my fedora 24 machine was an ivy bridge 
> > which works fast: 
> > 
> > Julia Version 0.5.0-dev+3292 
> > Commit 9d527c5* (2016-03-28 06:55 UTC) 
> > Platform Info: 
> >   System: Linux (x86_64-redhat-linux) 
> >   CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz 
> >   WORD_SIZE: 64 
> >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge) 
> >   LAPACK: libopenblasp.so.0 
> >   LIBM: libopenlibm 
> >   LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge) 
> > 
> > and the other ones with fed22/23 are haswell, which work slow: 
> > 
> > Julia Version 0.5.0-dev+3292 
> > Commit 9d527c5* (2016-03-28 06:55 UTC) 
> > Platform Info: 
> >   System: Linux (x86_64-redhat-linux) 
> >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz 
> >   WORD_SIZE: 64 
> >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell) 
> >   LAPACK: libopenblasp.so.0 
> >   LIBM: libopenlibm 
> >   LLVM: libLLVM-3.7.1 (ORCJIT, haswell) 
> > 
> > I just booted an fedora 23 on the ivy bridge machine and it's also 
> fast.  
> >
> > Now if I use julia 0.45 on both architectures: 
> > 
> > Julia Version 0.4.5 
> > Commit 2ac304d* (2016-03-18 00:58 UTC) 
> > Platform Info: 
> >   System: Linux (x86_64-redhat-linux) 
> >   CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz 
> >   WORD_SIZE: 64 
> >   BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell) 

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-03-30 Thread Johannes Wagner


On Wednesday, March 30, 2016 at 1:58:23 PM UTC+2, Milan Bouchet-Valat wrote:
>
> Le mercredi 30 mars 2016 à 04:43 -0700, Johannes Wagner a écrit : 
> > Sorry for not having expressed myself clearly, I meant the latest 
> > version of fedora to work fine (24 development). I always used the 
> > latest julia nightly available on the copr nalimilan repo. Right now 
> > that is: 0.5.0-dev+3292, Commit 9d527c5*, all use 
> > LLVM: libLLVM-3.7.1 (ORCJIT, haswell) 
> > 
> > peakflops on all machines (hardware identical) is ~1.2..1.5e11.   
> > 
> > Fedora 22&23 with julia 0.5 is ~50% slower then 0.4, only on fedora 
> > 24 julia 0.5 is  faster compared to julia 0.4. 
> Could you try to find a simple code to reproduce the problem? In 
> particular, it would be useful to check whether this comes from 
> OpenBLAS differences or whether it also happens with pure Julia code 
> (typical operations which depend on BLAS are matrix multiplication, as 
> well as most of linear algebra). Normally, 0.4 and 0.5 should use the 
> same BLAS, but who knows... 
>

Well, that's what I did, and the 3 simple calls inside the loop run at more or 
less the same speed; only the whole loop seems slower. See my code sample from 
the March 8th answer (the code gets proportionally faster when 
exp(im .* dotprods) is replaced by cis(dotprods)). 
So I don't know what I can do then... 
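For anyone following along: cis(x) computes exp(im*x) directly from cos(x) + im*sin(x), avoiding the general complex exp path. A quick hedged sanity check of the equivalence, in 0.4/0.5-era vectorized style (where cis maps over arrays; later Julia would write cis.(dotprods)):

```julia
dotprods = rand(5)
a = exp(im .* dotprods)               # goes through complex exp
b = cis(dotprods)                     # direct cos(x) + im*sin(x)
@assert maximum(abs(a - b)) < 1e-12   # same values up to rounding
```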

Can you also confirm that all versioninfo() fields are the same for all 
> three machines, both for 0.4 and 0.5? We must envision the possibility 
> that the differences actually come from 0.4. 


Oh, right! I just noticed that my Fedora 24 machine was an Ivy Bridge, which 
works fast:

Julia Version 0.5.0-dev+3292
Commit 9d527c5* (2016-03-28 06:55 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)

and the other ones with fed22/23 are haswell, which work slow:

Julia Version 0.5.0-dev+3292
Commit 9d527c5* (2016-03-28 06:55 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

I just booted a Fedora 23 on the Ivy Bridge machine, and it's also fast. 
 
Now if I use Julia 0.4.5 on both architectures:

Julia Version 0.4.5
Commit 2ac304d* (2016-03-18 00:58 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

and:

Julia Version 0.4.5
Commit 2ac304d* (2016-03-18 00:58 UTC)
Platform Info:
  System: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i5-3550 CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblasp.so.0
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

there is no speed difference apart from the ~10% or so from the faster Haswell 
machine. So could this perhaps be specific to the Haswell hardware target, 
tied to the change from LLVM 3.3 to 3.7.1? Is there anything else I could 
provide?
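One hedged way to probe the LLVM-target hypothesis (my suggestion, not something tried in the thread) is to force the same codegen target on both machines at startup and re-time s(r,x); the -C/--cpu-target startup flag exists in this era:

```julia
# Launch with, e.g.:  julia --cpu-target=ivybridge
# so the Haswell i7 generates the same code that is fast on the Ivy Bridge i5.
# Inside the session, the verbose form of versioninfo prints the LLVM target:
versioninfo(true)
```

If the i7 becomes fast under the Ivy Bridge target, the regression is in the Haswell-specific instruction selection.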

Best, Johannes

 Regards 


>
> > > Le mercredi 16 mars 2016 à 09:25 -0700, Johannes Wagner a écrit :  
> > > > just a little update. Tested some other fedoras: Fedora 22 with 
> llvm  
> > > > 3.8 is also slow with julia 0.5, whereas a fedora 24 branch with 
> llvm  
> > > > 3.7 is faster on julia 0.5 compared to julia 0.4, as it should be  
> > > > (speedup from inner loop parts translated into speedup to whole  
> > > > function).  
> > > >  
> > > > don't know if anyone cares about that... At least the latest 
> version  
> > > > seems to work fine, hope it stays like this into the final fedora 
> 24  
> > > What's the "latest version"? git built from source or RPM nightlies?  
> > > With which LLVM version for each?  
> > > 
> > > If from the RPMs, I've switched them to LLVM 3.8 for a few days, and  
> > > went back to 3.7 because of a build failure. So that might explain 
> the  
> > > difference. You can install the last version which built with LLVM 
> 3.8  
> > > manually from here:  
> > > 
> https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
>   
>
> > > 
> > > It would be interesting to compare it with the latest nightly with 
> 3.7.  
> > > 
> > > 
> > > Regards  
> > > 
>

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-03-30 Thread Johannes Wagner
Sorry for not expressing myself clearly: I meant that the latest version of 
Fedora works fine (24, development). I always used the latest Julia nightly 
available from the copr nalimilan repo. Right now that is 0.5.0-dev+3292, 
Commit 9d527c5*; all use
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

peakflops on all machines (hardware identical) is ~1.2 to 1.5e11.

On Fedora 22 & 23, Julia 0.5 is ~50% slower than 0.4; only on Fedora 24 is 
Julia 0.5 faster than Julia 0.4.
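As context for the peakflops comparison: peakflops(n) times a Float64 n-by-n matrix multiply, so it exercises OpenBLAS's dgemm almost exclusively. Comparable values across Julia versions are therefore consistent with the regression living in Julia/LLVM code generation rather than in BLAS. A minimal sketch:

```julia
# peakflops(n) multiplies two n-by-n Float64 matrices and reports
# floating-point operations per second (2n^3 flops per product).
println(peakflops(2000))   # ~1.2e11 to 1.5e11 on the machines in this thread
```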


On Wednesday, March 16, 2016 at 7:34:28 PM UTC+1, Milan Bouchet-Valat wrote:
>
> Le mercredi 16 mars 2016 à 09:25 -0700, Johannes Wagner a écrit : 
> > just a little update. Tested some other fedoras: Fedora 22 with llvm 
> > 3.8 is also slow with julia 0.5, whereas a fedora 24 branch with llvm 
> > 3.7 is faster on julia 0.5 compared to julia 0.4, as it should be 
> > (speedup from inner loop parts translated into speedup to whole 
> > function). 
> > 
> > don't know if anyone cares about that... At least the latest version 
> > seems to work fine, hope it stays like this into the final fedora 24 
> What's the "latest version"? git built from source or RPM nightlies? 
> With which LLVM version for each? 
>
> If from the RPMs, I've switched them to LLVM 3.8 for a few days, and 
> went back to 3.7 because of a build failure. So that might explain the 
> difference. You can install the last version which built with LLVM 3.8 
> manually from here: 
>
> https://copr-be.cloud.fedoraproject.org/results/nalimilan/julia-nightlies/fedora-23-x86_64/00167549-julia/
>  
>
> It would be interesting to compare it with the latest nightly with 3.7. 
>
>
> Regards 
>
>
>
> > > hey guys, 
> > > I just experienced something weird. I have some code that runs fine 
> > > on 0.43, then I updated to 0.5dev to test the new Arrays, ran the same 
> > > code and noticed it got about ~50% slower. Then I downgraded back 
> > > to 0.43, ran the old code, but speed remained slow. I noticed while 
> > > reinstalling 0.43, openblas-threads didn't get installed along with 
> > > it. So I manually installed it, but no change.  
> > > Does anyone have an idea what could be going on? LLVM on fedora23 is 
> > > 3.7 
> > > 
> > > Cheers, Johannes 
> > > 
>


[julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-03-19 Thread Johannes Wagner
just a little update. Tested some other fedoras: Fedora 22 with llvm 3.8 is 
also slow with julia 0.5, whereas a fedora 24 branch with llvm 3.7 is 
faster on julia 0.5 compared to julia 0.4, as it should be (speedup from 
inner loop parts translated into speedup to whole function).

I don't know if anyone cares about that... At least the latest version seems 
to work fine; I hope it stays like this into the final Fedora 24.



On Friday, February 26, 2016 at 7:08:06 PM UTC+4, Johannes Wagner wrote:
>
> hey guys,
> I just experienced something weird. I have some code that runs fine on 
> 0.43, then I updated to 0.5dev to test the new Arrays, ran the same code and 
> noticed it got about ~50% slower. Then I downgraded back to 0.43, ran the 
> old code, but speed remained slow. I noticed while reinstalling 0.43, 
> openblas-threads didn't get installed along with it. So I manually 
> installed it, but no change. 
> Does anyone have an idea what could be going on? LLVM on fedora23 is 3.7
>
> Cheers, Johannes
>


[julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-03-08 Thread Johannes Wagner
as an example, the data looks like this:

using TimeIt
v = rand(3)
r = rand(6000,3)
x = linspace(1.0, 2.0, 300) * (v./sqrt(sumabs2(v)))'

# Julia 0.4 function

function s04(xl, rl)
    result = zeros(size(xl, 1))
    for i = 1:size(xl, 1)
        dotprods = rl * xl[i, :]'                 # 1 loops, best of 3: 17.66 µs per loop
        imexp    = exp(im .* dotprods)            # 1000 loops, best of 3: 172.33 µs per loop
        sumprod  = sum(imexp) * sum(conj(imexp))  # 1 loops, best of 3: 21.04 µs per loop
        result[i] = sumprod
    end
    return result
end

and using @timeit s04(x,r) gives 
10 loops, best of 3: 67.52 ms per loop 
where most of the time is spent in the exp() calls. Now in 0.5dev, the individual 
parts have similar or actually better timings, like the dot product:

# Julia 0.5 function

function s05(xl, rl)
    result = zeros(size(xl, 1))
    for i = 1:size(xl, 1)
        dotprods = rl * xl[i, :]                  # 1 loops, best of 3: 10.99 µs per loop
        imexp    = exp(im .* dotprods)            # 1000 loops, best of 3: 158.50 µs per loop
        sumprod  = sum(imexp) * sum(conj(imexp))  # 1 loops, best of 3: 21.81 µs per loop
        result[i] = sumprod
    end
    return result
end

but @timeit s05(x,r) always gives a runtime that is ~70% worse:
10 loops, best of 3: 113.80 ms per loop

And it is always the same on my Fedora 23 workstation: individual calls inside the 
function have slightly better performance on 0.5dev, but the whole function 
is slower. Oddly enough, only on my Fedora workstation! On an OS X 
laptop, those 0.5dev speedups from the parts inside the loop translate into 
the expected speedup for the whole function!
So that puzzles me; could someone perhaps reproduce this with the above 
function and input on a Linux system, preferably also Fedora?
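
For anyone trying to reproduce, here is a self-contained sketch of the benchmark (it assumes the TimeIt package is installed; the s04 variant uses the 0.4-style transpose on the row slice):

```julia
using TimeIt

# Input data as described above.
v = rand(3)
r = rand(6000, 3)
x = linspace(1.0, 2.0, 300) * (v ./ sqrt(sumabs2(v)))'

function s04(xl, rl)
    result = zeros(size(xl, 1))
    for i = 1:size(xl, 1)
        dotprods = rl * xl[i, :]'                  # projections of each row of rl
        imexp    = exp(im .* dotprods)             # complex exponentials
        result[i] = sum(imexp) * sum(conj(imexp))  # |sum|^2, imaginary part is exactly zero
    end
    return result
end

@timeit s04(x, r)   # compare this whole-function timing across Julia versions
```

Comparing the reported per-loop time on 0.4 and 0.5dev on the same machine should show whether the whole-function regression reproduces.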

cheers, Johannes

On Friday, February 26, 2016 at 4:28:05 PM UTC+1, Kristoffer Carlsson wrote:
>
> What code and where is it spending time? You talk about openblas, does it 
> mean that blas got slower for you? How about peakflops() on the different 
> versions?
>
> On Friday, February 26, 2016 at 4:08:06 PM UTC+1, Johannes Wagner wrote:
>>
>> hey guys,
>> I just experienced something weird. I have some code that runs fine on 
>> 0.43, then I updated to 0.5dev to test the new Arrays, ran the same code and 
>> noticed it got about ~50% slower. Then I downgraded back to 0.43, ran the 
>> old code, but speed remained slow. I noticed while reinstalling 0.43, 
>> openblas-threads didn't get installed along with it. So I manually 
>> installed it, but no change. 
>> Does anyone have an idea what could be going on? LLVM on fedora23 is 3.7
>>
>> Cheers, Johannes
>>
>

[julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-02-27 Thread Johannes Wagner
as an example, the data looks like this:

v = rand(3)
r = rand(6000,3)
x = linspace(1.0, 2.0, 300) * (v./sqrt(sumabs2(v)))'

my function in 0.4 looks like this:

function s04(xl, rl)
    result = zeros(size(xl, 1))
    for i = 1:size(xl, 1)
        dotprods = rl * xl[i, :]'                 # 1 loops, best of 3: 17.66 µs per loop
        imexp    = exp(im .* dotprods)            # 1000 loops, best of 3: 172.33 µs per loop
        sumprod  = sum(imexp) * sum(conj(imexp))  # 1 loops, best of 3: 21.04 µs per loop
        result[i] = sumprod
    end
    return result
end

and using @timeit s04(x,r) gives 
10 loops, best of 3: 67.52 ms per loop 
where most of the time is spent in the exp() calls. Now in 0.5dev, the individual 
parts have similar or actually better timings, like the dot product:

function s05(xl, rl)
    result = zeros(size(xl, 1))
    for i = 1:size(xl, 1)
        dotprods = rl * xl[i, :]                  # 1 loops, best of 3: 10.99 µs per loop
        imexp    = exp(im .* dotprods)            # 1000 loops, best of 3: 158.50 µs per loop
        sumprod  = sum(imexp) * sum(conj(imexp))  # 1 loops, best of 3: 21.81 µs per loop
        result[i] = sumprod
    end
    return result
end

but @timeit s05(x,r) always gives a runtime that is ~70% worse:
10 loops, best of 3: 113.80 ms per loop

The summing I then replaced by its BLAS counterpart, for a modest speedup:

sumprod = Base.LinAlg.BLAS.asum(imexp) * Base.LinAlg.BLAS.asum(conj(imexp))  # 1 loops, best of 3: 17.02 µs per loop
The exp() call also runs a bit faster devectorized. But it is always the same 
on my Fedora 23 workstation: individual calls inside the function have 
slightly better performance on 0.5dev, but the whole function is slower. 
And oddly enough, only on my Fedora workstation! On an OS X laptop, those 
0.5dev speedups from the parts inside the loop translate into the expected 
speedup for the whole function!
So that puzzles me; perhaps someone can reproduce this with the above function 
and input?
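
As a concrete illustration of the devectorized exp step (a sketch, not the code from the original post): since sum(conj(imexp)) == conj(sum(imexp)), the whole sumprod can be accumulated in one pass without allocating the imexp temporary at all:

```julia
# One-pass devectorized version of the imexp/sumprod step.
# exp(im*d) == cos(d) + im*sin(d), and sum(conj(z)) == conj(sum(z)),
# so sumprod == s * conj(s), where s is the sum of the exponentials.
function sumprod_devec(dotprods)
    s = zero(Complex{Float64})
    for d in dotprods
        s += Complex(cos(d), sin(d))   # exp(im*d) without a temporary array
    end
    return s * conj(s)
end
```

This trades the vectorized exp over an array for a scalar loop, which avoids the allocation of the intermediate complex array on every outer-loop iteration.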

cheers, Johannes





On Friday, February 26, 2016 at 4:28:05 PM UTC+1, Kristoffer Carlsson wrote:
>
> What code and where is it spending time? You talk about openblas, does it 
> mean that blas got slower for you? How about peakflops() on the different 
> versions?
>
> On Friday, February 26, 2016 at 4:08:06 PM UTC+1, Johannes Wagner wrote:
>>
>> hey guys,
>> I just experienced something weird. I have some code that runs fine on 
>> 0.43, then I updated to 0.5dev to test the new Arrays, ran the same code and 
>> noticed it got about ~50% slower. Then I downgraded back to 0.43, ran the 
>> old code, but speed remained slow. I noticed while reinstalling 0.43, 
>> openblas-threads didn't get installed along with it. So I manually 
>> installed it, but no change. 
>> Does anyone have an idea what could be going on? LLVM on fedora23 is 3.7
>>
>> Cheers, Johannes
>>
>

[julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-02-27 Thread Johannes Wagner
Spending time mostly on exp() calls on an array; other than that, a dot 
product and a sum... openblas I only mentioned because dnf uninstalled 
openblas-threads as a dependency while uninstalling julia 0.5, and while 
installing again, openblas-threads did not get installed as a dependency. 
I don't know how exactly this might affect runtime, as I'm not sure how 
openblas is used in the background, especially if you are not calling 
Base.LinAlg.BLAS functions explicitly...


On Friday, February 26, 2016 at 4:28:05 PM UTC+1, Kristoffer Carlsson wrote:
>
> What code and where is it spending time? You talk about openblas, does it 
> mean that blas got slower for you? How about peakflops() on the different 
> versions?
>
> On Friday, February 26, 2016 at 4:08:06 PM UTC+1, Johannes Wagner wrote:
>>
>> hey guys,
>> I just experienced something weird. I have some code that runs fine on 
>> 0.43, then I updated to 0.5dev to test the new Arrays, ran the same code and 
>> noticed it got about ~50% slower. Then I downgraded back to 0.43, ran the 
>> old code, but speed remained slow. I noticed while reinstalling 0.43, 
>> openblas-threads didn't get installed along with it. So I manually 
>> installed it, but no change. 
>> Does anyone have an idea what could be going on? LLVM on fedora23 is 3.7
>>
>> Cheers, Johannes
>>
>

Re: [julia-users] Re: regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-02-27 Thread Johannes Wagner
I did not compile them; I just installed them from the nalimilan/julia and 
nalimilan/julia-nightlies repos with dnf on Fedora 23.


On Friday, February 26, 2016 at 4:39:28 PM UTC+1, Yichao Yu wrote:
>
> On Fri, Feb 26, 2016 at 10:28 AM, Kristoffer Carlsson wrote: 
> > What code and where is it spending time? You talk about openblas, does it 
> > mean that blas got slower for you? How about peakflops() on the different 
> > versions? 
> > 
> > 
> > On Friday, February 26, 2016 at 4:08:06 PM UTC+1, Johannes Wagner wrote: 
> >> 
> >> hey guys, 
> >> I just experienced something weird. I have some code that runs fine on 
> >> 0.43, then I updated to 0.5dev to test the new Arrays, ran the same code 
> >> and noticed it got about ~50% slower. Then I downgraded back to 0.43, ran 
> >> the old code, but speed remained slow. I noticed while reinstalling 0.43, 
> >> openblas-threads didn't get installed along with it. So I manually 
> >> installed it, but no change. 
> >> Does anyone have an idea what could be going on? LLVM on fedora23 is 3.7 
>
> Also, how did you install/compile the two versions. 
>
> >> 
> >> Cheers, Johannes 
>


[julia-users] regression from 0.43 to 0.5dev, and back to 0.43 on fedora23

2016-02-26 Thread Johannes Wagner
hey guys,
I just experienced something weird. I have some code that runs fine on 
0.43, then I updated to 0.5dev to test the new Arrays, ran the same code and 
noticed it got about ~50% slower. Then I downgraded back to 0.43, ran the 
old code, but speed remained slow. I noticed while reinstalling 0.43, 
openblas-threads didn't get installed along with it. So I manually 
installed it, but no change. 
Does anyone have an idea what could be going on? LLVM on fedora23 is 3.7

Cheers, Johannes