On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <yyc1...@gmail.com> wrote:
> Further update:
>
> I made a C++ version [1] and see a similar effect (depending on
> optimization levels), so it's not a Julia issue (not that I think it
> really was to begin with...).
After investigating the C++ version more, I found that the difference
between the fast-math and the non-fast-math versions is that the
compiler emits a function called `set_fast_math` (disassembly below).
From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
register (for SSE), and according to this page [1] these are the DAZ
(denormals-are-zero) and FZ (flush-to-zero) bits, both related to
underflow. The same page also describes denormals as taking
"considerably longer to process". Since the operation I have keeps
decreasing the value, I guess it makes sense that there's a
value-dependent performance... (and it kind of makes sense that the
FFT also suffers from these values.)

So now the questions are:

1. How important are underflow and denormal values? Note that I'm not
catching underflow explicitly anyway, and I don't really care about
values that are really small compared to 1.

2. Is there a way to set up the SSE registers as the C compilers do?
`@fastmath` does not seem to be doing this.

00000000000005b0 <set_fast_math>:
 5b0:   0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
 5b5:   81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
 5bc:   00
 5bd:   0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
 5c2:   c3                      retq
 5c3:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 5ca:   00 00 00
 5cd:   0f 1f 00                nopl   (%rax)

[1] http://softpixel.com/~cwright/programming/simd/sse.php

>
> The slowdown is present in the C++ version at all optimization
> levels except -Ofast and -ffast-math. The Julia version is faster
> than the default performance for both gcc and clang, but is slower
> than the fast case at higher optimization levels. For O2 and higher,
> the C++

The slowness of the Julia version seems to be due to multidimensional
arrays. Using a 1d array yields performance similar to C.

> version shows a ~100x slowdown in the slow case.
>
> @fastmath in Julia doesn't seem to have an effect on this, although
> it does for clang and gcc...
>
> [1]
> https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>
> On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc1...@gmail.com> wrote:
>> Update:
>>
>> I've just got an even simpler version without any complex numbers
>> that only has Float64. The two loops are as small as the following
>> LLVM IR now, and there's only simple arithmetic in the loop body.
>>
>> ```llvm
>> L9.preheader:                     ; preds = %L12, %L9.preheader.preheader
>>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>>   br label %L9
>>
>> L9:                               ; preds = %L9, %L9.preheader
>>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>>   %44 = add i64 %"#s4.0", 1
>>   %45 = add i64 %"#s4.0", -1
>>   %46 = mul i64 %45, %10
>>   %47 = getelementptr double* %7, i64 %46
>>   %48 = load double* %47, align 8
>>   %49 = add i64 %46, 1
>>   %50 = getelementptr double* %7, i64 %49
>>   %51 = load double* %50, align 8
>>   %52 = fmul double %51, %3
>>   %53 = fmul double %38, %48
>>   %54 = fmul double %33, %52
>>   %55 = fadd double %53, %54
>>   store double %55, double* %50, align 8
>>   %56 = fmul double %38, %52
>>   %57 = fmul double %33, %48
>>   %58 = fsub double %56, %57
>>   store double %58, double* %47, align 8
>>   %59 = icmp eq i64 %"#s4.0", %12
>>   br i1 %59, label %L12, label %L9
>>
>> L12:                              ; preds = %L9
>>   %60 = add i64 %"#s3.0", 1
>>   %61 = icmp eq i64 %"#s3.0", %42
>>   br i1 %61, label %L14.loopexit, label %L9.preheader
>> ```
>>
>> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc1...@gmail.com> wrote:
>>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin.j.ow...@gmail.com> wrote:
>>>> I can't really help you debug the IR code, but I can at least say
>>>> I'm seeing a similar thing. It starts to slow down just after 0.5,
>>>> and doesn't
>>>
>>> Thanks for the confirmation. At least I'm not crazy (or not the only
>>> one, to be more precise :P )
>>>
>>>> get back to where it was at 0.5 until 0.87.
>>>> Can you compare the IR code when two different values are used, to
>>>> see what's different? When I tried looking at the difference
>>>> between 0.50 and 0.51, the biggest thing that popped out to me was
>>>> that the numbers after "!dbg" were different.
>>>
>>> This is exactly the strange part. I don't think either LLVM or Julia
>>> is doing constant propagation here, and different input values
>>> should be using the same function. The evidence:
>>>
>>> 1. Julia only specializes on types, not values (for now).
>>> 2. The function is not inlined, which has to be the case in global
>>>    scope and can be double-checked by adding `@noinline`. It's also
>>>    too big to be inline_worthy.
>>> 3. No codegen is happening except on the first call. This can be
>>>    seen from the output of `@time`: only the first call has
>>>    thousands of allocations; the following calls have fewer than 5
>>>    (on 0.4 at least). If Julia could compile a function with fewer
>>>    than 5 allocations, I'd be very happy...
>>> 4. My original version also has much more complicated logic to
>>>    compute that scaling factor (OK, it's just a function call, but
>>>    with parameters gathered from different arguments). I'd be really
>>>    surprised if either LLVM or Julia could reason anything about
>>>    it...
>>>
>>> The difference in the debug info should just be an artifact of
>>> emitting it twice.
>>>
>>>>
>>>> Even 0.50001 is a lot slower:
>>>>
>>>> julia> for eΓ in 0.5:0.00001:0.50015
>>>>            println(eΓ)
>>>>            gc()
>>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>>>        end
>>>> 0.5
>>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>>>> 0.50001
>>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>>>
>>> This is actually interesting, and I can confirm the same here: `0.5`
>>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
>>> whether the CPU does something special depending on the bit pattern,
>>> but I still don't understand why it would be very bad for a certain
>>> range
>>> (and it is not only a function of this value either; it's also
>>> affected by all the other values).
>>>
>>>>
>>>> julia> versioninfo()
>>>> Julia Version 0.3.9
>>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>>>> Platform Info:
>>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>>>
>>> Good, so it's not Linux-only.
>>>
>>>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>>>   LAPACK: libopenblas
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.3
>>>>
>>>>
>>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>>>>
>>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>>>> > P.S. Given how strange this problem is for me, I would
>>>>> > appreciate it if anyone could confirm either that this is a real
>>>>> > issue or that I'm somehow being crazy or stupid.
>>>>> >
>>>>>
>>>>> One additional strange property of this issue is that I used to
>>>>> have much more costly operations in the (outer) loop (the one that
>>>>> iterates over nsteps with i), like Fourier transforms. However,
>>>>> when the scaling factor takes the bad value, it slows everything
>>>>> down (i.e. the Fourier transform is also slower by ~10x).
>>>>>
>>>>> >
>>>>> >
>>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>>>> >> Hi,
>>>>> >>
>>>>> >> I've just seen a very strange (for me) performance difference
>>>>> >> for exactly the same code on slightly different input, with no
>>>>> >> explicit branches.
>>>>> >>
>>>>> >> The code is available here [1]. The most relevant part is the
>>>>> >> following function. (All other parts of the code are for
>>>>> >> initialization and benchmarking.) This is a simplified version
>>>>> >> of my simulation that computes the next column in the array
>>>>> >> based on the previous one.
>>>>> >>
>>>>> >> The strange part is that the performance of this function can
>>>>> >> differ by 10x depending on the value of the scaling factor
>>>>> >> (`eΓ`, the only use of which is marked in the code below), even
>>>>> >> though I don't see any branches that depend on that value in
>>>>> >> the relevant code (unless the CPU is 10x less efficient for
>>>>> >> certain input values).
>>>>> >>
>>>>> >> function propagate(P, ψ0, ψs, eΓ)
>>>>> >>     @inbounds for i in 1:P.nele
>>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>>>> >>     end
>>>>> >>     T12 = im * sin(P.Ω)
>>>>> >>     T11 = cos(P.Ω)
>>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>>>> >>         for j in 1:P.nele
>>>>> >>             ψ_e = ψs[1, j, i - 1]
>>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>>>> >>         end
>>>>> >>     end
>>>>> >>     ψs
>>>>> >> end
>>>>> >>
>>>>> >> The output of the full script is attached, and it can be
>>>>> >> clearly seen that for scaling factors 0.6-0.8 the performance
>>>>> >> is 5-10 times slower than for the others.
>>>>> >>
>>>>> >> The assembly [2] and LLVM [3] code of this function are also in
>>>>> >> the same repo. I see the same behavior on both 0.3 and 0.4,
>>>>> >> with LLVM 3.3 and LLVM 3.6, on two different x86_64 machines
>>>>> >> (my laptop and a Linode VPS). (The only platform I've tried
>>>>> >> that doesn't show similar behavior is Julia 0.4 on qemu-arm...
>>>>> >> although the performance between different values also differs
>>>>> >> by ~30% there, which is bigger than noise.)
>>>>> >>
>>>>> >> This also seems to depend on the initial value.
>>>>> >>
>>>>> >> Has anyone seen similar problems before?
>>>>> >>
>>>>> >> Outputs:
>>>>> >>
>>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>>>> >> 0.0
>>>>> >> 19.227 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.1
>>>>> >> 17.291 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.2
>>>>> >> 17.404 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.3
>>>>> >> 19.231 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.4
>>>>> >> 20.278 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.5
>>>>> >> 23.692 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.6
>>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.7
>>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.8
>>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.9
>>>>> >> 16.314 milliseconds (2 allocations: 48 bytes)
>>>>> >> 1.0
>>>>> >> 16.264 milliseconds (2 allocations: 48 bytes)
>>>>> >>
>>>>> >>
>>>>> >> [1]
>>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>>>> >> [2]
>>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>>>> >> [3]
>>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll