Further update:

I made a c++ version[1] and see a similar effect (depending on
optimization levels) so it's not a julia issue (not that I think it
really was to begin with...).

The slow down is presented in the c++ version for all optimization
levels except Ofast and ffast-math. The julia version is faster than
the default performance for both gcc and clang but is slower in the
fast case for higher optmization levels. For O2 and higher, the c++
version shows a ~100x slow down for the slow case.

@fast_math in julia doesn't seem to have an effect for this although
it does for clang and gcc...

[1] 
https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp

On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc1...@gmail.com> wrote:
> Update:
>
> I've just got an even simpler version without any complex numbers and
> only has Float64. The two loops are as small as the following LLVM-IR
> now and there's only simple arithmetics in the loop body.
>
> ```llvm
> L9.preheader:                                     ; preds = %L12,
> %L9.preheader.preheader
>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>   br label %L9
>
> L9:                                               ; preds = %L9, %L9.preheader
>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>   %44 = add i64 %"#s4.0", 1
>   %45 = add i64 %"#s4.0", -1
>   %46 = mul i64 %45, %10
>   %47 = getelementptr double* %7, i64 %46
>   %48 = load double* %47, align 8
>   %49 = add i64 %46, 1
>   %50 = getelementptr double* %7, i64 %49
>   %51 = load double* %50, align 8
>   %52 = fmul double %51, %3
>   %53 = fmul double %38, %48
>   %54 = fmul double %33, %52
>   %55 = fadd double %53, %54
>   store double %55, double* %50, align 8
>   %56 = fmul double %38, %52
>   %57 = fmul double %33, %48
>   %58 = fsub double %56, %57
>   store double %58, double* %47, align 8
>   %59 = icmp eq i64 %"#s4.0", %12
>   br i1 %59, label %L12, label %L9
>
> L12:                                              ; preds = %L9
>   %60 = add i64 %"#s3.0", 1
>   %61 = icmp eq i64 %"#s3.0", %42
>   br i1 %61, label %L14.loopexit, label %L9.preheader
> ```
>
> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc1...@gmail.com> wrote:
>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin.j.ow...@gmail.com> wrote:
>>> I can't really help you debug the IR code, but I can at least say I'm seeing
>>> a similar thing. It starts to slow down around just after 0.5, and doesn't
>>
>> Thanks for the confirmation. At least I'm not crazy (or not the only
>> one to be more precise :P )
>>
>>> get back to where it was at 0.5 until 0.87. Can you compare the IR code when
>>> two different values are used, to see what's different? When I tried looking
>>> at the difference between 0.50 and 0.51, the biggest thing that popped out
>>> to me was that the numbers after "!dbg" were different.
>>
>> This is exactly the strange part. I don't think either llvm or julia
>> is doing constant propagation here and different input values should
>> be using the same function. Evidents are
>>
>> 1. Julia only specialize on types not values (for now)
>> 2. The function is not inlined. Which has to be the case in global
>> scope and can be double checked by adding `@noinline`. It's also too
>> big to be inline_worthy
>> 3. No codegen is happening except the first call. This can be seen
>> from the output of `@time`. Only the first call has thousands of
>> allocations. Following call has less than 5 (on 0.4 at least). If
>> julia can compile a function with less than 5 allocation, I'll be very
>> happy.....
>> 4. My original version also has much more complicated logic to compute
>> that scaling factor (ok. it's just a function call, but with
>> parameters gathered from different arguments) I'd be really surprised
>> if either llvm or julia can reason anything about it....
>>
>> The difference in the debug info should just be an artifact of
>> emitting it twice.
>>
>>>
>>> Even 0.50001 is a lot slower:
>>>
>>> julia> for eΓ in 0.5:0.00001:0.50015
>>>            println(eΓ)
>>>            gc()
>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>>        end
>>> 0.5
>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>>> 0.50001
>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>>
>> This is actually interesting and I can confirm the same here. `0.5`
>> takes 28ms while 0.5000000000000001 takes 320ms. I was thinking
>> whether the cpu is doing sth special depending on the bit pattern but
>> I still don't understand why it would be very bad for a certain range.
>> (and is not only a function of this either, also affected by all other
>> values)
>>
>>>
>>>
>>> julia> versioninfo()
>>> Julia Version 0.3.9
>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>>> Platform Info:
>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>>
>> Good. so it's not Linux only.
>>
>>>   CPU: Intel(R) Core(TM)2 Duo CPU     T8300  @ 2.40GHz
>>>   WORD_SIZE: 64
>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>>   LAPACK: libopenblas
>>>   LIBM: libopenlibm
>>>   LLVM: libLLVM-3.3
>>>
>>>
>>>
>>>
>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>>>
>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>>> > P.S. Given how strange this problem is for me, I would appreciate if
>>>> > anyone can confirm either this is a real issue or I'm somehow being
>>>> > crazy or stupid.
>>>> >
>>>>
>>>> One additional strange property of this issue is that I used to have
>>>> much costy operations in the (outer) loop (the one that iterate over
>>>> nsteps with i) like Fourier transformations. However, when the scaling
>>>> factor is taking the bad value, it slows everything down (i.e. the
>>>> Fourier transformation is also slower by ~10x).
>>>>
>>>> >
>>>> >
>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>>> >> Hi,
>>>> >>
>>>> >> I've just seen a very strange (for me) performance difference for
>>>> >> exactly the same code on slightly different input with no explicit
>>>> >> branches.
>>>> >>
>>>> >> The code is available here[1]. The most relavant part is the following
>>>> >> function. (All other part of the code are for initialization and bench
>>>> >> mark). This is a simplified version of my similation that compute the
>>>> >> next array column in the array based on the previous one.
>>>> >>
>>>> >> The strange part is that the performance of this function can differ
>>>> >> by 10x depend on the value of the scaling factor (`eΓ`, the only use
>>>> >> of which is marked in the code below) even though I don't see any
>>>> >> branches that depends on that value in the relavant code. (unless the
>>>> >> cpu is 10x less efficient for certain input values)
>>>> >>
>>>> >> function propagate(P, ψ0, ψs, eΓ)
>>>> >>     @inbounds for i in 1:P.nele
>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>>> >>     end
>>>> >>     T12 = im * sin(P.Ω)
>>>> >>     T11 = cos(P.Ω)
>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>>> >>         for j in 1:P.nele
>>>> >>             ψ_e = ψs[1, j, i - 1]
>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>>> >>         end
>>>> >>     end
>>>> >>     ψs
>>>> >> end
>>>> >>
>>>> >> The output of the full script is attached and it can be clearly seen
>>>> >> that for scaling factor 0.6-0.8, the performance is 5-10 times slower
>>>> >> than others.
>>>> >>
>>>> >> The assembly[2] and llvm[3] code of this function is also in the same
>>>> >> repo. I see the same behavior on both 0.3 and 0.4 and with LLVM 3.3
>>>> >> and LLVM 3.6 on two different x86_64 machine (my laptop and a linode
>>>> >> VPS) (the only platform I've tried that doesn't show similar behavior
>>>> >> is running julia 0.4 on qemu-arm....... although the performance
>>>> >> between different values also differ by ~30% which is bigger than
>>>> >> noise)
>>>> >>
>>>> >> This also seems to depend on the initial value.
>>>> >>
>>>> >> Has anyone seen similar problems before?
>>>> >>
>>>> >> Outputs:
>>>> >>
>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>>> >> 0.0
>>>> >>  19.227 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.1
>>>> >>  17.291 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.2
>>>> >>  17.404 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.3
>>>> >>  19.231 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.4
>>>> >>  20.278 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.5
>>>> >>  23.692 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.6
>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.7
>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.8
>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>>> >> 0.9
>>>> >>  16.314 milliseconds (2 allocations: 48 bytes)
>>>> >> 1.0
>>>> >>  16.264 milliseconds (2 allocations: 48 bytes)
>>>> >>
>>>> >>
>>>> >> [1]
>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>>> >> [2]
>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>>> >> [2]
>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll

Reply via email to