Further update: I made a c++ version[1] and see a similar effect (depending on optimization levels) so it's not a julia issue (not that I think it really was to begin with...).
The slow down is presented in the c++ version for all optimization levels except Ofast and ffast-math. The julia version is faster than the default performance for both gcc and clang but is slower in the fast case for higher optmization levels. For O2 and higher, the c++ version shows a ~100x slow down for the slow case. @fast_math in julia doesn't seem to have an effect for this although it does for clang and gcc... [1] https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc1...@gmail.com> wrote: > Update: > > I've just got an even simpler version without any complex numbers and > only has Float64. The two loops are as small as the following LLVM-IR > now and there's only simple arithmetics in the loop body. > > ```llvm > L9.preheader: ; preds = %L12, > %L9.preheader.preheader > %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ] > br label %L9 > > L9: ; preds = %L9, %L9.preheader > %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ] > %44 = add i64 %"#s4.0", 1 > %45 = add i64 %"#s4.0", -1 > %46 = mul i64 %45, %10 > %47 = getelementptr double* %7, i64 %46 > %48 = load double* %47, align 8 > %49 = add i64 %46, 1 > %50 = getelementptr double* %7, i64 %49 > %51 = load double* %50, align 8 > %52 = fmul double %51, %3 > %53 = fmul double %38, %48 > %54 = fmul double %33, %52 > %55 = fadd double %53, %54 > store double %55, double* %50, align 8 > %56 = fmul double %38, %52 > %57 = fmul double %33, %48 > %58 = fsub double %56, %57 > store double %58, double* %47, align 8 > %59 = icmp eq i64 %"#s4.0", %12 > br i1 %59, label %L12, label %L9 > > L12: ; preds = %L9 > %60 = add i64 %"#s3.0", 1 > %61 = icmp eq i64 %"#s3.0", %42 > br i1 %61, label %L14.loopexit, label %L9.preheader > ``` > > On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc1...@gmail.com> wrote: >> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin.j.ow...@gmail.com> wrote: >>> I can't really help you debug the IR code, but I can at least say I'm seeing >>> a similar thing. It starts to slow down around just after 0.5, and doesn't >> >> Thanks for the confirmation. At least I'm not crazy (or not the only >> one to be more precise :P ) >> >>> get back to where it was at 0.5 until 0.87. Can you compare the IR code when >>> two different values are used, to see what's different? When I tried looking >>> at the difference between 0.50 and 0.51, the biggest thing that popped out >>> to me was that the numbers after "!dbg" were different. >> >> This is exactly the strange part. I don't think either llvm or julia >> is doing constant propagation here and different input values should >> be using the same function. Evidents are >> >> 1. Julia only specialize on types not values (for now) >> 2. The function is not inlined. Which has to be the case in global >> scope and can be double checked by adding `@noinline`. It's also too >> big to be inline_worthy >> 3. No codegen is happening except the first call. This can be seen >> from the output of `@time`. Only the first call has thousands of >> allocations. Following call has less than 5 (on 0.4 at least). If >> julia can compile a function with less than 5 allocation, I'll be very >> happy..... >> 4. My original version also has much more complicated logic to compute >> that scaling factor (ok. it's just a function call, but with >> parameters gathered from different arguments) I'd be really surprised >> if either llvm or julia can reason anything about it.... >> >> The difference in the debug info should just be an artifact of >> emitting it twice. >> >>> >>> Even 0.50001 is a lot slower: >>> >>> julia> for eΓ in 0.5:0.00001:0.50015 >>> println(eΓ) >>> gc() >>> @time ψs = propagate(P, ψ0, ψs, eΓ) >>> end >>> 0.5 >>> elapsed time: 0.065609581 seconds (16 bytes allocated) >>> 0.50001 >>> elapsed time: 0.875806461 seconds (16 bytes allocated) >> >> This is actually interesting and I can confirm the same here. `0.5` >> takes 28ms while 0.5000000000000001 takes 320ms. I was thinking >> whether the cpu is doing sth special depending on the bit pattern but >> I still don't understand why it would be very bad for a certain range. >> (and is not only a function of this either, also affected by all other >> values) >> >>> >>> >>> julia> versioninfo() >>> Julia Version 0.3.9 >>> Commit 31efe69 (2015-05-30 11:24 UTC) >>> Platform Info: >>> System: Darwin (x86_64-apple-darwin13.4.0) >> >> Good. so it's not Linux only. >> >>> CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz >>> WORD_SIZE: 64 >>> BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn) >>> LAPACK: libopenblas >>> LIBM: libopenlibm >>> LLVM: libLLVM-3.3 >>> >>> >>> >>> >>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote: >>>> >>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote: >>>> > P.S. Given how strange this problem is for me, I would appreciate if >>>> > anyone can confirm either this is a real issue or I'm somehow being >>>> > crazy or stupid. >>>> > >>>> >>>> One additional strange property of this issue is that I used to have >>>> much costy operations in the (outer) loop (the one that iterate over >>>> nsteps with i) like Fourier transformations. However, when the scaling >>>> factor is taking the bad value, it slows everything down (i.e. the >>>> Fourier transformation is also slower by ~10x). >>>> >>>> > >>>> > >>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote: >>>> >> Hi, >>>> >> >>>> >> I've just seen a very strange (for me) performance difference for >>>> >> exactly the same code on slightly different input with no explicit >>>> >> branches. >>>> >> >>>> >> The code is available here[1]. The most relavant part is the following >>>> >> function. (All other part of the code are for initialization and bench >>>> >> mark). This is a simplified version of my similation that compute the >>>> >> next array column in the array based on the previous one. >>>> >> >>>> >> The strange part is that the performance of this function can differ >>>> >> by 10x depend on the value of the scaling factor (`eΓ`, the only use >>>> >> of which is marked in the code below) even though I don't see any >>>> >> branches that depends on that value in the relavant code. (unless the >>>> >> cpu is 10x less efficient for certain input values) >>>> >> >>>> >> function propagate(P, ψ0, ψs, eΓ) >>>> >> @inbounds for i in 1:P.nele >>>> >> ψs[1, i, 1] = ψ0[1, i] >>>> >> ψs[2, i, 1] = ψ0[2, i] >>>> >> end >>>> >> T12 = im * sin(P.Ω) >>>> >> T11 = cos(P.Ω) >>>> >> @inbounds for i in 2:(P.nstep + 1) >>>> >> for j in 1:P.nele >>>> >> ψ_e = ψs[1, j, i - 1] >>>> >> ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor >>>> >> ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g >>>> >> ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e >>>> >> end >>>> >> end >>>> >> ψs >>>> >> end >>>> >> >>>> >> The output of the full script is attached and it can be clearly seen >>>> >> that for scaling factor 0.6-0.8, the performance is 5-10 times slower >>>> >> than others. >>>> >> >>>> >> The assembly[2] and llvm[3] code of this function is also in the same >>>> >> repo. I see the same behavior on both 0.3 and 0.4 and with LLVM 3.3 >>>> >> and LLVM 3.6 on two different x86_64 machine (my laptop and a linode >>>> >> VPS) (the only platform I've tried that doesn't show similar behavior >>>> >> is running julia 0.4 on qemu-arm....... although the performance >>>> >> between different values also differ by ~30% which is bigger than >>>> >> noise) >>>> >> >>>> >> This also seems to depend on the initial value. >>>> >> >>>> >> Has anyone seen similar problems before? >>>> >> >>>> >> Outputs: >>>> >> >>>> >> 325.821 milliseconds (25383 allocations: 1159 KB) >>>> >> 307.826 milliseconds (4 allocations: 144 bytes) >>>> >> 0.0 >>>> >> 19.227 milliseconds (2 allocations: 48 bytes) >>>> >> 0.1 >>>> >> 17.291 milliseconds (2 allocations: 48 bytes) >>>> >> 0.2 >>>> >> 17.404 milliseconds (2 allocations: 48 bytes) >>>> >> 0.3 >>>> >> 19.231 milliseconds (2 allocations: 48 bytes) >>>> >> 0.4 >>>> >> 20.278 milliseconds (2 allocations: 48 bytes) >>>> >> 0.5 >>>> >> 23.692 milliseconds (2 allocations: 48 bytes) >>>> >> 0.6 >>>> >> 328.107 milliseconds (2 allocations: 48 bytes) >>>> >> 0.7 >>>> >> 312.425 milliseconds (2 allocations: 48 bytes) >>>> >> 0.8 >>>> >> 201.494 milliseconds (2 allocations: 48 bytes) >>>> >> 0.9 >>>> >> 16.314 milliseconds (2 allocations: 48 bytes) >>>> >> 1.0 >>>> >> 16.264 milliseconds (2 allocations: 48 bytes) >>>> >> >>>> >> >>>> >> [1] >>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl >>>> >> [2] >>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S >>>> >> [2] >>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll