On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin.j.ow...@gmail.com> wrote: > I can't really help you debug the IR code, but I can at least say I'm seeing > a similar thing. It starts to slow down around just after 0.5, and doesn't
Thanks for the confirmation. At least I'm not crazy (or not the only one to be more precise :P ) > get back to where it was at 0.5 until 0.87. Can you compare the IR code when > two different values are used, to see what's different? When I tried looking > at the difference between 0.50 and 0.51, the biggest thing that popped out > to me was that the numbers after "!dbg" were different. This is exactly the strange part. I don't think either llvm or julia is doing constant propagation here and different input values should be using the same function. Evidents are 1. Julia only specialize on types not values (for now) 2. The function is not inlined. Which has to be the case in global scope and can be double checked by adding `@noinline`. It's also too big to be inline_worthy 3. No codegen is happening except the first call. This can be seen from the output of `@time`. Only the first call has thousands of allocations. Following call has less than 5 (on 0.4 at least). If julia can compile a function with less than 5 allocation, I'll be very happy..... 4. My original version also has much more complicated logic to compute that scaling factor (ok. it's just a function call, but with parameters gathered from different arguments) I'd be really surprised if either llvm or julia can reason anything about it.... The difference in the debug info should just be an artifact of emitting it twice. > > Even 0.50001 is a lot slower: > > julia> for eΓ in 0.5:0.00001:0.50015 > println(eΓ) > gc() > @time ψs = propagate(P, ψ0, ψs, eΓ) > end > 0.5 > elapsed time: 0.065609581 seconds (16 bytes allocated) > 0.50001 > elapsed time: 0.875806461 seconds (16 bytes allocated) This is actually interesting and I can confirm the same here. `0.5` takes 28ms while 0.5000000000000001 takes 320ms. I was thinking whether the cpu is doing sth special depending on the bit pattern but I still don't understand why it would be very bad for a certain range. (and is not only a function of this either, also affected by all other values) > > > julia> versioninfo() > Julia Version 0.3.9 > Commit 31efe69 (2015-05-30 11:24 UTC) > Platform Info: > System: Darwin (x86_64-apple-darwin13.4.0) Good. so it's not Linux only. > CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz > WORD_SIZE: 64 > BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn) > LAPACK: libopenblas > LIBM: libopenlibm > LLVM: libLLVM-3.3 > > > > > On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote: >> >> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote: >> > P.S. Given how strange this problem is for me, I would appreciate if >> > anyone can confirm either this is a real issue or I'm somehow being >> > crazy or stupid. >> > >> >> One additional strange property of this issue is that I used to have >> much costy operations in the (outer) loop (the one that iterate over >> nsteps with i) like Fourier transformations. However, when the scaling >> factor is taking the bad value, it slows everything down (i.e. the >> Fourier transformation is also slower by ~10x). >> >> > >> > >> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote: >> >> Hi, >> >> >> >> I've just seen a very strange (for me) performance difference for >> >> exactly the same code on slightly different input with no explicit >> >> branches. >> >> >> >> The code is available here[1]. The most relavant part is the following >> >> function. (All other part of the code are for initialization and bench >> >> mark). This is a simplified version of my similation that compute the >> >> next array column in the array based on the previous one. >> >> >> >> The strange part is that the performance of this function can differ >> >> by 10x depend on the value of the scaling factor (`eΓ`, the only use >> >> of which is marked in the code below) even though I don't see any >> >> branches that depends on that value in the relavant code. (unless the >> >> cpu is 10x less efficient for certain input values) >> >> >> >> function propagate(P, ψ0, ψs, eΓ) >> >> @inbounds for i in 1:P.nele >> >> ψs[1, i, 1] = ψ0[1, i] >> >> ψs[2, i, 1] = ψ0[2, i] >> >> end >> >> T12 = im * sin(P.Ω) >> >> T11 = cos(P.Ω) >> >> @inbounds for i in 2:(P.nstep + 1) >> >> for j in 1:P.nele >> >> ψ_e = ψs[1, j, i - 1] >> >> ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor >> >> ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g >> >> ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e >> >> end >> >> end >> >> ψs >> >> end >> >> >> >> The output of the full script is attached and it can be clearly seen >> >> that for scaling factor 0.6-0.8, the performance is 5-10 times slower >> >> than others. >> >> >> >> The assembly[2] and llvm[3] code of this function is also in the same >> >> repo. I see the same behavior on both 0.3 and 0.4 and with LLVM 3.3 >> >> and LLVM 3.6 on two different x86_64 machine (my laptop and a linode >> >> VPS) (the only platform I've tried that doesn't show similar behavior >> >> is running julia 0.4 on qemu-arm....... although the performance >> >> between different values also differ by ~30% which is bigger than >> >> noise) >> >> >> >> This also seems to depend on the initial value. >> >> >> >> Has anyone seen similar problems before? >> >> >> >> Outputs: >> >> >> >> 325.821 milliseconds (25383 allocations: 1159 KB) >> >> 307.826 milliseconds (4 allocations: 144 bytes) >> >> 0.0 >> >> 19.227 milliseconds (2 allocations: 48 bytes) >> >> 0.1 >> >> 17.291 milliseconds (2 allocations: 48 bytes) >> >> 0.2 >> >> 17.404 milliseconds (2 allocations: 48 bytes) >> >> 0.3 >> >> 19.231 milliseconds (2 allocations: 48 bytes) >> >> 0.4 >> >> 20.278 milliseconds (2 allocations: 48 bytes) >> >> 0.5 >> >> 23.692 milliseconds (2 allocations: 48 bytes) >> >> 0.6 >> >> 328.107 milliseconds (2 allocations: 48 bytes) >> >> 0.7 >> >> 312.425 milliseconds (2 allocations: 48 bytes) >> >> 0.8 >> >> 201.494 milliseconds (2 allocations: 48 bytes) >> >> 0.9 >> >> 16.314 milliseconds (2 allocations: 48 bytes) >> >> 1.0 >> >> 16.264 milliseconds (2 allocations: 48 bytes) >> >> >> >> >> >> [1] >> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl >> >> [2] >> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S >> >> [2] >> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll