P.S. Given how strange this problem is for me, I would appreciate it if anyone could confirm whether this is a real issue or I'm somehow being crazy or stupid.
On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc1...@gmail.com> wrote:
> Hi,
>
> I've just seen a very strange (for me) performance difference for
> exactly the same code on slightly different input with no explicit
> branches.
>
> The code is available here[1]. The most relevant part is the following
> function. (All other parts of the code are for initialization and
> benchmarking.) This is a simplified version of my simulation that
> computes the next column of the array based on the previous one.
>
> The strange part is that the performance of this function can differ
> by 10x depending on the value of the scaling factor (`eΓ`, the only
> use of which is marked in the code below), even though I don't see any
> branches that depend on that value in the relevant code (unless the
> CPU is 10x less efficient for certain input values).
>
> function propagate(P, ψ0, ψs, eΓ)
>     @inbounds for i in 1:P.nele
>         ψs[1, i, 1] = ψ0[1, i]
>         ψs[2, i, 1] = ψ0[2, i]
>     end
>     T12 = im * sin(P.Ω)
>     T11 = cos(P.Ω)
>     @inbounds for i in 2:(P.nstep + 1)
>         for j in 1:P.nele
>             ψ_e = ψs[1, j, i - 1]
>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>         end
>     end
>     ψs
> end
>
> The output of the full script is attached, and it can be clearly seen
> that for scaling factors 0.6-0.8 the function is 5-10 times slower
> than for the other values.
>
> The assembly[2] and LLVM[3] code of this function is also in the same
> repo. I see the same behavior on both Julia 0.3 and 0.4, with LLVM 3.3
> and LLVM 3.6, on two different x86_64 machines (my laptop and a Linode
> VPS). (The only platform I've tried that doesn't show similar behavior
> is Julia 0.4 on qemu-arm, although the performance between different
> values there also differs by ~30%, which is bigger than noise.)
>
> This also seems to depend on the initial value.
>
> Has anyone seen similar problems before?
>
> Outputs:
>
> 325.821 milliseconds (25383 allocations: 1159 KB)
> 307.826 milliseconds (4 allocations: 144 bytes)
> 0.0
> 19.227 milliseconds (2 allocations: 48 bytes)
> 0.1
> 17.291 milliseconds (2 allocations: 48 bytes)
> 0.2
> 17.404 milliseconds (2 allocations: 48 bytes)
> 0.3
> 19.231 milliseconds (2 allocations: 48 bytes)
> 0.4
> 20.278 milliseconds (2 allocations: 48 bytes)
> 0.5
> 23.692 milliseconds (2 allocations: 48 bytes)
> 0.6
> 328.107 milliseconds (2 allocations: 48 bytes)
> 0.7
> 312.425 milliseconds (2 allocations: 48 bytes)
> 0.8
> 201.494 milliseconds (2 allocations: 48 bytes)
> 0.9
> 16.314 milliseconds (2 allocations: 48 bytes)
> 1.0
> 16.264 milliseconds (2 allocations: 48 bytes)
>
> [1] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
> [2] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
> [3] https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll
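
For anyone who wants to try the timing comparison without cloning the repository, here is a minimal self-contained sketch. The body of `propagate` is copied from the quoted function; everything else is an assumption made for illustration and is not taken from the linked script: the `Params` struct, the `bench` helper, the array sizes, the value of Ω, and the initial state. It is written for current Julia (1.x) rather than the 0.3/0.4 versions mentioned in the post. Since the post notes that the effect also depends on the initial value, the slowdown may or may not show up with these particular choices.

```julia
# Hypothetical parameter container. The field names follow the quoted
# function; the struct itself is an assumption.
struct Params
    nele::Int     # number of elements per column
    nstep::Int    # number of propagation steps
    Ω::Float64    # mixing angle
end

# Loop body copied from the quoted post.
function propagate(P, ψ0, ψs, eΓ)
    @inbounds for i in 1:P.nele
        ψs[1, i, 1] = ψ0[1, i]
        ψs[2, i, 1] = ψ0[2, i]
    end
    T12 = im * sin(P.Ω)
    T11 = cos(P.Ω)
    @inbounds for i in 2:(P.nstep + 1)
        for j in 1:P.nele
            ψ_e = ψs[1, j, i - 1]
            ψ_g = ψs[2, j, i - 1] * eΓ  # scaling factor
            ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
            ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
        end
    end
    ψs
end

# Hypothetical benchmark helper: times one call per scaling factor.
# Sizes, Ω, and the initial state are illustrative assumptions.
function bench(; nele = 32, nstep = 10_000, Ω = 0.1)
    P = Params(nele, nstep, Ω)
    ψ0 = zeros(ComplexF64, 2, nele)
    ψ0[1, :] .= 1.0                      # all amplitude in the first component
    ψs = zeros(ComplexF64, 2, nele, nstep + 1)
    propagate(P, ψ0, ψs, 1.0)            # warm-up call so compilation is not timed
    for eΓ in 0.0:0.1:1.0
        t = @elapsed propagate(P, ψ0, ψs, eΓ)
        println(eΓ, "  ", round(t * 1e3, digits = 3), " ms")
    end
end

bench()
```

The output buffer `ψs` is allocated once and reused for every scaling factor, so allocation cost stays out of the timed region and only the loop itself is measured.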