On Sun, Jul 12, 2015 at 10:30 PM, Yichao Yu <yyc1...@gmail.com> wrote:
> Further update:
>
> I made a C++ version [1] and see a similar effect (depending on
> optimization levels), so it's not a Julia issue (not that I think it
> really was to begin with...).
After investigating the C++ version more, I found that the difference
between the fast-math and the non-fast-math versions is that the
compiler emits a function called `set_fast_math` (disassembly below).
From what I can tell, the function sets bit 6 and bit 15 of the MXCSR
register (for SSE), and according to this page [1] these are the DAZ
(denormals-are-zero) and FZ (flush-to-zero) bits, both related to
underflow. The same page also describes denormals as taking
"considerably longer to process". Since the operation I have keeps
decreasing the value, I guess it makes sense that there's a
value-dependent performance... (and it kind of makes sense that the
FFT also suffers from these values.)

So now the questions are:

1. How important are underflow and denormal values? Note that I'm not
catching underflow explicitly anyway, and I don't really care about
values that are really small compared to 1.

2. Is there a way to set up the SSE registers as the C compilers do?
`@fastmath` does not seem to be doing this.

00000000000005b0 <set_fast_math>:
 5b0:   0f ae 5c 24 fc          stmxcsr -0x4(%rsp)
 5b5:   81 4c 24 fc 40 80 00    orl    $0x8040,-0x4(%rsp)
 5bc:   00
 5bd:   0f ae 54 24 fc          ldmxcsr -0x4(%rsp)
 5c2:   c3                      retq
 5c3:   66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
 5ca:   00 00 00
 5cd:   0f 1f 00                nopl   (%rax)

[1] http://softpixel.com/~cwright/programming/simd/sse.php

>
> The slowdown is present in the C++ version at all optimization
> levels except -Ofast and -ffast-math. The Julia version is faster
> than the default performance for both gcc and clang, but is slower
> than the fast case at higher optimization levels. For O2 and higher,
> the C++

The slowness of the Julia version seems to be due to multidimensional
arrays. Using a 1d array yields performance similar to C.

> version shows a ~100x slowdown in the slow case.
>
> @fastmath in Julia doesn't seem to have an effect on this, although
> it does for clang and gcc...
>
> [1]
> https://github.com/yuyichao/explore/blob/5a644cd46dc6f8056cee69f508f9e995b5839a01/julia/array_prop/propagate.cpp
>
> On Sun, Jul 12, 2015 at 9:23 PM, Yichao Yu <yyc1...@gmail.com> wrote:
>> Update:
>>
>> I've just got an even simpler version without any complex numbers
>> that only has Float64. The two loops are as small as the following
>> LLVM IR now, and there's only simple arithmetic in the loop body.
>>
>> ```llvm
>> L9.preheader:                     ; preds = %L12, %L9.preheader.preheader
>>   %"#s3.0" = phi i64 [ %60, %L12 ], [ 1, %L9.preheader.preheader ]
>>   br label %L9
>>
>> L9:                               ; preds = %L9, %L9.preheader
>>   %"#s4.0" = phi i64 [ %44, %L9 ], [ 1, %L9.preheader ]
>>   %44 = add i64 %"#s4.0", 1
>>   %45 = add i64 %"#s4.0", -1
>>   %46 = mul i64 %45, %10
>>   %47 = getelementptr double* %7, i64 %46
>>   %48 = load double* %47, align 8
>>   %49 = add i64 %46, 1
>>   %50 = getelementptr double* %7, i64 %49
>>   %51 = load double* %50, align 8
>>   %52 = fmul double %51, %3
>>   %53 = fmul double %38, %48
>>   %54 = fmul double %33, %52
>>   %55 = fadd double %53, %54
>>   store double %55, double* %50, align 8
>>   %56 = fmul double %38, %52
>>   %57 = fmul double %33, %48
>>   %58 = fsub double %56, %57
>>   store double %58, double* %47, align 8
>>   %59 = icmp eq i64 %"#s4.0", %12
>>   br i1 %59, label %L12, label %L9
>>
>> L12:                              ; preds = %L9
>>   %60 = add i64 %"#s3.0", 1
>>   %61 = icmp eq i64 %"#s3.0", %42
>>   br i1 %61, label %L14.loopexit, label %L9.preheader
>> ```
>>
>> On Sun, Jul 12, 2015 at 9:01 PM, Yichao Yu <yyc1...@gmail.com> wrote:
>>> On Sun, Jul 12, 2015 at 8:31 PM, Kevin Owens <kevin.j.ow...@gmail.com> wrote:
>>>> I can't really help you debug the IR code, but I can at least say
>>>> I'm seeing a similar thing. It starts to slow down just after 0.5,
>>>> and doesn't
>>>
>>> Thanks for the confirmation. At least I'm not crazy (or not the only
>>> one, to be more precise :P )
>>>
>>>> get back to where it was at 0.5 until 0.87.
>>>> Can you compare the IR code when two different values are used, to
>>>> see what's different? When I tried looking at the difference
>>>> between 0.50 and 0.51, the biggest thing that popped out to me was
>>>> that the numbers after "!dbg" were different.
>>>
>>> This is exactly the strange part. I don't think either LLVM or Julia
>>> is doing constant propagation here, and different input values
>>> should be using the same function. The evidence:
>>>
>>> 1. Julia only specializes on types, not values (for now).
>>> 2. The function is not inlined, which has to be the case in global
>>>    scope and can be double-checked by adding `@noinline`. It's also
>>>    too big to be inline_worthy.
>>> 3. No codegen is happening except on the first call. This can be
>>>    seen from the output of `@time`: only the first call has
>>>    thousands of allocations; the following calls have fewer than 5
>>>    (on 0.4 at least). If Julia could compile a function with fewer
>>>    than 5 allocations, I'd be very happy...
>>> 4. My original version also has much more complicated logic to
>>>    compute that scaling factor (OK, it's just a function call, but
>>>    with parameters gathered from different arguments). I'd be really
>>>    surprised if either LLVM or Julia could reason anything about
>>>    it...
>>>
>>> The difference in the debug info should just be an artifact of
>>> emitting it twice.
>>>
>>>>
>>>> Even 0.50001 is a lot slower:
>>>>
>>>> julia> for eΓ in 0.5:0.00001:0.50015
>>>>            println(eΓ)
>>>>            gc()
>>>>            @time ψs = propagate(P, ψ0, ψs, eΓ)
>>>>        end
>>>> 0.5
>>>> elapsed time: 0.065609581 seconds (16 bytes allocated)
>>>> 0.50001
>>>> elapsed time: 0.875806461 seconds (16 bytes allocated)
>>>
>>> This is actually interesting, and I can confirm the same here: `0.5`
>>> takes 28ms while 0.5000000000000001 takes 320ms. I was wondering
>>> whether the CPU does something special depending on the bit pattern,
>>> but I still don't understand why it would be very bad for a certain
>>> range
>>> (and it is not only a function of this value either; it's also
>>> affected by all the other values).
>>>
>>>>
>>>> julia> versioninfo()
>>>> Julia Version 0.3.9
>>>> Commit 31efe69 (2015-05-30 11:24 UTC)
>>>> Platform Info:
>>>>   System: Darwin (x86_64-apple-darwin13.4.0)
>>>
>>> Good, so it's not Linux-only.
>>>
>>>>   CPU: Intel(R) Core(TM)2 Duo CPU T8300 @ 2.40GHz
>>>>   WORD_SIZE: 64
>>>>   BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Penryn)
>>>>   LAPACK: libopenblas
>>>>   LIBM: libopenlibm
>>>>   LLVM: libLLVM-3.3
>>>>
>>>>
>>>> On Sunday, July 12, 2015 at 6:45:53 PM UTC-5, Yichao Yu wrote:
>>>>>
>>>>> On Sun, Jul 12, 2015 at 7:40 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>>>> > P.S. Given how strange this problem is for me, I would
>>>>> > appreciate it if anyone could confirm either that this is a real
>>>>> > issue or that I'm somehow being crazy or stupid.
>>>>> >
>>>>>
>>>>> One additional strange property of this issue is that I used to
>>>>> have much more costly operations in the (outer) loop (the one that
>>>>> iterates over nsteps with i), like Fourier transforms. However,
>>>>> when the scaling factor takes the bad value, it slows everything
>>>>> down (i.e. the Fourier transform is also slower by ~10x).
>>>>>
>>>>> >
>>>>> >
>>>>> > On Sun, Jul 12, 2015 at 7:30 PM, Yichao Yu <yyc...@gmail.com> wrote:
>>>>> >> Hi,
>>>>> >>
>>>>> >> I've just seen a very strange (for me) performance difference
>>>>> >> for exactly the same code on slightly different input, with no
>>>>> >> explicit branches.
>>>>> >>
>>>>> >> The code is available here [1]. The most relevant part is the
>>>>> >> following function. (All other parts of the code are for
>>>>> >> initialization and benchmarking.) This is a simplified version
>>>>> >> of my simulation that computes the next column in the array
>>>>> >> based on the previous one.
>>>>> >>
>>>>> >> The strange part is that the performance of this function can
>>>>> >> differ by 10x depending on the value of the scaling factor
>>>>> >> (`eΓ`, the only use of which is marked in the code below), even
>>>>> >> though I don't see any branches that depend on that value in
>>>>> >> the relevant code (unless the CPU is 10x less efficient for
>>>>> >> certain input values).
>>>>> >>
>>>>> >> function propagate(P, ψ0, ψs, eΓ)
>>>>> >>     @inbounds for i in 1:P.nele
>>>>> >>         ψs[1, i, 1] = ψ0[1, i]
>>>>> >>         ψs[2, i, 1] = ψ0[2, i]
>>>>> >>     end
>>>>> >>     T12 = im * sin(P.Ω)
>>>>> >>     T11 = cos(P.Ω)
>>>>> >>     @inbounds for i in 2:(P.nstep + 1)
>>>>> >>         for j in 1:P.nele
>>>>> >>             ψ_e = ψs[1, j, i - 1]
>>>>> >>             ψ_g = ψs[2, j, i - 1] * eΓ # <---- Scaling factor
>>>>> >>             ψs[2, j, i] = T11 * ψ_e + T12 * ψ_g
>>>>> >>             ψs[1, j, i] = T11 * ψ_g + T12 * ψ_e
>>>>> >>         end
>>>>> >>     end
>>>>> >>     ψs
>>>>> >> end
>>>>> >>
>>>>> >> The output of the full script is attached, and it can be
>>>>> >> clearly seen that for scaling factors 0.6-0.8 the performance
>>>>> >> is 5-10 times slower than for the others.
>>>>> >>
>>>>> >> The assembly [2] and LLVM [3] code of this function are also in
>>>>> >> the same repo. I see the same behavior on both 0.3 and 0.4,
>>>>> >> with LLVM 3.3 and LLVM 3.6, on two different x86_64 machines
>>>>> >> (my laptop and a Linode VPS). (The only platform I've tried
>>>>> >> that doesn't show similar behavior is Julia 0.4 on qemu-arm...
>>>>> >> although the performance between different values also differs
>>>>> >> by ~30% there, which is bigger than noise.)
>>>>> >>
>>>>> >> This also seems to depend on the initial value.
>>>>> >>
>>>>> >> Has anyone seen similar problems before?
>>>>> >>
>>>>> >> Outputs:
>>>>> >>
>>>>> >> 325.821 milliseconds (25383 allocations: 1159 KB)
>>>>> >> 307.826 milliseconds (4 allocations: 144 bytes)
>>>>> >> 0.0
>>>>> >> 19.227 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.1
>>>>> >> 17.291 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.2
>>>>> >> 17.404 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.3
>>>>> >> 19.231 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.4
>>>>> >> 20.278 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.5
>>>>> >> 23.692 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.6
>>>>> >> 328.107 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.7
>>>>> >> 312.425 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.8
>>>>> >> 201.494 milliseconds (2 allocations: 48 bytes)
>>>>> >> 0.9
>>>>> >> 16.314 milliseconds (2 allocations: 48 bytes)
>>>>> >> 1.0
>>>>> >> 16.264 milliseconds (2 allocations: 48 bytes)
>>>>> >>
>>>>> >>
>>>>> >> [1]
>>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/array_prop.jl
>>>>> >> [2]
>>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.S
>>>>> >> [3]
>>>>> >> https://github.com/yuyichao/explore/blob/e4be0151df33571c1c22f54fe044c929ca821c46/julia/array_prop/propagate.ll