Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Valentin,

Thank you so much for this valuable information. I would define decent, as the Julia community does with respect to Julia itself, as being within a small factor of the best performance. As Tim Holy demonstrated with his example, these two implementations run about a factor of 30 apart:

    Fe += Ns * (f * Jac);            # 30 times slower than:

    for k = 1:length(Fe)
        Fe[k] += Ns[k] * (f * Jac);  # this expanded loop
    end

So that seems like a lot. As you point out, if it comes down to choosing between the two pains (losing readable code, or losing a lot of performance), then the loops win by a lot, and one might be willing to write it this way.

Petr
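The factor-of-30 gap in the snippet above comes down to heap allocation: the vectorized form builds fresh arrays on every iteration of the hot loop, while the expanded loop updates Fe in place. A minimal, self-contained sketch of the two forms (the function names are illustrative, not from the thread's FE code):

```julia
# Vectorized form: `Ns * (f * Jac)` and the subsequent `+` each allocate
# a new array on every iteration of the hot loop.
function accumulate_vectorized(Fe, Ns, f, Jac, N)
    for i in 1:N
        Fe = Fe + Ns * (f * Jac)   # two array temporaries per iteration
    end
    return Fe
end

# Devectorized form: the scalar f * Jac stays in a register and Fe is
# mutated in place, so the loop body allocates nothing.
function accumulate_loop!(Fe, Ns, f, Jac, N)
    for i in 1:N
        s = f * Jac
        for k in 1:length(Fe)
            Fe[k] += Ns[k] * s
        end
    end
    return Fe
end
```

Both compute the same numbers; `@time` on the first shows allocations growing with N, while the second stays flat.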
Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Dear Petr,

> I thought the point of having the array objects and the associated
> manipulation functions was to hide the loops while delivering decent
> performance...

This assumption is very true in Matlab. Matlab has spent an enormous amount of engineering effort on optimizing code that uses vectorization and array objects (and it is still slow). There have been ongoing discussions over on Github [1] on how to improve the current situation in Julia. One of the major selling points of Julia for me is that it is quite transparent about which kinds of optimizations it requires. I can write in a very dynamic style with a lot of matrix operations, like in Matlab, and still get decent performance, or I can go in and identify with tools like @profile [2] what the pain points in my program are.

The point of using vectorized operations in Matlab is that it is the one reliable way to get good performance in Matlab, because all the underlying functions are written in C. In Julia most underlying functions are written in Julia (note that most mathematical operations call out to C libraries; no need to reinvent the wheel). There is a Julia package you can use when porting over Matlab code that devectorizes your code [3], but if the operation occurs in a hot loop it will still be a good and necessary optimization to write out the vectorized code as loops. Then you no longer get just decent performance, but excellent performance instead.

Best,
Valentin

PS: You can also use @which 1 .* [1] and @edit to see where the operation is defined and how it is implemented. The .* operation is implemented by allocating an output array and then running * elementwise over the arrays. No magic is going on there.

[1] https://github.com/JuliaLang/julia/issues/8450
[2] http://docs.julialang.org/en/release-0.3/stdlib/profile/
[3] https://github.com/lindahua/Devectorize.jl
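A footnote from the vantage of later Julia versions (0.6 onward, well after this thread): dotted operators fuse into a single in-place broadcast loop, so the readable vectorized form no longer has to allocate temporaries. A minimal sketch:

```julia
# With dot fusion, `Fe .+= Ns .* s` compiles to one loop that writes
# directly into Fe -- no intermediate array is built for `Ns .* s`.
function accumulate_fused!(Fe, Ns, f, Jac, N)
    for i in 1:N
        Fe .+= Ns .* (f * Jac)
    end
    return Fe
end
```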
Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Thanks. Now my head is really spinning!

See, before I posted the original question I tried expanding the loop in the actual FE code, and the code was SLOWER and used MORE memory:

With the expression

    Fe += Ns[j] * (f * Jac * w[j]);

6.223416655 seconds (1648832052 bytes allocated)

With the expanded loop

    for kx = 1:length(Fe)   # alternative (devectorization)
        Fe[kx] += Ns[j][kx] * (f * Jac * w[j]);
    end

7.340272676 seconds (1776971604 bytes allocated)

In addition, your argument clearly demonstrates how to avoid the temporary array for doit1(), but doit2() adds only one additional 1x1 temporary to the 3x1 (it seems to me), yet it is about 14 times slower. Why is that?

Finally, if the only way I can get decent performance with lines like

    Fe += Ns[j] * (f * Jac * w[j]);   # Fe and Ns[j] arrays

is to manually write out all the loops, that would be terrible news indeed. Not only is that a lot of work when rewriting loads of Matlab code (with several matrices concatenated in many, many expressions), but the legibility and maintainability tank. I thought the point of having the array objects and the associated manipulation functions was to hide the loops while delivering decent performance...

Petr
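One plausible reason the expanded loop above was no faster: the scalar product f * Jac * w[j] is recomputed on every pass of the innermost loop, and Ns[j][kx] re-indexes the array of arrays each time. Hoisting both out restores the expected win. A sketch with illustrative names (Nsj standing in for Ns[j]; this is a guess at the fix, not code from the thread's full FE program):

```julia
# Hypothetical per-quadrature-point kernel: compute the scalar once,
# then run the allocation-free accumulation loop over it.
function element_load!(Fe, Nsj, f, Jac, wj)
    s = f * Jac * wj            # one scalar product per quadrature point
    for kx in 1:length(Fe)
        Fe[kx] += Nsj[kx] * s   # in-place accumulation, no temporaries
    end
    return Fe
end
```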
Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Multiplying two Float64s yields another Float64; most likely, this will be stored in the CPU's registers. In contrast, [f]*Jac creates an array, on each iteration, that has to be stored on the heap.

A faster variant devectorizes:

    function doit1a(N)
        Fe=zeros(3,1); Ns=zeros(3,1)+1.0; f=-6.0; Jac=1.0;
        for i=1:N
            tmp = f*Jac
            for j = 1:length(Fe)
                Fe[j] += Ns[j]*tmp
            end
        end
        Fe
    end

julia> @time doit1(N);
elapsed time: 0.810270399 seconds (384000320 bytes allocated, 61.23% gc time)

julia> @time doit1a(N);
elapsed time: 0.022118726 seconds (320 bytes allocated)

Note the tiny allocations in the second case.

--Tim
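Tim's register-vs-heap distinction can be checked directly with @allocated (the two helper functions here are illustrative, not from the thread):

```julia
scalar_mul(f, Jac) = f * Jac      # Float64 * Float64: stays in registers
array_mul(f, Jac)  = [f] * Jac    # builds [f], then allocates the result

# Warm up so compilation does not count toward the measurement.
scalar_mul(-6.0, 1.0); array_mul(-6.0, 1.0)

bytes_scalar = @allocated scalar_mul(-6.0, 1.0)
bytes_array  = @allocated array_mul(-6.0, 1.0)
```

On a typical build bytes_scalar comes out at zero while bytes_array does not, which is exactly the per-iteration cost Tim points at.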
Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
https://gist.github.com/anonymous/054f6fd269f97a4442db
Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
The code is really short:

    N=200

    function doit1(N)
        Fe=zeros(3,1); Ns=zeros(3,1)+1.0; f=-6.0; Jac=1.0;
        for i=1:N
            Fe += Ns * (f * Jac);
        end
        Fe
    end

    function doit2(N)
        Fe=zeros(3,1); Ns=zeros(3,1)+1.0; f=-6.0; Jac=1.0;
        for i=1:N
            Fe += Ns .* ([f] * Jac);
        end
        Fe
    end

    function doit3(N)
        Fe=zeros(3,1); Ns=zeros(3,1)+1.0; f=-6.0; Jac=1.0;
        fs=[1.0]
        for i=1:N
            fs = ([f] * Jac); Fe += Ns .* fs;
        end
        Fe
    end

    function doit4(N)
        Fe=zeros(3,1); Ns=zeros(3,1)+1.0; f=-6.0; Jac=1.0;
        fs=[1.0]
        for i=1:N
            fs = [f]; fs *= Jac; Fe += Ns .* fs;
        end
        Fe
    end

    @time doit1(N)
    @time doit2(N)
    @time doit3(N)
    @time doit4(N)

Here are the measurements on my machine:

elapsed time: 0.461173619 seconds (384129944 bytes allocated, 44.45% gc time)
elapsed time: 6.905249901 seconds (1904151036 bytes allocated, 12.84% gc time)
elapsed time: 6.862259146 seconds (1904166072 bytes allocated, 13.23% gc time)
elapsed time: 6.842678542 seconds (1904166256 bytes allocated, 12.81% gc time)

I'll be grateful for any pointers as to how to structure the code so that the array operations do not incur this horrendous penalty.

Petr
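The doit1-vs-doit2 gap isolates to one distinction: [f] * Jac is a one-element array, not a scalar, so Ns .* ([f] * Jac) goes through the generic element-wise machinery with its extra temporaries, while Ns * (f * Jac) is plain scalar scaling. In isolation (a sketch, not the thread's benchmark):

```julia
Ns = ones(3)
f, Jac = -6.0, 1.0

a = Ns * (f * Jac)       # scalar scaling: one result array
b = Ns .* ([f] * Jac)    # same numbers, but via a 1-element array operand
```

The results match exactly, which is why the slow form is so easy to write by accident when porting Matlab code that wraps scalars in 1x1 matrices.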
Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Can you post short but complete code examples in a gist? https://gist.github.com/ That would make it easier to follow than chasing code examples across long email chains.

--Tim
[julia-users] Re: Finite element code in Julia: Curious overhead in .* product
I don't know if this is correct, but here is a guess:

Option 3 still requires a temp array (to hold the result of the parenthesized expression fs = ([f] * Jac * w[j]);), and option 4 eliminates that temp. The cost of the temp over the 2 million loops is ~200MB and 0.6 sec CPU time. So WHY is the difference between 1 and 2 so HUUUGE?

I think this calls for someone who wrote the compiler. Guys?

Thanks a bunch,

P
[julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Actually: option (4) was also tested:

    # 16.333766821 seconds (3008899660 bytes
    fs[1] = f; fs *= (Jac * w[j]);
    Fe += Ns[j] .* fs;

So, allocation of memory was reduced somewhat, runtime not so much.
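A note on why option (4) could not eliminate the temporary: for arrays, fs *= Jac is just shorthand for fs = fs * Jac, which allocates a fresh array and rebinds the name fs; the original array is never mutated. A small standalone demonstration (not from the thread):

```julia
fs  = [-6.0]
fs0 = fs              # second reference to the same array
fs *= 2.0             # rebinds fs to a newly allocated array

rebound = (fs0 == [-6.0])   # the original array is unchanged: *= did not mutate it
fs0[1] *= 2.0               # a true in-place update mutates the array itself
```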
[julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Although... looking closer at my suggestion, it appears to leak. Anyone know what I'm missing?

On Tuesday, December 9, 2014 7:31:02 AM UTC+1, Robert Gates wrote:
>
> Hi Petr,
>
> I wrote you an email.
>
> Although I haven't seen your Julia implementation, I believe you could cut
> down on GC time by running the GC at regular intervals only. For example,
> when I loop over something, I usually choose to purge memory every 20 MB,
> e.g. every 1000 iterations. That way, I have seen 10x speed-ups for
> GC-heavy problems. For example:
> https://gist.github.com/rleegates/c305c0fec0b963257cba
>
> Best regards,
>
> Robert
>
> On Monday, December 8, 2014 9:09:19 PM UTC+1, Petr Krysl wrote:
>>
>> I posted a message yesterday about a simple port of fragments of a finite
>> element code to solve the heat conduction equation. The Julia port was
>> compared to the original Matlab toolkit FinEALE and to the Julia code
>> presented last year by Amuthan:
>>
>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/Krysl%7Csort:relevance/julia-users/3tTljDSQ6cs/-8UCPnNmzn4J
>>
>> I have now sped the code up some more, so that we have this table (on my
>> laptop: i7, 16 GB of memory):
>>
>>     Amuthan's     29 seconds
>>     J FinEALE     58 seconds
>>     FinEALE      810 seconds
>>
>> So, given that Amuthan's code is reported to be slower than a
>> general-purpose C++ code by a factor of around 1.36, J FinEALE is
>> presumably slower than an equivalent FE solver coded in C++ by a factor
>> of about 2.7. So far so good.
>>
>> The curious thing is that @time reports huge amounts of memory allocated
>> (and something like 10% GC time).
>> One particular source of wild memory allocation was this line (executed
>> in this case 2 million times):
>>
>>     Fe += Ns[j] .* (f * Jac * w[j]);
>>
>> where
>>     Fe        = 3-by-1 matrix
>>     Ns[j]     = 3-by-1 matrix
>>     f         = 1-by-1 matrix
>>     Jac, w[j] = scalars
>>
>> The cost of the operation that encloses this line (among many others):
>>     19.835263928 seconds (4162094480 bytes allocated, 16.79% gc time)
>>
>> Changing the one-by-one matrix f into a scalar (and replacing .* with *):
>>
>>     Fe += Ns[j] * (f * Jac * w[j]);
>>
>> changed the cost quite drastically:
>>     10.105620394 seconds (2738120272 bytes allocated, 21.33% gc time)
>>
>> Any ideas, you Julian wizards?
>>
>> Thanks,
>>
>> Petr
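[Editorial note] Robert's interval-GC trick above can be sketched roughly as follows (assuming the Julia 0.3-era `gc()`, `gc_enable()`, and `gc_disable()` functions; the iteration count and interval are made-up placeholders, and the linked gist's actual approach may differ):

```julia
gc_disable()                  # suspend the collector during the hot loop
for i = 1:2_000_000
    # ... element-level work that allocates small short-lived objects ...
    if i % 1000 == 0          # purge accumulated garbage every 1000 iters
        gc_enable()
        gc()
        gc_disable()
    end
end
gc_enable()                   # never leave the collector switched off
```

This doesn't reduce the allocation itself; it only amortizes the collector's bookkeeping over fewer, larger collections, which is why it helps GC-heavy loops.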
[julia-users] Re: Finite element code in Julia: Curious overhead in .* product
Hi Robert,

At this point I am really impressed with Julia. Consequently, I am tempted to rewrite my Matlab toolkit FinEALE (https://github.com/PetrKryslUCSD/FinEALE) in Julia and to implement further research ideas in this new environment.

What did you have in mind?

Petr
pkr...@ucsd.edu

On Monday, December 8, 2014 6:41:02 PM UTC-8, Robert Gates wrote:
>
> Hi Petr,
>
> just on a side note: are you planning on implementing a complete FE
> package in Julia? In that case we could pool our efforts.