I don't know if this is correct, but here is a guess: option (3) still requires a temporary array to hold the result of the parenthesized expression in fs= ([f] * Jac * w[j]);, and option (4) eliminates that temporary. Over the 2 million loop iterations, that temporary costs about 200 MB and 0.6 sec of CPU time. So WHY is the difference between options (1) and (2) so HUUUGE?
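One way to check the guess might be a small stand-alone benchmark along these lines. It is only a sketch, assuming a current Julia 1.x release: the arrays Ns and w and the values of f and Jac are made-up stand-ins, not the real finite element data, and the function names are hypothetical. Note that in Julia fs *= x is shorthand for fs = fs * x, so option (4) as written still allocates a new array on every pass; the third function below instead writes into a preallocated buffer and updates Fe in place with .+=.

```julia
# Sketch only: Ns, w, f and Jac are made-up stand-ins for the quantities in the thread.

# Option (1): the factor is a plain scalar.
function accumulate_scalar(Ns, w, f, Jac)
    Fe = zeros(length(Ns[1]))
    for j in eachindex(Ns)
        Fe += Ns[j] * (f * Jac * w[j])
    end
    return Fe
end

# Option (2): the factor is a one-element array, rebuilt on every pass.
function accumulate_array(Ns, w, f, Jac)
    Fe = zeros(length(Ns[1]))
    for j in eachindex(Ns)
        Fe += Ns[j] .* ([f] * Jac * w[j])
    end
    return Fe
end

# Variant of option (4): reuse one buffer and update Fe in place.
function accumulate_inplace(Ns, w, f, Jac)
    Fe = zeros(length(Ns[1]))
    fs = [0.0]                      # allocated once, outside the loop
    for j in eachindex(Ns)
        fs[1] = f * Jac * w[j]      # overwrite the buffer, no new array
        Fe .+= Ns[j] .* fs          # fused broadcast, writes into Fe
    end
    return Fe
end

npts = 1000                         # hypothetical number of quadrature points
Ns = [rand(8) for _ in 1:npts]      # hypothetical basis function values
w = rand(npts)                      # hypothetical quadrature weights
f, Jac = 2.0, 0.5                   # hypothetical load value and Jacobian

# Call each function once so that compilation is not part of the measurement.
accumulate_scalar(Ns, w, f, Jac)
accumulate_array(Ns, w, f, Jac)
accumulate_inplace(Ns, w, f, Jac)

bytes_scalar  = @allocated accumulate_scalar(Ns, w, f, Jac)
bytes_array   = @allocated accumulate_array(Ns, w, f, Jac)
bytes_inplace = @allocated accumulate_inplace(Ns, w, f, Jac)
println("scalar: ", bytes_scalar, "  array: ", bytes_array, "  in place: ", bytes_inplace)
```

Comparing the three byte counts should show how much of the gap comes from the small temporaries alone. It would not by itself explain the runtime difference between options (1) and (2) in the real code.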
I think this calls for someone who wrote the compiler. Guys?

Thanks a bunch,

P

On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote:
>
> Actually: option (4) was also tested:
> # 16.333766821 seconds (3008899660 bytes
> fs[1]= f; fs *= (Jac * w[j]);
> Fe += Ns[j] .* fs;
>
> So, allocation of memory was reduced somewhat, runtime not so much.
>
> On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:
>
>> Well, temporary array was also on my mind. However, things are I
>> believe a little bit more complicated.
>>
>> Here is the code with three timed options. As you can see, the first two
>> options are the fast one (multiplication with a scalar) and the slow one
>> (multiplication with a one by one matrix). In the third option I tried to
>> avoid the creation of an ad hoc temporary by allocating a variable outside
>> of the loop. The effect unfortunately is nil.
>>
>> fs=[0.0]# Used only for option (3)
>> # Now loop over all fes in the block
>> for i=1:size(conns,1)
>>     ...
>>     for j=1:npts
>>         ...
>>         # Option (1): 7.193767019 seconds (1648850568 bytes
>>         # Fe += Ns[j] * (f * Jac * w[j]); #
>>         # Option (2): 17.301214583 seconds (3244458368 bytes
>>         # Fe += Ns[j] .* ([f] * Jac * w[j]); #
>>         # Option (3): 16.943314075 seconds (3232879120
>>         fs= ([f] * Jac * w[j]); Fe += Ns[j] .* fs;
>>     end
>>     ...
>> end
>>
>> What do you think? Why is the code still getting hit with a big
>> performance/memory penalty?
>>
>> Thanks,
>>
>> Petr
>>
>> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:
>>
>>> I would think that when f is a 1x1 matrix Julia is allocating a new 1x1
>>> matrix to store the result. If it is a scalar that allocation can be
>>> skipped. When this part of the code is now in a hot loop it might happen
>>> that you allocate millions of very small short-lived objects and that taxes
>>> the GC quite a lot.