Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-11 Thread Petr Krysl
Valentin,

Thank you so much for this valuable information. 

I would define decent, as does the Julia community with respect to Julia 
itself, as being within a small factor of the best performance.  As Tim 
Holy demonstrated with his example, these two implementations run about a 
factor of 30 apart:

Fe += Ns * (f * Jac);            # 30 times slower than

for k = 1:length(Fe)
    Fe[k] += Ns[k] * (f * Jac);  # this expanded loop
end

So that seems like a lot.  As you point out, one might be willing to write 
it this way if, forced to choose between two pains (losing readable code or 
losing a lot of performance), the loops won by this much.
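
One possible middle ground (a sketch, not something from the thread; the 
helper name addscaled! is made up) is to hide the loop in a tiny function so 
the call sites stay readable while keeping the devectorized speed:

    # Hypothetical helper: accumulates Fe[k] += Ns[k]*s without temporaries.
    function addscaled!(Fe, Ns, s)
        for k = 1:length(Fe)
            Fe[k] += Ns[k] * s
        end
        return Fe
    end

    # The call site stays close to the vectorized original:
    addscaled!(Fe, Ns, f * Jac)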

Petr

On Thursday, December 11, 2014 12:18:15 AM UTC-8, Valentin Churavy wrote:
>
> Dear Petr,
>
> I thought the point of having the array objects and the associated 
>> manipulation functions was to hide the loops while delivering decent 
>> performance...
>
>
> This assumption is very true in Matlab. Matlab has spent an enormous 
> amount of engineering effort on optimizing code that uses vectorization and 
> array objects (and it is still slow). There have been ongoing discussions on 
> GitHub [1] about how to improve the current situation in Julia. One of the 
> major selling points of Julia for me is the fact that it is quite 
> transparent about which kinds of optimizations it requires. I can write in a 
> very dynamic style with a lot of matrix operations, as in Matlab, and still 
> get decent performance, or I can go in and identify with tools like @profile 
> [2] what the pain points in my program are. 
>
> The point of using vectorized operations in Matlab is that it is the one 
> reliable way to get good performance in Matlab, because all the underlying 
> functions are written in C. In Julia most underlying functions are written 
> in Julia (note that most mathematical operations call out to C libraries; 
> no need to reinvent the wheel). There is a Julia package you can use when 
> porting over Matlab code that devectorizes your code [3], but if the 
> operation occurs in a hot loop it will still be a good and necessary 
> optimization to unroll the vectorized code. Then you no longer get just 
> decent performance, but excellent performance instead. 
>
> Best Valentin
>
> PS:
> You can also use @which 1 .* [1] and @edit to see where the operation is 
> defined and how it is implemented. The .* operation is implemented by 
> allocating an output array and then running * element-wise over the arrays. 
> No magic is going on there.
>
> [1] https://github.com/JuliaLang/julia/issues/8450 
> [2] http://docs.julialang.org/en/release-0.3/stdlib/profile/
> [3] https://github.com/lindahua/Devectorize.jl
>
> On Thursday, 11 December 2014 06:23:29 UTC+1, Petr Krysl wrote:
>>
>> Thanks.  Now my head is really spinning!
>>
>> See, before I posted the original question I tried expanding the loop in 
>> the actual FE code, and the code was SLOWER and was using MORE memory:
>>
>> With the expression 
>> Fe += Ns[j] * (f * Jac * w[j]); :
>> 6.223416655 seconds (1648832052 bytes
>>
>> With the expanded loop 
>> for kx=1:length(Fe) # alternative (devectorization)
>>  Fe[kx] += Ns[j][kx] * (f * Jac * w[j]); 
>> end 
>>  7.340272676 seconds (1776971604 bytes allocated,
>>
>> In addition, your argument clearly demonstrates how to avoid the 
>> temporary array for doit1(), but doit2() adds only one additional 1x1 
>> temporary on top of the 3x1 (it seems to me), yet it is about 14 times 
>> slower. Why is that?
>>
>> Finally, if the only way I can get decent performance with lines like
>>
>>  Fe += Ns[j] * (f * Jac * w[j]);  # Fe and Ns[j] arrays
>>
>> is to manually write out all the loops, that would be terrible news 
>> indeed.  Not only is that a lot of work when rewriting loads of Matlab code 
>> (with several matrices concatenated in many, many expressions), but the 
>> legibility and maintainability tank. I thought the point of having the 
>> array objects and the associated manipulation functions was to hide the 
>> loops while delivering decent performance...
>>
>> Petr
>>
>>
>> On Wednesday, December 10, 2014 8:28:31 PM UTC-8, Tim Holy wrote:
>>>
>>> Multiplying two Float64s yields another Float64; most likely, this will 
>>> be 
>>> stored in the CPU's registers. In contrast,  [f]*Jac creates an array, 
>>> on each 
>>> iteration, that has to be stored on the heap. 
>>>
>>> A faster variant devectorizes: 
>>>function doit1a(N) 
>>>Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>>>for i=1:N 
>>>tmp = f*Jac 
>>>for j = 1:length(Fe) 
>>>Fe[j] += Ns[j]*tmp 
>>>end 
>>>end 
>>>Fe 
>>>end 
>>>
>>> julia> @time doit1(N); 
>>> elapsed time: 0.810270399 seconds (384000320 bytes allocated, 61.23% gc 
>>> time) 
>>>
>>> julia> @time doit1a(N); 
>>> elapsed time: 0.022118726 seconds (320 bytes allocated) 
>>>
>>> Note the tiny allocations in the second case. 
>>>
>>> --Tim 
>>>
>

Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-11 Thread Valentin Churavy
Dear Petr,

I thought the point of having the array objects and the associated 
> manipulation functions was to hide the loops while delivering decent 
> performance...


This assumption is very true in Matlab. Matlab has spent an enormous 
amount of engineering effort on optimizing code that uses vectorization and 
array objects (and it is still slow). There have been ongoing discussions on 
GitHub [1] about how to improve the current situation in Julia. One of the 
major selling points of Julia for me is the fact that it is quite 
transparent about which kinds of optimizations it requires. I can write in a 
very dynamic style with a lot of matrix operations, as in Matlab, and still 
get decent performance, or I can go in and identify with tools like @profile 
[2] what the pain points in my program are. 
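
A minimal sketch of that workflow (on Julia 0.3 the profiler is built into 
Base; newer versions need a preceding "using Profile"):

    @profile doit1(N)       # sample the hot function
    Profile.print()         # line-by-line sample counts show the hot spots
    Profile.clear()         # reset before the next measurement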

The point of using vectorized operations in Matlab is that it is the one 
reliable way to get good performance in Matlab, because all the underlying 
functions are written in C. In Julia most underlying functions are written 
in Julia (note that most mathematical operations call out to C libraries; 
no need to reinvent the wheel). There is a Julia package you can use when 
porting over Matlab code that devectorizes your code [3], but if the 
operation occurs in a hot loop it will still be a good and necessary 
optimization to unroll the vectorized code. Then you no longer get just 
decent performance, but excellent performance instead. 
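
For reference, a sketch of what [3] looks like in use (based on the macros 
the package documents; treat the exact forms as an assumption):

    using Devectorize
    # @devec compiles the vectorized expression into an explicit loop:
    @devec r = a .* b + c
    # writing into a preallocated array also avoids the result allocation:
    @devec r[:] = a .* b + c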

Best Valentin

PS:
You can also use @which 1 .* [1] and @edit to see where the operation is 
defined and how it is implemented. The .* operation is implemented by 
allocating an output array and then running * element-wise over the arrays. 
No magic is going on there.
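
In outline, the out-of-place pattern looks like this (a simplification for 
same-length arrays, not the actual Base method):

    function elementwise_mul(A, B)
        C = similar(A)               # one fresh output array per call
        for i = 1:length(A)
            C[i] = A[i] * B[i]       # plain * applied element-wise
        end
        return C
    end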

[1] https://github.com/JuliaLang/julia/issues/8450 
[2] http://docs.julialang.org/en/release-0.3/stdlib/profile/
[3] https://github.com/lindahua/Devectorize.jl

On Thursday, 11 December 2014 06:23:29 UTC+1, Petr Krysl wrote:
>
> Thanks.  Now my head is really spinning!
>
> See, before I posted the original question I tried expanding the loop in 
> the actual FE code, and the code was SLOWER and was using MORE memory:
>
> With the expression 
> Fe += Ns[j] * (f * Jac * w[j]); :
> 6.223416655 seconds (1648832052 bytes
>
> With the expanded loop 
> for kx=1:length(Fe) # alternative (devectorization)
>  Fe[kx] += Ns[j][kx] * (f * Jac * w[j]); 
> end 
>  7.340272676 seconds (1776971604 bytes allocated,
>
> In addition, your argument clearly demonstrates how to avoid the temporary 
> array for doit1(), but doit2() adds only one additional 1x1 temporary on 
> top of the 3x1 (it seems to me), yet it is about 14 times slower. Why is that?
>
> Finally, if the only way I can get decent performance with lines like
>
>  Fe += Ns[j] * (f * Jac * w[j]);  # Fe and Ns[j] arrays
>
> is to manually write out all the loops, that would be terrible news 
> indeed.  Not only is that a lot of work when rewriting loads of Matlab code 
> (with several matrices concatenated in many, many expressions), but the 
> legibility and maintainability tank. I thought the point of having the 
> array objects and the associated manipulation functions was to hide the 
> loops while delivering decent performance...
>
> Petr
>
>
> On Wednesday, December 10, 2014 8:28:31 PM UTC-8, Tim Holy wrote:
>>
>> Multiplying two Float64s yields another Float64; most likely, this will 
>> be 
>> stored in the CPU's registers. In contrast,  [f]*Jac creates an array, on 
>> each 
>> iteration, that has to be stored on the heap. 
>>
>> A faster variant devectorizes: 
>>function doit1a(N) 
>>Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>>for i=1:N 
>>tmp = f*Jac 
>>for j = 1:length(Fe) 
>>Fe[j] += Ns[j]*tmp 
>>end 
>>end 
>>Fe 
>>end 
>>
>> julia> @time doit1(N); 
>> elapsed time: 0.810270399 seconds (384000320 bytes allocated, 61.23% gc 
>> time) 
>>
>> julia> @time doit1a(N); 
>> elapsed time: 0.022118726 seconds (320 bytes allocated) 
>>
>> Note the tiny allocations in the second case. 
>>
>> --Tim 
>>
>>
>> On Wednesday, December 10, 2014 07:54:00 PM Petr Krysl wrote: 
>> > The code is really short: 
>> > 
>> > N=200 
>> > 
>> > function doit1(N) 
>> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>> > for i=1:N 
>> > Fe += Ns *  (f * Jac); 
>> > end 
>> > Fe 
>> > end 
>> > 
>> > function doit2(N) 
>> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>> > for i=1:N 
>> > Fe += Ns .*  ([f] * Jac); 
>> > end 
>> > Fe 
>> > end 
>> > 
>> > function doit3(N) 
>> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>> > fs=[1.0] 
>> > for i=1:N 
>> > fs= ([f] * Jac );   Fe += Ns .*  fs; 
>> > end 
>> > Fe 
>> > end 
>> > 
>> > function doit4(N) 
>> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>> > fs=[1.0] 

Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Petr Krysl
Thanks.  Now my head is really spinning!

See, before I posted the original question I tried expanding the loop in 
the actual FE code, and the code was SLOWER and was using MORE memory:

With the expression 
Fe += Ns[j] * (f * Jac * w[j]); :
6.223416655 seconds (1648832052 bytes

With the expanded loop 
for kx=1:length(Fe) # alternative (devectorization)
 Fe[kx] += Ns[j][kx] * (f * Jac * w[j]); 
end 
 7.340272676 seconds (1776971604 bytes allocated,
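
One guess, consistent with Tim's doit1a below: hoist the scalar product out 
of the innermost loop so it is computed once per quadrature point, and make 
sure f, Jac and w come from concretely typed fields, since products of 
values with abstract types are boxed and allocate.

    c = f * Jac * w[j]               # compute the scalar once
    for kx = 1:length(Fe)
        Fe[kx] += Ns[j][kx] * c      # no per-element recomputation
    end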

In addition, your argument clearly demonstrates how to avoid the temporary 
array for doit1(), but doit2() adds only one additional 1x1 temporary on top 
of the 3x1 (it seems to me), yet it is about 14 times slower. Why is that?

Finally, if the only way I can get decent performance with lines like

 Fe += Ns[j] * (f * Jac * w[j]);  # Fe and Ns[j] arrays

is to manually write out all the loops, that would be terrible news indeed. 
Not only is that a lot of work when rewriting loads of Matlab code (with 
several matrices concatenated in many, many expressions), but the legibility 
and maintainability tank. I thought the point of having the array objects 
and the associated manipulation functions was to hide the loops while 
delivering decent performance...

Petr


On Wednesday, December 10, 2014 8:28:31 PM UTC-8, Tim Holy wrote:
>
> Multiplying two Float64s yields another Float64; most likely, this will be 
> stored in the CPU's registers. In contrast,  [f]*Jac creates an array, on 
> each 
> iteration, that has to be stored on the heap. 
>
> A faster variant devectorizes: 
>function doit1a(N) 
>Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
>for i=1:N 
>tmp = f*Jac 
>for j = 1:length(Fe) 
>Fe[j] += Ns[j]*tmp 
>end 
>end 
>Fe 
>end 
>
> julia> @time doit1(N); 
> elapsed time: 0.810270399 seconds (384000320 bytes allocated, 61.23% gc 
> time) 
>
> julia> @time doit1a(N); 
> elapsed time: 0.022118726 seconds (320 bytes allocated) 
>
> Note the tiny allocations in the second case. 
>
> --Tim 
>
>
> On Wednesday, December 10, 2014 07:54:00 PM Petr Krysl wrote: 
> > The code is really short: 
> > 
> > N=200 
> > 
> > function doit1(N) 
> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
> > for i=1:N 
> > Fe += Ns *  (f * Jac); 
> > end 
> > Fe 
> > end 
> > 
> > function doit2(N) 
> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
> > for i=1:N 
> > Fe += Ns .*  ([f] * Jac); 
> > end 
> > Fe 
> > end 
> > 
> > function doit3(N) 
> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
> > fs=[1.0] 
> > for i=1:N 
> > fs= ([f] * Jac );   Fe += Ns .*  fs; 
> > end 
> > Fe 
> > end 
> > 
> > function doit4(N) 
> > Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0; 
> > fs=[1.0] 
> > for i=1:N   # 
> > fs= [f]; fs *= Jac;   Fe += Ns .*  fs; 
> > end 
> > Fe 
> > end 
> > # 
> > @time doit1(N) 
> > @time doit2(N) 
> > @time doit3(N) 
> > @time doit4(N) 
> > 
> > Here are the measurements on my machine: 
> > 
> > elapsed time: 0.461173619 seconds (384129944 bytes allocated, 44.45% gc 
> > time) 
> > elapsed time: 6.905249901 seconds (1904151036 bytes allocated, 12.84% gc 
> > time) 
> > elapsed time: 6.862259146 seconds (1904166072 bytes allocated, 13.23% gc 
> > time) 
> > elapsed time: 6.842678542 seconds (1904166256 bytes allocated, 12.81% gc 
> > time) 
> > 
> > I'll be grateful for any pointers as to how to structure the code so 
> > that the array operations do not incur this horrendous penalty. 
> > 
> > Petr 
> > 
> > On Wednesday, December 10, 2014 7:06:43 PM UTC-8, Tim Holy wrote: 
> > > Can you post short but complete code examples in a gist? 
> > > https://gist.github.com/ 
> > > That would make it easier to follow than chasing code examples across 
> > > long email chains. 
> > > 
> > > --Tim 
> > > 
> > > On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote: 
> > > > I don't know if this is correct, but here is a guess: 
> > > > 
> > > > Option 3 still requires a temp array (to calculate the result of 
> > > > the paren fs= ([f] * Jac * w[j]); ), and option 4 eliminates that 
> > > > temp. The cost of the temp over the 2 million loops is ~200MB and 
> > > > 0.6 sec CPU time. 
> > > 
> > > > So WHY is the difference between 1 and 2 so HUUUGE? 
> > > > 
> > > > I think this calls for someone who wrote the compiler. Guys? 
> > > > 
> > > > Thanks a bunch, 
> > > > 
> > > > P 
> > > > 
> > > > On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote: 
> > > > > Actually: option (4) was also tested: 
> > > > > # 16.333766821 seconds (3008899660 bytes 
> > > > > fs[1]= f; fs *= (Jac * w[j]); 
> > > > > 
> > > > > Fe += Ns[j] .*  fs; 
> > > > > 
> > > > So, allocation of memory was reduced somewhat, runtime not so much.

Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Tim Holy
Multiplying two Float64s yields another Float64; most likely, this will be 
stored in the CPU's registers. In contrast, [f]*Jac creates an array on each 
iteration, and that array has to be stored on the heap.

A faster variant devectorizes:

    function doit1a(N)
        Fe = zeros(3,1); Ns = zeros(3,1) + 1.0; f = -6.0; Jac = 1.0
        for i = 1:N
            tmp = f * Jac
            for j = 1:length(Fe)
                Fe[j] += Ns[j] * tmp
            end
        end
        Fe
    end

julia> @time doit1(N);
elapsed time: 0.810270399 seconds (384000320 bytes allocated, 61.23% gc time)

julia> @time doit1a(N);
elapsed time: 0.022118726 seconds (320 bytes allocated)

Note the tiny allocations in the second case.
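
One way to confirm where the allocations come from is the @allocated macro 
(a sketch; wrap the expressions in functions so untyped globals do not 
distort the measurement):

    scalar_version(f, Jac) = f * Jac
    array_version(f, Jac)  = [f] * Jac
    scalar_version(-6.0, 1.0); array_version(-6.0, 1.0)  # warm up the JIT
    @allocated scalar_version(-6.0, 1.0)   # 0 -- stays in registers
    @allocated array_version(-6.0, 1.0)    # nonzero -- two heap temporaries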

--Tim


On Wednesday, December 10, 2014 07:54:00 PM Petr Krysl wrote:
> The code is really short:
> 
> N=200
> 
> function doit1(N)
> Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0;
> for i=1:N
> Fe += Ns *  (f * Jac);
> end
> Fe
> end
> 
> function doit2(N)
> Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0;
> for i=1:N
> Fe += Ns .*  ([f] * Jac);
> end
> Fe
> end
> 
> function doit3(N)
> Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0;
> fs=[1.0]
> for i=1:N
> fs= ([f] * Jac );   Fe += Ns .*  fs;
> end
> Fe
> end
> 
> function doit4(N)
> Fe=zeros(3,1);Ns=zeros(3,1)+1.0;f=-6.0;Jac=1.0;
> fs=[1.0]
> for i=1:N   #
> fs= [f]; fs *= Jac;   Fe += Ns .*  fs;
> end
> Fe
> end
> #
> @time doit1(N)
> @time doit2(N)
> @time doit3(N)
> @time doit4(N)
> 
> Here are the measurements on my machine:
> 
> elapsed time: 0.461173619 seconds (384129944 bytes allocated, 44.45% gc
> time)
> elapsed time: 6.905249901 seconds (1904151036 bytes allocated, 12.84% gc
> time)
> elapsed time: 6.862259146 seconds (1904166072 bytes allocated, 13.23% gc
> time)
> elapsed time: 6.842678542 seconds (1904166256 bytes allocated, 12.81% gc
> time)
> 
> I'll be grateful for any pointers as to how to structure the code so that
> the array operations do not incur this horrendous penalty.
> 
> Petr
> 
> On Wednesday, December 10, 2014 7:06:43 PM UTC-8, Tim Holy wrote:
> > Can you post short but complete code examples in a gist?
> > https://gist.github.com/
> > That would make it easier to follow than chasing code examples across long
> > email chains.
> > 
> > --Tim
> > 
> > On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote:
> > > I don't know if this is correct, but here is a guess:
> > > 
> > > Option 3 still requires a temp array ( to calculate the result of the
> > > paren fs= ([f] * Jac * w[j]); ), and option 4 eliminates that temp. The
> > > cost of the temp over the 2 million loops is ~200MB and 0.6 sec CPU time.
> > 
> > > So WHY is the difference between 1 and 2 so HUUUGE?
> > > 
> > > I think this calls for someone who wrote the compiler. Guys?
> > > 
> > > Thanks a bunch,
> > > 
> > > P
> > > 
> > > On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote:
> > > > Actually: option (4) was also tested:
> > > > # 16.333766821 seconds (3008899660 bytes
> > > > fs[1]= f; fs *= (Jac * w[j]);
> > > > 
> > > > Fe += Ns[j] .*  fs;
> > > > 
> > > > So, allocation of memory was reduced somewhat, runtime not so much.
> > > > 
> > > > On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:
> > > >> Well,  temporary array was also on my mind.  However,  things are I
> > > >> believe a little bit more complicated.
> > > >> 
> > >> Here is the code with three timed options.  As you can see, the first
> > >> two options are the fast one (multiplication with a scalar) and the
> > >> slow one (multiplication with a one by one matrix).  In the third
> > >> option I tried to avoid the creation of an ad hoc temporary by
> > >> allocating a variable outside of the loop.  The effect unfortunately
> > >> is nil.
> > > >> 
> > > >> fs=[0.0]# Used only for option (3)
> > > >> # Now loop over all fes in the block
> > > >> for i=1:size(conns,1)
> > > >> 
> > > >> ...
> > > >> for j=1:npts
> > > >> 
> > > >>...
> > > >>   
> > > >>   # Option (1): 7.193767019 seconds (1648850568 bytes
> > > >>   # Fe += Ns[j] *  (f * Jac * w[j]); #
> > > >>   
> > > >># Option (2): 17.301214583 seconds (3244458368 bytes
> > > >>   
> > > >>   #Fe += Ns[j] .*  ([f] * Jac * w[j]); #
> > > >>   
> > > >># Option (3): 16.943314075 seconds (3232879120
> > > >>   
> > > >>   fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs;
> > > >> 
> > > >> end
> > > >> ...
> > > >> 
> > > >> end
> > > >> 
> > > >> What do you think?   Why is the code still getting hit with a big
> > >> performance/memory penalty?

Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Petr Krysl
https://gist.github.com/anonymous/054f6fd269f97a4442db

On Wednesday, December 10, 2014 7:06:43 PM UTC-8, Tim Holy wrote:
>
> Can you post short but complete code examples in a gist? 
> https://gist.github.com/ 
> That would make it easier to follow than chasing code examples across long 
> email chains. 
>
> --Tim 
>
>
> On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote: 
> > I don't know if this is correct, but here is a guess: 
> > 
> > Option 3 still requires a temp array ( to calculate the result of the 
> > paren fs= ([f] * Jac * w[j]); ), and option 4 eliminates that temp. The 
> > cost of the temp over the 2 million loops is ~200MB and 0.6 sec CPU time. 
> > So WHY is the difference between 1 and 2 so HUUUGE? 
> > 
> > I think this calls for someone who wrote the compiler. Guys? 
> > 
> > Thanks a bunch, 
> > 
> > P 
> > 
> > On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote: 
> > > Actually: option (4) was also tested: 
> > > # 16.333766821 seconds (3008899660 bytes 
> > > fs[1]= f; fs *= (Jac * w[j]); 
> > > 
> > > Fe += Ns[j] .*  fs; 
> > > 
> > > So, allocation of memory was reduced somewhat, runtime not so much. 
> > > 
> > > On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote: 
> > >> Well,  temporary array was also on my mind.  However,  things are I 
> > >> believe a little bit more complicated. 
> > >> 
> > >> Here is the code with three timed options.  As you can see, the first 
> > >> two options are the fast one (multiplication with a scalar) and the 
> > >> slow one (multiplication with a one by one matrix).  In the third 
> > >> option I tried to avoid the creation of an ad hoc temporary by 
> > >> allocating a variable outside of the loop.  The effect unfortunately 
> > >> is nil. 
> > >> 
> > >> fs=[0.0]# Used only for option (3) 
> > >> # Now loop over all fes in the block 
> > >> for i=1:size(conns,1) 
> > >> 
> > >> ... 
> > >> for j=1:npts 
> > >> 
> > >>... 
> > >>   
> > >>   # Option (1): 7.193767019 seconds (1648850568 bytes 
> > >>   # Fe += Ns[j] *  (f * Jac * w[j]); # 
> > >>   
> > >># Option (2): 17.301214583 seconds (3244458368 bytes 
> > >>   
> > >>   #Fe += Ns[j] .*  ([f] * Jac * w[j]); # 
> > >>   
> > >># Option (3): 16.943314075 seconds (3232879120 
> > >>   
> > >>   fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs; 
> > >> 
> > >> end 
> > >> ... 
> > >> 
> > >> end 
> > >> 
> > >> What do you think?   Why is the code still getting hit with a big 
> > >> performance/memory penalty? 
> > >> 
> > >> Thanks, 
> > >> 
> > >> Petr 
> > >> 
> > >> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote: 
> > >>> I would think that when f is a 1x1 matrix Julia is allocating a new 
> > >>> 1x1 matrix to store the result. If it is a scalar that allocation can 
> > >>> be skipped. When this part of the code is now in a hot loop it might 
> > >>> happen that you allocate millions of very small short-lived objects 
> > >>> and that taxes the GC quite a lot. 
>
>

Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Petr Krysl
The code is really short:

N=200

function doit1(N)
    Fe = zeros(3,1); Ns = zeros(3,1) + 1.0; f = -6.0; Jac = 1.0
    for i = 1:N
        Fe += Ns * (f * Jac)
    end
    Fe
end

function doit2(N)
    Fe = zeros(3,1); Ns = zeros(3,1) + 1.0; f = -6.0; Jac = 1.0
    for i = 1:N
        Fe += Ns .* ([f] * Jac)
    end
    Fe
end

function doit3(N)
    Fe = zeros(3,1); Ns = zeros(3,1) + 1.0; f = -6.0; Jac = 1.0
    fs = [1.0]
    for i = 1:N
        fs = ([f] * Jac);  Fe += Ns .* fs
    end
    Fe
end

function doit4(N)
    Fe = zeros(3,1); Ns = zeros(3,1) + 1.0; f = -6.0; Jac = 1.0
    fs = [1.0]
    for i = 1:N
        fs = [f]; fs *= Jac;  Fe += Ns .* fs
    end
    Fe
end
# 
@time doit1(N)
@time doit2(N)
@time doit3(N)
@time doit4(N)

Here are the measurements on my machine:

elapsed time: 0.461173619 seconds (384129944 bytes allocated, 44.45% gc 
time)
elapsed time: 6.905249901 seconds (1904151036 bytes allocated, 12.84% gc 
time)
elapsed time: 6.862259146 seconds (1904166072 bytes allocated, 13.23% gc 
time)
elapsed time: 6.842678542 seconds (1904166256 bytes allocated, 12.81% gc 
time)
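
The per-iteration allocations behind those numbers can be read off the doit2 
line directly (an annotation based on the explanations elsewhere in this 
thread):

    Fe += Ns .* ([f] * Jac)
    # [f]           allocates a fresh 1-element array
    # [f] * Jac     allocates another array for the product
    # Ns .* (...)   allocates a new 3x1 result
    # Fe += ...     allocates yet another 3x1 to hold Fe + result
    # => four heap allocations on every one of the N iterations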

I'll be grateful for any pointers as to how to structure the code so that 
the array operations do not incur this horrendous penalty.

Petr

On Wednesday, December 10, 2014 7:06:43 PM UTC-8, Tim Holy wrote:
>
> Can you post short but complete code examples in a gist? 
> https://gist.github.com/ 
> That would make it easier to follow than chasing code examples across long 
> email chains. 
>
> --Tim 
>
>
> On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote: 
> > I don't know if this is correct, but here is a guess: 
> > 
> > Option 3 still requires a temp array ( to calculate the result of the 
> > paren fs= ([f] * Jac * w[j]); ), and option 4 eliminates that temp. The 
> > cost of the temp over the 2 million loops is ~200MB and 0.6 sec CPU time. 
> > So WHY is the difference between 1 and 2 so HUUUGE? 
> > 
> > I think this calls for someone who wrote the compiler. Guys? 
> > 
> > Thanks a bunch, 
> > 
> > P 
> > 
> > On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote: 
> > > Actually: option (4) was also tested: 
> > > # 16.333766821 seconds (3008899660 bytes 
> > > fs[1]= f; fs *= (Jac * w[j]); 
> > > 
> > > Fe += Ns[j] .*  fs; 
> > > 
> > > So, allocation of memory was reduced somewhat, runtime not so much. 
> > > 
> > > On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote: 
> > >> Well,  temporary array was also on my mind.  However,  things are I 
> > >> believe a little bit more complicated. 
> > >> 
> > >> Here is the code with three timed options.  As you can see, the first 
> > >> two options are the fast one (multiplication with a scalar) and the 
> > >> slow one (multiplication with a one by one matrix).  In the third 
> > >> option I tried to avoid the creation of an ad hoc temporary by 
> > >> allocating a variable outside of the loop.  The effect unfortunately 
> > >> is nil. 
> > >> 
> > >> fs=[0.0]# Used only for option (3) 
> > >> # Now loop over all fes in the block 
> > >> for i=1:size(conns,1) 
> > >> 
> > >> ... 
> > >> for j=1:npts 
> > >> 
> > >>... 
> > >>   
> > >>   # Option (1): 7.193767019 seconds (1648850568 bytes 
> > >>   # Fe += Ns[j] *  (f * Jac * w[j]); # 
> > >>   
> > >># Option (2): 17.301214583 seconds (3244458368 bytes 
> > >>   
> > >>   #Fe += Ns[j] .*  ([f] * Jac * w[j]); # 
> > >>   
> > >># Option (3): 16.943314075 seconds (3232879120 
> > >>   
> > >>   fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs; 
> > >> 
> > >> end 
> > >> ... 
> > >> 
> > >> end 
> > >> 
> > >> What do you think?   Why is the code still getting hit with a big 
> > >> performance/memory penalty? 
> > >> 
> > >> Thanks, 
> > >> 
> > >> Petr 
> > >> 
> > >> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote: 
> > >>> I would think that when f is a 1x1 matrix Julia is allocating a new 
> > >>> 1x1 matrix to store the result. If it is a scalar that allocation can 
> > >>> be skipped. When this part of the code is now in a hot loop it might 
> > >>> happen that you allocate millions of very small short-lived objects 
> > >>> and that taxes the GC quite a lot. 
>
>

Re: [julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Tim Holy
Can you post short but complete code examples in a gist?
https://gist.github.com/
That would make it easier to follow than chasing code examples across long 
email chains.

--Tim


On Wednesday, December 10, 2014 02:35:38 PM Petr Krysl wrote:
> I don't know if this is correct, but here is a guess:
> 
> Option 3 still requires a temp array ( to calculate the result of the
> paren fs= ([f] * Jac * w[j]); ), and option 4 eliminates that temp. The
> cost of the temp over the 2 million loops is ~200MB and 0.6 sec CPU time.
> So WHY is the difference between 1 and 2 so HUUUGE?
> 
> I think this calls for someone who wrote the compiler. Guys?
> 
> Thanks a bunch,
> 
> P
> 
> On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote:
> > Actually: option (4) was also tested:
> > # 16.333766821 seconds (3008899660 bytes
> > fs[1]= f; fs *= (Jac * w[j]);
> > 
> > Fe += Ns[j] .*  fs;
> > 
> > So, allocation of memory was reduced somewhat, runtime not so much.
> > 
> > On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:
> >> Well,  temporary array was also on my mind.  However,  things are I
> >> believe a little bit more complicated.
> >> 
> >> Here is the code with three timed options.  As you can see, the first two
> >> options are the fast one (multiplication with a scalar) and the slow one
> >> (multiplication with a one by one matrix).  In the third option I tried
> >> to avoid the creation of an ad hoc temporary by allocating a variable
> >> outside of the loop.  The effect unfortunately is nil.
> >> 
> >> fs=[0.0]# Used only for option (3)
> >> # Now loop over all fes in the block
> >> for i=1:size(conns,1)
> >> 
> >> ...
> >> for j=1:npts
> >> 
> >>...
> >>   
> >>   # Option (1): 7.193767019 seconds (1648850568 bytes
> >>   # Fe += Ns[j] *  (f * Jac * w[j]); #
> >>   
> >># Option (2): 17.301214583 seconds (3244458368 bytes
> >>   
> >>   #Fe += Ns[j] .*  ([f] * Jac * w[j]); #
> >>   
> >># Option (3): 16.943314075 seconds (3232879120
> >>   
> >>   fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs;
> >> 
> >> end
> >> ...
> >> 
> >> end
> >> 
> >> What do you think?   Why is the code still getting hit with a big
> >> performance/memory penalty?
> >> 
> >> Thanks,
> >> 
> >> Petr
> >> 
> >> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:
> >>> I would think that when f is a 1x1 matrix Julia is allocating a new 1x1
> >>> matrix to store the result. If it is a scalar that allocation can be
> >>> skipped. When this part of the code is now in a hot loop it might happen
> >>> that you allocate millions of very small short-lived objects and that
> >>> taxes the GC quite a lot.



[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Petr Krysl
I don't know if this is correct, but here is a guess:

Option 3 still requires a temp array (to calculate the result of the 
parenthesized expression fs = ([f] * Jac * w[j]); ), and option 4 eliminates 
that temp. The cost of the temp over the 2 million loops is ~200MB and 0.6 
sec of CPU time. So WHY is the difference between 1 and 2 so HUUUGE?

I think this calls for someone who wrote the compiler. Guys?

Thanks a bunch,

P

On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote:
>
> Actually: option (4) was also tested:
> # 16.333766821 seconds (3008899660 bytes
> fs[1]= f; fs *= (Jac * w[j]); 
> Fe += Ns[j] .*  fs; 
>
> So, allocation of memory was reduced somewhat, runtime not so much.
>
> On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:
>
>> Well,  temporary array was also on my mind.  However,  things are I 
>> believe a little bit more complicated.
>>
>> Here is the code with three timed options.  As you can see, the first two 
>> options are the fast one (multiplication with a scalar) and the slow one 
>> (multiplication with a one by one matrix).   In the third option I tried to 
>> avoid the creation of an  ad hoc temporary by allocating a variable outside 
>> of the loop.  The effect unfortunately is nil.
>>
>> fs=[0.0]# Used only for option (3)
>> # Now loop over all fes in the block
>> for i=1:size(conns,1)
>> ...
>> for j=1:npts
>>...
>>   # Option (1): 7.193767019 seconds (1648850568 bytes
>>   # Fe += Ns[j] *  (f * Jac * w[j]); #
>># Option (2): 17.301214583 seconds (3244458368 bytes
>>   #Fe += Ns[j] .*  ([f] * Jac * w[j]); #
>># Option (3): 16.943314075 seconds (3232879120
>>   fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs; 
>> end
>> ...
>> end
>>
>> What do you think?   Why is the code still getting hit with a big 
>> performance/memory penalty?
>>
>> Thanks,
>>
>> Petr
>>
>> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:
>>
>>> I would think that when f is a 1x1 matrix Julia is allocating a new 1x1 
>>> matrix to store the result. If it is a scalar that allocation can be 
>>> skipped. When this part of the code is now in a hot loop it might happen 
>>> that you allocate millions of very small short-lived objects and that taxes 
>>> the GC quite a lot.  
>>>
>>>


[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Petr Krysl
Actually: option (4) was also tested:
# 16.333766821 seconds (3008899660 bytes
fs[1]= f; fs *= (Jac * w[j]); 
Fe += Ns[j] .*  fs; 

So, allocation of memory was reduced somewhat, runtime not so much.

On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:

> Well,  temporary array was also on my mind.  However,  things are I 
> believe a little bit more complicated.
>
> Here is the code with three timed options.  As you can see, the first two 
> options are the fast one (multiplication with a scalar) and the slow one 
> (multiplication with a one by one matrix).   In the third option I tried to 
> avoid the creation of an  ad hoc temporary by allocating a variable outside 
> of the loop.  The effect unfortunately is nil.
>
> fs=[0.0]# Used only for option (3)
> # Now loop over all fes in the block
> for i=1:size(conns,1)
> ...
> for j=1:npts
>...
>   # Option (1): 7.193767019 seconds (1648850568 bytes
>   # Fe += Ns[j] *  (f * Jac * w[j]); #
># Option (2): 17.301214583 seconds (3244458368 bytes
>   #Fe += Ns[j] .*  ([f] * Jac * w[j]); #
># Option (3): 16.943314075 seconds (3232879120
>   fs= ([f] * Jac * w[j]);   Fe += Ns[j] .*  fs; 
> end
> ...
> end
>
> What do you think?   Why is the code still getting hit with a big 
> performance/memory penalty?
>
> Thanks,
>
> Petr
>
> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:
>
>> I would think that when f is a 1x1 matrix Julia is allocating a new 1x1 
>> matrix to store the result. If it is a scalar that allocation can be 
>> skipped. When this part of the code is now in a hot loop it might happen 
>> that you allocate millions of very small short-lived objects and that taxes 
>> the GC quite a lot.  
>>
>>
>>>

[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-10 Thread Petr Krysl
Well, a temporary array was also on my mind.  However, things are, I 
believe, a little more complicated.

Here is the code with three timed options.  As you can see, the first two 
options are the fast one (multiplication with a scalar) and the slow one 
(multiplication with a one by one matrix).  In the third option I tried to 
avoid the creation of an ad hoc temporary by allocating a variable outside 
of the loop.  The effect unfortunately is nil.

fs = [0.0]   # Used only for option (3)
# Now loop over all fes in the block
for i = 1:size(conns,1)
    ...
    for j = 1:npts
        ...
        # Option (1): 7.193767019 seconds (1648850568 bytes
        # Fe += Ns[j] * (f * Jac * w[j]);
        # Option (2): 17.301214583 seconds (3244458368 bytes
        # Fe += Ns[j] .* ([f] * Jac * w[j]);
        # Option (3): 16.943314075 seconds (3232879120
        fs = ([f] * Jac * w[j]);   Fe += Ns[j] .* fs;
    end
    ...
end
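
A likely reason the preallocated fs has no effect (an annotation, not from 
the original message): assignment rebinds the name instead of filling the 
existing array.

    fs = ([f] * Jac * w[j])   # rebinds fs to a freshly allocated array;
                              # the fs=[0.0] from before the loop is never reused
    fs[1] = f * Jac * w[j]    # would update the existing array in place
                              # (though Ns[j] .* fs and Fe += ... still allocate)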

What do you think?   Why is the code still getting hit with a big 
performance/memory penalty?

Thanks,

Petr

On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:

> I would think that when f is a 1x1 matrix Julia is allocating a new 1x1 
> matrix to store the result. If it is a scalar that allocation can be 
> skipped. When this part of the code is now in a hot loop it might happen 
> that you allocate millions of very small short-lived objects and that taxes 
> the GC quite a lot.  
>
>
>>

[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-08 Thread Robert Gates
Although... looking closer at my suggestion, it appears to leak. Anyone 
know what I'm missing?

On Tuesday, December 9, 2014 7:31:02 AM UTC+1, Robert Gates wrote:
>
> Hi Petr,
>
> I wrote you an email.
>
> Although I haven't seen your Julia implementation, I believe you could cut 
> down on GC time by running the GC at regular intervals only. For example, 
> when I loop over something, I usually choose to purge memory every 20Mb, 
> e.g. every 1000 iterations. That way, I have experienced 10x speed-ups for 
> GC heavy problems. For example: 
> https://gist.github.com/rleegates/c305c0fec0b963257cba
>
> Best regards,
>
> Robert
>
> On Monday, December 8, 2014 9:09:19 PM UTC+1, Petr Krysl wrote:
>>
>> I posted a message yesterday about a simple port of fragments of a finite 
>> element code to solve the heat conduction equation.  The Julia port was 
>> compared to the original Matlab toolkit FinEALE and the Julia code 
>> presented last year by Amuthan:
>>
>>
>> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/Krysl%7Csort:relevance/julia-users/3tTljDSQ6cs/-8UCPnNmzn4J
>>
>> I have now sped the code up some more, so that we have this table (on 
>> my laptop, i7, 16 GB of memory):
>>
>> Amuthan's 29 seconds
>> J FinEALE 58 seconds
>> FinEALE 810 seconds
>>
>> So, given that Amuthan reports being slower than a general-purpose C++ code 
>> by a factor of around 1.36, J FinEALE is presumably slower than an 
>> equivalent FE solver coded in C++ by a factor of 2.7.  So far so good.
>>
>> The curious thing is that @time reports huge amounts of memory allocated 
>> (something like 10% of GC time).
>> One particular source of wild memory allocation was this line (executed 
>> in this case 2 million times)
>>
>> Fe += Ns[j] .*  (f * Jac * w[j]);
>>
>> where 
>> Fe = 3 x 1 matrix
>> Ns[j] = 3 x 1 matrix
>> f = 1 x 1 matrix
>> Jac, w[j] = scalars
>>
>> The cost of the operation that encloses this line (among many others):
>> 19.835263928 seconds (4162094480 bytes allocated, 16.79% gc time)
>>
>> Changing the 1 x 1 matrix f into a scalar (and replacing .*)
>>
>> Fe += Ns[j] * (f * Jac * w[j]);
>>
>> changed the cost quite drastically:
>> 10.105620394 seconds (2738120272 bytes allocated, 21.33% gc time)
>>
>> Any ideas, you Julian wizards?
>>
>> Thanks,
>>
>> Petr
>>
>

[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-08 Thread Robert Gates
Hi Petr,

I wrote you an email.

Although I haven't seen your Julia implementation, I believe you could cut 
down on GC time by running the GC at regular intervals only. For example, 
when I loop over something, I usually choose to purge memory every 20 MB, 
e.g. every 1000 iterations. That way, I have experienced 10x speed-ups on 
GC-heavy problems. For example: 
https://gist.github.com/rleegates/c305c0fec0b963257cba
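
A minimal sketch of that pattern (assuming the gc(), gc_disable() and 
gc_enable() functions of Julia 0.3; newer versions spell these GC.gc() and 
GC.enable()):

    gc_disable()                 # suspend automatic collections in the hot loop
    for i = 1:n
        do_work(i)               # hypothetical allocating loop body
        if i % 1000 == 0         # purge at a fixed interval, e.g. ~20 MB
            gc_enable(); gc(); gc_disable()
        end
    end
    gc_enable()                  # restore normal GC behavior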

Best regards,

Robert

On Monday, December 8, 2014 9:09:19 PM UTC+1, Petr Krysl wrote:
>
> I posted a message yesterday about a simple port of fragments of a finite 
> element code to solve the heat conduction equation.  The Julia port was 
> compared to the original Matlab toolkit FinEALE and the Julia code 
> presented last year by Amuthan:
>
>
> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/Krysl%7Csort:relevance/julia-users/3tTljDSQ6cs/-8UCPnNmzn4J
>
> I have now sped the code up some more, so that we have this table (on my 
> laptop, i7, 16 GB of memory):
>
> Amuthan's 29 seconds
> J FinEALE 58 seconds
> FinEALE 810 seconds
>
> So, given that Amuthan reports being slower than a general-purpose C++ code 
> by a factor of around 1.36, J FinEALE is presumably slower than an 
> equivalent FE solver coded in C++ by a factor of 2.7.  So far so good.
>
> The curious thing is that @time reports huge amounts of memory allocated 
> (something like 10% of GC time).
> One particular source of wild memory allocation was this line (executed in 
> this case 2 million times)
>
> Fe += Ns[j] .*  (f * Jac * w[j]);
>
> where 
> Fe = 3 x 1 matrix
> Ns[j] = 3 x 1 matrix
> f = 1 x 1 matrix
> Jac, w[j] = scalars
>
> The cost of the operation that encloses this line (among many others):
> 19.835263928 seconds (4162094480 bytes allocated, 16.79% gc time)
>
> Changing the 1 x 1 matrix f into a scalar (and replacing .*)
>
> Fe += Ns[j] * (f * Jac * w[j]);
>
> changed the cost quite drastically:
> 10.105620394 seconds (2738120272 bytes allocated, 21.33% gc time)
>
> Any ideas, you Julian wizards?
>
> Thanks,
>
> Petr
>


[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-08 Thread Petr Krysl
Hi Robert,

At this point I am really impressed with Julia.  Consequently, I am tempted 
to rewrite my Matlab toolkit FinEALE 
(https://github.com/PetrKryslUCSD/FinEALE) in Julia and to implement 
further research ideas in this new environment. 

What did you have in mind?

Petr
pkr...@ucsd.edu

On Monday, December 8, 2014 6:41:02 PM UTC-8, Robert Gates wrote:
>
> Hi Petr,
>
> just on a side-note: are you planning on implementing a complete FE 
> package in Julia? In that case we could pool our efforts.
>
>
>

[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-08 Thread Robert Gates
Hi Petr,

just on a side-note: are you planning on implementing a complete FE package 
in Julia? In that case we could pool our efforts.

Best regards,

Robert

On Monday, December 8, 2014 9:09:19 PM UTC+1, Petr Krysl wrote:
>
> I posted a message yesterday about a simple port of fragments of a finite 
> element code to solve the heat conduction equation.  The Julia port was 
> compared to the original Matlab toolkit FinEALE and the Julia code 
> presented last year by Amuthan:
>
>
> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/Krysl%7Csort:relevance/julia-users/3tTljDSQ6cs/-8UCPnNmzn4J
>
> I have now sped the code up some more, so that we have this table (on my 
> laptop, i7, 16 GB of memory):
>
> Amuthan's 29 seconds
> J FinEALE 58 seconds
> FinEALE 810 seconds
>
> So, given that Amuthan reports being slower than a general-purpose C++ code 
> by a factor of around 1.36, J FinEALE is presumably slower than an 
> equivalent FE solver coded in C++ by a factor of 2.7.  So far so good.
>
> The curious thing is that @time reports huge amounts of memory allocated 
> (something like 10% of GC time).
> One particular source of wild memory allocation was this line (executed in 
> this case 2 million times)
>
> Fe += Ns[j] .*  (f * Jac * w[j]);
>
> where 
> Fe = 3 x 1 matrix
> Ns[j] = 3 x 1 matrix
> f = 1 x 1 matrix
> Jac, w[j] = scalars
>
> The cost of the operation that encloses this line (among many others):
> 19.835263928 seconds (4162094480 bytes allocated, 16.79% gc time)
>
> Changing the 1 x 1 matrix f into a scalar (and replacing .*)
>
> Fe += Ns[j] * (f * Jac * w[j]);
>
> changed the cost quite drastically:
> 10.105620394 seconds (2738120272 bytes allocated, 21.33% gc time)
>
> Any ideas, you Julian wizards?
>
> Thanks,
>
> Petr
>


[julia-users] Re: Finite element code in Julia: Curious overhead in .* product

2014-12-08 Thread Valentin Churavy
I would think that when f is a 1x1 matrix, Julia is allocating a new 1x1 
matrix to store the result. If it is a scalar, that allocation can be 
skipped. When this part of the code is in a hot loop, it might happen 
that you allocate millions of very small, short-lived objects, and that 
taxes the GC quite a lot.
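
The two cases side by side (a sketch of the distinction):

    f_scalar = -6.0      # Float64: f_scalar * Jac stays in CPU registers
    f_matrix = [-6.0]    # 1-element array: f_matrix * Jac must allocate
                         # a brand-new array to hold the result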

On Monday, 8 December 2014 21:09:19 UTC+1, Petr Krysl wrote:
>
> I posted a message yesterday about a simple port of fragments of a finite 
> element code to solve the heat conduction equation.  The Julia port was 
> compared to the original Matlab toolkit FinEALE and the Julia code 
> presented last year by Amuthan:
>
>
> https://groups.google.com/forum/?fromgroups=#!searchin/julia-users/Krysl%7Csort:relevance/julia-users/3tTljDSQ6cs/-8UCPnNmzn4J
>
> I have now sped the code up some more, so that we have this table (on my 
> laptop, i7, 16 GB of memory):
>
> Amuthan's 29 seconds
> J FinEALE 58 seconds
> FinEALE 810 seconds
>
> So, given that Amuthan reports being slower than a general-purpose C++ code 
> by a factor of around 1.36, J FinEALE is presumably slower than an 
> equivalent FE solver coded in C++ by a factor of 2.7.  So far so good.
>
> The curious thing is that @time reports huge amounts of memory allocated 
> (something like 10% of GC time).
> One particular source of wild memory allocation was this line (executed in 
> this case 2 million times)
>
> Fe += Ns[j] .*  (f * Jac * w[j]);
>
> where 
> Fe = 3 x 1 matrix
> Ns[j] = 3 x 1 matrix
> f = 1 x 1 matrix
> Jac, w[j] = scalars
>
> The cost of the operation that encloses this line (among many others):
> 19.835263928 seconds (4162094480 bytes allocated, 16.79% gc time)
>
> Changing the 1 x 1 matrix f into a scalar (and replacing .*)
>
> Fe += Ns[j] * (f * Jac * w[j]);
>
> changed the cost quite drastically:
> 10.105620394 seconds (2738120272 bytes allocated, 21.33% gc time)
>
> Any ideas, you Julian wizards?
>
> Thanks,
>
> Petr
>