I don't know if this is correct, but here is a guess:

Option 3 still requires a temp array (to hold the result of the 
parenthesized expression fs = ([f] * Jac * w[j]);), and option 4 eliminates 
that temp. The cost of that temp over the 2 million loop iterations is 
~200MB and 0.6 sec of CPU time. So WHY is the difference between options 
1 and 2 so HUUUGE?
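For what it's worth, the effect is easy to reproduce in isolation. Here is a minimal sketch (made-up values and variable names, not the original assembly code) contrasting the scalar product of option (1) with the 1x1-array product of option (2); timing either function with @time shows the allocation gap:

```julia
# Option (1) analogue: f * Jac * w is a plain scalar, so the only temp
# per iteration is the result vector of `Ns * s` and the `+=`.
function scalar_version(n)
    Fe = zeros(3)
    Ns = [0.25, 0.5, 0.25]
    f = 2.0; Jac = 0.5; w = 0.1
    for i = 1:n
        Fe += Ns * (f * Jac * w)
    end
    return Fe
end

# Option (2) analogue: `[f]` builds a fresh one-element array every pass,
# and the broadcast `.*` builds another temp on top of it.
function matrix_version(n)
    Fe = zeros(3)
    Ns = [0.25, 0.5, 0.25]
    f = 2.0; Jac = 0.5; w = 0.1
    for i = 1:n
        Fe += Ns .* ([f] * Jac * w)
    end
    return Fe
end
```

On my reading, every pass through matrix_version allocates at least one extra short-lived array for [f], which is exactly the GC traffic Valentin described.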

I think this calls for someone who wrote the compiler. Guys?
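In the meantime, one workaround that should sidestep the allocations entirely (a sketch under assumed types, not tested against the real assembly loop) is to devectorize the update: collapse f * Jac * w[j] into a plain scalar and write into Fe in place:

```julia
# Hypothetical allocation-free rewrite of the inner update. The scalar s
# replaces the 1x1 temp array, and the explicit loop replaces the vector
# temporaries created by `Ns[j] * s` and `Fe += ...`.
function add_scaled!(Fe::Vector{Float64}, Ns::Vector{Float64},
                     f::Float64, Jac::Float64, wj::Float64)
    s = f * Jac * wj              # plain Float64, no array allocated
    @inbounds for k = 1:length(Fe)
        Fe[k] += Ns[k] * s        # in-place accumulation
    end
    return Fe
end
```

Whether this recovers the full gap between options 1 and 2 I can't say, but it should take the per-iteration allocations to zero.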

Thanks a bunch,

P

On Wednesday, December 10, 2014 12:51:26 PM UTC-8, Petr Krysl wrote:
>
> Actually: option (4) was also tested:
> # 16.333766821 seconds (3008899660 bytes)
> fs[1]= f; fs *= (Jac * w[j]); 
>             Fe += Ns[j] .*  fs; 
>
> So, allocation of memory was reduced somewhat, runtime not so much.
>
> On Wednesday, December 10, 2014 12:45:20 PM UTC-8, Petr Krysl wrote:
>
>> Well, a temporary array was also on my mind. However, things are, I 
>> believe, a little more complicated.
>>
>> Here is the code with three timed options.  As you can see, the first two 
>> options are the fast one (multiplication with a scalar) and the slow one 
>> (multiplication with a one by one matrix).   In the third option I tried to 
>> avoid the creation of an  ad hoc temporary by allocating a variable outside 
>> of the loop.  The effect unfortunately is nil.
>>
>>     fs=[0.0]# Used only for option (3)
>>     # Now loop over all fes in the block
>>     for i=1:size(conns,1)
>>         ...
>>         for j=1:npts
>>            ...
>>           # Option (1): 7.193767019 seconds (1648850568 bytes)
>>           # Fe += Ns[j] * (f * Jac * w[j]);
>>           # Option (2): 17.301214583 seconds (3244458368 bytes)
>>           # Fe += Ns[j] .* ([f] * Jac * w[j]);
>>           # Option (3): 16.943314075 seconds (3232879120 bytes)
>>           fs = ([f] * Jac * w[j]);   Fe += Ns[j] .* fs; 
>>         end
>>         ...
>>     end
>>
>> What do you think?   Why is the code still getting hit with a big 
>> performance/memory penalty?
>>
>> Thanks,
>>
>> Petr
>>
>> On Monday, December 8, 2014 2:03:02 PM UTC-8, Valentin Churavy wrote:
>>
>>> I would think that when f is a 1x1 matrix Julia is allocating a new 1x1 
>>> matrix to store the result. If it is a scalar that allocation can be 
>>> skipped. When this part of the code is now in a hot loop it might happen 
>>> that you allocate millions of very small short-lived objects and that taxes 
>>> the GC quite a lot.  
>>>
>>>
