Values in tuples are either stored in registers or on the stack, as decided 
by the compiler, whereas Arrays are currently always allocated on the heap. 
Allocating on the heap is more expensive to begin with, and heap-allocated 
objects also have to be garbage collected once they're no longer in use.
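As a rough sketch of the difference (not from the original code; the function names here are made up purely for illustration), you can compare the two with @allocated:

```julia
# Hypothetical sketch: compare a tuple-based temporary with an
# Array-based one. Names are illustrative, not from the thread.
function sum_via_tuple(a, b)
    # The tuple (b, a) can live in registers or on the stack,
    # so this should not allocate on the heap.
    x, y = b, a
    return x + y
end

function sum_via_array(a, b)
    # [b, a] builds a heap-allocated Array on each call, which
    # later has to be garbage collected.
    t = [b, a]
    return t[1] + t[2]
end

sum_via_tuple(1, 2)   # warm up so compilation isn't measured
sum_via_array(1, 2)
@allocated sum_via_tuple(1, 2)   # typically 0 bytes
@allocated sum_via_array(1, 2)   # nonzero: a heap allocation per call
```

The exact byte counts depend on the Julia version, but the tuple version should report zero allocation once compiled, while the Array version allocates every call.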

I did this with Julia 0.4, which, as Iain noted above, appears to be 
substantially faster at this than 0.3 even without any changes.

Simon

On Saturday, September 12, 2015 at 8:33:02 PM UTC-4, Corey Moncure wrote:
>
> Frankly, I'm honored and flattered.  It's clever to use the result of the == 
> comparator, which yields a 1 or a 0, to avoid an if/else branch, and to use 
> assignment to a tuple of matrix indices to avoid allocating a 
> temporary vector.  The difficulty of mix_columns! is that each calculation 
> depends on all four values in that column of the state, so the state cannot 
> be updated until all the results have been obtained.  I wonder, does this 
> tuple assignment method work substantially differently at the code level?  
> How is the allocation avoided?  By storing in a register?
>
> It looks like you might have access to a different version of Julia than 
> the one I'm using, so I can't verify it myself with your code, but in 0.3 I 
> found that the @inbounds macro decreased performance each time.  
>
> Thanks for your attention. 
>
> On Saturday, September 12, 2015 at 4:38:50 PM UTC-4, Simon Kornblith wrote:
>>
>> With some tweaks I got a 12x speedup. Original (using Iain's bench with 
>> 100000 iterations):
>>
>>  0.639475 seconds (4.10 M allocations: 279.236 MB, 1.96% gc time) 
>>  0.634781 seconds (4.10 M allocations: 279.236 MB, 1.90% gc time)
>>
>> With 9ab84caa046d687928642a27c30c85336efc876c 
>> <https://github.com/simonster/crypto/commit/9ab84caa046d687928642a27c30c85336efc876c>
>>  
>> from my fork (which avoids allocations, adds inbounds, inlines gf_mult, and 
>> avoids some branches):
>>
>>  0.091223 seconds 
>>  0.090931 seconds
>>
>> With 3694517e7737fe35f59172666da9971f701189ab 
>> <https://github.com/simonster/crypto/commit/3694517e7737fe35f59172666da9971f701189ab>,
>>  
>> which uses a lookup table for gf_mult:
>>
>>  0.062077 seconds
>>  0.062132 seconds
>>
>> With 6e05894856e2bec372b75cd52ae91f36731d2096 
>> <https://github.com/simonster/crypto/commit/6e05894856e2bec372b75cd52ae91f36731d2096>,
>>  
>> which uglifies shift_rows! for performance:
>>
>>  0.052652 seconds 
>>  0.052450 seconds
>>
>> There is probably a way to make gf_mult faster without using a lookup 
>> table, since in many cases it's probably doing the same work several times, 
>> but I didn't put much thought into it.
>>
>> Simon
>>
>> On Saturday, September 12, 2015 at 2:29:49 PM UTC-4, Kristoffer Carlsson 
>> wrote:
>>>
>>> I played around with devectorization and made it allocation-free, but 
>>> only got the time down by a factor of 2.
>>>
>>> Most of the time is spent in gf_mult anyway, and I don't know how to 
>>> optimize that one. If the C library is using a similar function, maybe 
>>> looking at the generated code would show what is different.
>>>
>>
