Frankly, I'm honored and flattered.  It is clever to use the result of the == 
comparison, which yields 1 or 0, to avoid an if/else branch, and also to 
assign to a tuple of matrix indices to avoid allocating a temporary vector.  
The difficulty with mix_columns! is that each calculation depends on all four 
values in that column of the state, so the state cannot be updated until all 
the results have been obtained.  I wonder, does this tuple assignment work 
substantially differently at the level of the generated code?  How is the 
allocation avoided?  By keeping the values in registers?
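
For concreteness, here is roughly how I picture the two tricks.  This is only 
a minimal sketch with illustrative names (xtime, mix_column!) and an assumed 
4x4 UInt8 state layout, not the code from your commit, and it uses current 
syntax where xor is spelled ⊻ (0.3 spells it $):

    # Branchless conditional reduction: ((b & 0x80) == 0x80) is a Bool, so
    # multiplying by it contributes either 0x00 or 0x1b with no if/else.
    xtime(b::UInt8) = (b << 1) ⊻ (0x1b * ((b & 0x80) == 0x80))

    # Illustrative single-column update in the style of mix_columns!.  All
    # four results are computed from the old column first, then written back
    # with one tuple assignment instead of state[:, c] = [b1, b2, b3, b4].
    function mix_column!(state::AbstractMatrix{UInt8}, c::Int)
        a1, a2, a3, a4 = state[1, c], state[2, c], state[3, c], state[4, c]
        t = a1 ⊻ a2 ⊻ a3 ⊻ a4
        b1 = a1 ⊻ t ⊻ xtime(a1 ⊻ a2)
        b2 = a2 ⊻ t ⊻ xtime(a2 ⊻ a3)
        b3 = a3 ⊻ t ⊻ xtime(a3 ⊻ a4)
        b4 = a4 ⊻ t ⊻ xtime(a4 ⊻ a1)
        state[1, c], state[2, c], state[3, c], state[4, c] = b1, b2, b3, b4
        return state
    end

If I read it right, the right-hand side of that last assignment is just a 
tuple of plain UInt8 values rather than a heap-allocated Vector, which is the 
part I am curious about.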

It looks like you might have access to a different version of Julia than 
the one I'm using, so I can't verify it myself with your code, but on 0.3 I 
found that adding the @inbounds macro decreased performance each time.
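
For reference, this is roughly the kind of placement I was testing (an 
illustrative sketch with a made-up function name, same syntax caveat as the 
sketch above):

    # @inbounds disables bounds checking on the indexing inside the loop.
    function add_round_key!(state::AbstractMatrix{UInt8}, key::AbstractMatrix{UInt8})
        @inbounds for j in 1:4, i in 1:4
            state[i, j] ⊻= key[i, j]
        end
        return state
    end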

Thanks for your attention. 

On Saturday, September 12, 2015 at 4:38:50 PM UTC-4, Simon Kornblith wrote:
>
> With some tweaks I got a 12x speedup. Original (using Iain's bench with 
> 100000 iterations):
>
>  0.639475 seconds (4.10 M allocations: 279.236 MB, 1.96% gc time) 
>  0.634781 seconds (4.10 M allocations: 279.236 MB, 1.90% gc time)
>
> With 9ab84caa046d687928642a27c30c85336efc876c 
> <https://github.com/simonster/crypto/commit/9ab84caa046d687928642a27c30c85336efc876c>
> from my fork (which avoids allocations, adds inbounds, inlines gf_mult, and 
> avoids some branches):
>
>  0.091223 seconds 
>  0.090931 seconds
>
> With 3694517e7737fe35f59172666da9971f701189ab 
> <https://github.com/simonster/crypto/commit/3694517e7737fe35f59172666da9971f701189ab>,
> which uses a lookup table for gf_mult:
>
>  0.062077 seconds
>  0.062132 seconds
>
> With 6e05894856e2bec372b75cd52ae91f36731d2096 
> <https://github.com/simonster/crypto/commit/6e05894856e2bec372b75cd52ae91f36731d2096>,
> which uglifies shift_rows! for performance:
>
>  0.052652 seconds 
>  0.052450 seconds
>
> There is probably a way to make gf_mult faster without using a lookup 
> table, since in many cases it's probably doing the same work several times, 
> but I didn't put much thought into it.
>
> Simon
>
> On Saturday, September 12, 2015 at 2:29:49 PM UTC-4, Kristoffer Carlsson 
> wrote:
>>
>> I played around with devectorization and made it allocation free but only 
>> got the time down by a factor of 2.
>>
>> Most of the time is spent in gf_mult anyway and I don't know how to 
>> optimize that one. If the C library is using a similar function, maybe 
>> looking at the generated code would show what is different.
>>
>
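
P.S. For anyone else following the thread: a table-based gf_mult of the kind 
Simon mentions can be built from log/antilog tables, roughly as below.  This 
is only a sketch, reusing the xtime helper from my earlier sketch and using 
made-up table names; I don't know whether the linked commit does it this way 
(a full 256x256 product table would also work), and the same syntax caveat 
applies.

    # Exp/log tables for GF(2^8) with the AES polynomial, generator 0x03:
    # GF_EXP[k + 1] == 0x03^k and GF_LOG[v + 1] == log base 0x03 of v.
    const GF_EXP = zeros(UInt8, 256)
    const GF_LOG = zeros(UInt8, 256)
    let x = 0x01
        for k in 0:254
            GF_EXP[k + 1] = x
            GF_LOG[Int(x) + 1] = UInt8(k)
            x = xtime(x) ⊻ x   # multiply by the generator 0x03
        end
    end

    # a*b == 0x03^(log a + log b); zero is special-cased since it has no log.
    function gf_mult_table(a::UInt8, b::UInt8)
        (a == 0x00 || b == 0x00) && return 0x00
        GF_EXP[mod(Int(GF_LOG[Int(a) + 1]) + Int(GF_LOG[Int(b) + 1]), 255) + 1]
    end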
