Frankly honored and flattered. That is clever to use the result of the == comparator, which yields a 1 or a 0, to avoid an if/else branch. Also, using assignment to a tuple of matrix indices to avoid allocating a temporary vector. The difficulty of mix_columns! is that each calculation depends on all four values in that column of the state, so the state cannot be updated until all the results have been obtained. I wonder, does this tuple assignment method work substantially differently at the code level? How is the allocation avoided? By storing in a register?
It looks like you might have access to a different version of Julia than the one I'm using, so I can't verify it myself with your code, but in 0.3 I found using the @inbounds tag decreased performance each time. Thanks for your attention. On Saturday, September 12, 2015 at 4:38:50 PM UTC-4, Simon Kornblith wrote: > > With some tweaks I got a 12x speedup. Original (using Iain's bench with > 100000 iterations): > > 0.639475 seconds (4.10 M allocations: 279.236 MB, 1.96% gc time) > 0.634781 seconds (4.10 M allocations: 279.236 MB, 1.90% gc time) > > With 9ab84caa046d687928642a27c30c85336efc876c > <https://github.com/simonster/crypto/commit/9ab84caa046d687928642a27c30c85336efc876c> > > from my fork (which avoids allocations, adds inbounds, inlines gf_mult, and > avoids some branches): > > 0.091223 seconds > 0.090931 seconds > > With 3694517e7737fe35f59172666da9971f701189ab > <https://github.com/simonster/crypto/commit/3694517e7737fe35f59172666da9971f701189ab>, > > which uses a lookup table for gf_mult: > > 0.062077 seconds > 0.062132 seconds > > With 6e05894856e2bec372b75cd52ae91f36731d2096 > <https://github.com/simonster/crypto/commit/6e05894856e2bec372b75cd52ae91f36731d2096>, > > which uglifies shift_rows! for performance: > > 0.052652 seconds > 0.052450 seconds > > There is probably a way to make gf_mult faster without using a lookup > table, since in many cases it's probably doing the same work several times, > but I didn't put much thought into it. > > Simon > > On Saturday, September 12, 2015 at 2:29:49 PM UTC-4, Kristoffer Carlsson > wrote: >> >> I played around with devectorization and made it allocation free but only >> got the time down by a factor of 2. >> >> Most of the time is spent in gf_mult anyway and I don't know how to >> optimize that one. If the C library is using a similar function, maybe >> looking at the generated code to see what is different. >> >