On Mon, 2010-01-11 at 05:09 +0100, sascha wrote:

> due to the bitslicing none of those optimizations are applicable.
> I was expecting a smaller penalty also, due to the fact that i could
> avoid DRAM access. 

If there are real increases in efficiency to be had here, we could move
away from the bitslicing approach. Non-bitsliced code runs at around 180
chains per second on a single GPU, whereas bitsliced code runs at 500. I
did notice, however, that non-bitsliced code was running at around the
same speed when using tables and some clever multiplication trick to
drive the LFSRs.
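
To illustrate what driving an LFSR from tables can look like, here is a
minimal Python sketch. It is not the GPU code discussed above, and it
does not attempt the multiplication trick; the table layout and the
names PARITY8, parity_by_table and clock_lfsr are assumptions made up
for this example. The feedback bit of a Fibonacci LFSR is the parity of
the tapped bits, and that parity can be read from a 256-entry table one
byte at a time.

```python
# Hypothetical sketch: table-driven LFSR feedback (names are illustrative).
# PARITY8[i] holds the parity (XOR of all bits) of the byte value i.
PARITY8 = [bin(i).count("1") & 1 for i in range(256)]


def parity_by_table(x):
    """Parity of a value of up to 32 bits, using four byte lookups."""
    return (PARITY8[x & 0xFF]
            ^ PARITY8[(x >> 8) & 0xFF]
            ^ PARITY8[(x >> 16) & 0xFF]
            ^ PARITY8[(x >> 24) & 0xFF])


def clock_lfsr(reg, width, taps):
    """Shift the register left by one and feed the tap parity into bit 0."""
    mask = (1 << width) - 1
    return ((reg << 1) & mask) | parity_by_table(reg & taps)
```

With the 19-bit A5/1 register R1 (taps at bits 13, 16, 17 and 18, i.e.
mask 0x072000), clock_lfsr(r1, 19, 0x072000) advances the register by
one step.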

However, DX11 has introduced the "NSA instruction" (bit population
count) to the GPU instruction set. Using this to drive the LFSRs would
make the non-bitsliced code faster than before, and in my implementation
at least, processing two chains per thread instead of just one would
give another speed increase (better GPU utilization, by filling slots
that are otherwise unused due to data dependencies). A5/1 could be
clocked forward with ~32 operations, making the theoretical maximum
speed about the same as what the bitsliced code achieves in practice. So
if efficiency can be increased by removing invalid states, non-bitsliced
code may actually take the lead again.
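
As a rough illustration of the population count idea, the Python sketch
below clocks A5/1 once, computing each register's feedback bit as the
parity of popcount(register AND tap mask). The register widths, tap
positions and clocking bits follow the commonly published description of
A5/1; the function names are assumptions for this example, and this is
not the GPU implementation, where popcount would be a single hardware
instruction.

```python
# Hypothetical sketch: one majority-clocked A5/1 step driven by popcount.
# Geometry per the commonly published A5/1 description:
#   R1: 19 bits, taps 13,16,17,18, clocking bit 8
#   R2: 22 bits, taps 20,21,       clocking bit 10
#   R3: 23 bits, taps 7,20,21,22,  clocking bit 10
R1_MASK, R1_TAPS, R1_CLK = (1 << 19) - 1, 0x072000, 8
R2_MASK, R2_TAPS, R2_CLK = (1 << 22) - 1, 0x300000, 10
R3_MASK, R3_TAPS, R3_CLK = (1 << 23) - 1, 0x700080, 10


def popcount(x):
    """Number of set bits; stands in for the hardware instruction."""
    return bin(x).count("1")


def clock_register(reg, mask, taps):
    """Shift left by one; the new bit 0 is the parity of the tapped bits."""
    return ((reg << 1) & mask) | (popcount(reg & taps) & 1)


def step(r1, r2, r3):
    """Advance A5/1 by one majority-clocked step and return the output bit."""
    c1 = (r1 >> R1_CLK) & 1
    c2 = (r2 >> R2_CLK) & 1
    c3 = (r3 >> R3_CLK) & 1
    maj = 1 if c1 + c2 + c3 >= 2 else 0
    # Only registers whose clocking bit agrees with the majority move.
    if c1 == maj:
        r1 = clock_register(r1, R1_MASK, R1_TAPS)
    if c2 == maj:
        r2 = clock_register(r2, R2_MASK, R2_TAPS)
    if c3 == maj:
        r3 = clock_register(r3, R3_MASK, R3_TAPS)
    out = ((r1 >> 18) ^ (r2 >> 21) ^ (r3 >> 22)) & 1
    return r1, r2, r3, out
```

Calling step repeatedly on the three register values walks a single
chain forward; on the GPU, two such independent states per thread would
give the scheduler work to interleave.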

f


_______________________________________________
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
