On Mon, 2010-01-11 at 05:09 +0100, sascha wrote:
> due to the bitslicing none of those optimizations are applicable.
> I was expecting a smaller penalty also, due to the fact that i could
> avoid DRAM access.

If there are real increases in efficiency to be had here, we could move away from the bitslicing approach. Non-bitsliced code runs at around 180 chains per second on a single GPU, whereas bitsliced code runs at 500. I did notice, however, that non-bitsliced code was running at around the same speed when using tables and a clever multiplication trick to drive the LFSRs.

However, DX11 has introduced the "NSA instruction" (bit population count) to the GPU instruction set. Using it to drive the LFSRs would make the non-bitsliced code faster than before, and, in my implementation at least, processing two chains per thread instead of just one would give another speed increase (better GPU utilization, by filling slots that are left unused due to data dependencies).

A5/1 could be clocked forward with ~32 operands, making the theoretical maximum speed about the same as the bitsliced speeds are in practice. So if efficiency can be increased by removing invalid states, non-bitsliced code may actually take the lead again.

f
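
To make the population-count idea concrete, here is a minimal sketch of non-bitsliced A5/1 clocking in which the feedback bit of each LFSR is the parity of a popcount over the tapped bits. It is an illustration only, not the code discussed in this thread: the register masks and tap positions follow the widely circulated A5/1 reference description, and the helper names (popcount, clock_one, clock, output_bit) are made up for this example. On a GPU the popcount helper would map to a single hardware instruction.

```python
# Register layout from the commonly published A5/1 description (assumed here).
R1_MASK, R1_TAPS, R1_MID = 0x07FFFF, 0x072000, 0x000100  # 19 bits, taps 13,16,17,18
R2_MASK, R2_TAPS, R2_MID = 0x3FFFFF, 0x300000, 0x000400  # 22 bits, taps 20,21
R3_MASK, R3_TAPS, R3_MID = 0x7FFFFF, 0x700080, 0x000400  # 23 bits, taps 7,20,21,22


def popcount(x):
    """Number of set bits; stands in for a hardware popcount instruction."""
    return bin(x).count("1")


def clock_one(reg, mask, taps):
    """Shift one LFSR left; the new low bit is the parity of the tapped bits."""
    feedback = popcount(reg & taps) & 1
    return ((reg << 1) & mask) | feedback


def clock(r1, r2, r3):
    """Majority clocking: a register steps only if its middle bit agrees
    with the majority of the three middle bits."""
    b1, b2, b3 = bool(r1 & R1_MID), bool(r2 & R2_MID), bool(r3 & R3_MID)
    maj = (b1 + b2 + b3) >= 2
    if b1 == maj:
        r1 = clock_one(r1, R1_MASK, R1_TAPS)
    if b2 == maj:
        r2 = clock_one(r2, R2_MASK, R2_TAPS)
    if b3 == maj:
        r3 = clock_one(r3, R3_MASK, R3_TAPS)
    return r1, r2, r3


def output_bit(r1, r2, r3):
    """Keystream bit: XOR of the top bit of each register."""
    return ((r1 >> 18) ^ (r2 >> 21) ^ (r3 >> 22)) & 1
```

Each register update here is one AND, one popcount, one shift, one mask and one OR, which is the kind of per-step cost the popcount approach is meant to keep low. Interleaving two independent (r1, r2, r3) states in the same thread would be the "two chains per thread" variant.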