On Thu, Jan 07, 2010 at 02:55:36PM +0100, M vd S wrote:
> On 1/6/10 9:27 PM, sascha wrote:
> > On Wed, Jan 06, 2010 at 04:04:37PM +0100, M vd S wrote:
>
> x86 is slow for bit-operations, true. But its L2 cache is huge (8 MB per
> CPU here). And I think the number of Xeons in the world is orders of
> magnitude bigger than the number of GPUs. Ops/watt are not our biggest
> concern.
>
> Most bit-operations can be replaced with smart lookup tables. The A5/1
> LFSRs are small enough to allow useful tables to fit in the L2 cache.
> Mapping 20 bits of input to 16 bits of information, for example,
> consumes only 2 MB.
>
> Has this been considered in the "order of magnitude" difference between
> the platforms?
>
> As a simple example, consider predicting the next 6 shifts. One would
> use the clock bits and their 5 neighbors - that's 18 bits of index. The
> table will then give 12 bits indicating the next 6 shift patterns (of
> which there are only 4 combinations, hence 2 bits per shift). This would
> cost only 1<<19 bytes or 0.5 MB.

This table lookup scheme has been used for the CUDA implementation,
computing 3x4 clocking bits from 4+4+4 register bits with a 12-bit lookup
table. A bigger table meant that all cores would access main memory
concurrently: it was faster when run on a single core and far slower when
run on the 216 cores of the GPU. The 12-bit table fits into the on-chip,
software-managed L1 cache.
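To make the 12-bit table concrete, here is a minimal Python sketch of how such a table could be built. The bit layout (4 bits per register, clocking tap first, then the next three bits that shift toward the tap) and all names are my assumptions for illustration, not the actual CUDA code:

```python
# Sketch of a 12-bit clocking lookup table: 4 bits per register in,
# 3 clocking bits per step x 4 steps out.
# Assumed layout: index = r1bits | (r2bits << 4) | (r3bits << 8), where
# bit 0 of each 4-bit field is the current clocking tap and bits 1..3 are
# the values that will reach the tap on the next shifts.

def majority(a, b, c):
    return (a & b) | (a & c) | (b & c)

def build_clock_table():
    table = [0] * (1 << 12)
    for idx in range(1 << 12):
        bits = [(idx >> 0) & 0xF, (idx >> 4) & 0xF, (idx >> 8) & 0xF]
        shifts = [0, 0, 0]  # how many times each register has clocked so far
        out = 0
        for step in range(4):
            taps = [(bits[i] >> shifts[i]) & 1 for i in range(3)]
            maj = majority(*taps)
            clocked = 0
            for i in range(3):
                if taps[i] == maj:  # a register clocks iff it agrees with the majority
                    shifts[i] += 1
                    clocked |= 1 << i
            out |= clocked << (3 * step)  # pack 3 clocking bits per step
        table[idx] = out
    return table
```

Since only 4 clocking combinations can occur per step (at least two registers always agree with the majority), the 3 bits per step could be compressed to 2, as the quoted message points out.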
Then we found out that x86 CPUs are exceptionally good at bit operations,
if you lay out your data correctly. Using 64 128-bit registers we could
compute the majority 128-way parallel using:

    r1r2 = ~(regs[R1BASE + 8] ^ regs[R2BASE + 10]);
    r1r3 = ~(regs[R1BASE + 8] ^ regs[R3BASE + 10]);
    r2r3 = ~(regs[R2BASE + 10] ^ regs[R3BASE + 10]);
    clockr1 = r1r2 | r1r3;
    clockr2 = r1r2 | r2r3;
    clockr3 = r1r3 | r2r3;

This is not 128 times faster, because shifting the registers has linear
complexity in the number of bits in A5/1: 64 loop iterations are required
to implement the shift for R1, R2 and R3.

On CUDA, the bitslicing code runs 3 times faster than the table lookup
code, using a mixture of 64-bit and 32-bit wide SIMD. The SSE bitslice
code is still being completed and has to be integrated into the framework.
The Cell also has 128-bit wide vectors, and the core of its implementation
is identical to the SSE implementation. It also has to be integrated.

The ATI implementation uses 320 128-bit wide cores at 700 MHz (I suppose)
on HD5xxx series cards, versus 240 (32/64)-bit wide cores at 1300 MHz on
NVIDIA GT200 series cards, and both produce about 500 million A5/1 rounds
per second. While the ATI version is available as a separate program, the
NVIDIA version is being beta tested at the moment. The SSE implementation
will be available very soon now.

_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
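As a sanity check on the bitsliced clocking formulas above, the same logic can be emulated in Python, with arbitrary-precision integers standing in for the 128-bit SIMD registers (the tap names are mine; each bit position is one independent A5/1 instance):

```python
# Emulate the bitsliced majority clocking: every bit slot of the wide
# words is one A5/1 instance, and tap1/tap2/tap3 stand in for
# regs[R1BASE + 8], regs[R2BASE + 10], regs[R3BASE + 10].
import random

WIDTH = 128               # SIMD width; arbitrary for this check
MASK = (1 << WIDTH) - 1

random.seed(1)
tap1 = random.getrandbits(WIDTH)
tap2 = random.getrandbits(WIDTH)
tap3 = random.getrandbits(WIDTH)

# Agreement masks: bit is 1 where the two taps are equal.
r1r2 = ~(tap1 ^ tap2) & MASK
r1r3 = ~(tap1 ^ tap3) & MASK
r2r3 = ~(tap2 ^ tap3) & MASK
# A register clocks where it agrees with at least one other register,
# i.e. where it is part of the majority.
clockr1 = r1r2 | r1r3
clockr2 = r1r2 | r2r3
clockr3 = r1r3 | r2r3

# Verify every slot against the scalar majority rule.
for slot in range(WIDTH):
    a = (tap1 >> slot) & 1
    b = (tap2 >> slot) & 1
    c = (tap3 >> slot) & 1
    maj = (a & b) | (a & c) | (b & c)
    assert ((clockr1 >> slot) & 1) == (1 if a == maj else 0)
    assert ((clockr2 >> slot) & 1) == (1 if b == maj else 0)
    assert ((clockr3 >> slot) & 1) == (1 if c == maj else 0)
```

In every slot at least two of the three clock bits are set, which matches the A5/1 rule that at least two registers shift on each step.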
