On Thu, Jan 07, 2010 at 02:55:36PM +0100, M vd S wrote:
> On 1/6/10 9:27 PM, sascha wrote:
> > On Wed, Jan 06, 2010 at 04:04:37PM +0100, M vd S wrote:
>
> x86 is slow for bit-operations, true. But its L2 cache is huge (8 MB per
> CPU here). And I think the number of Xeons in the world is orders of
> magnitude bigger than the number of GPUs. Ops/watt are not our biggest
> concern.
>
> Most bit-operations can be replaced with smart lookup tables. The A5/1
> LFSRs are small enough to allow useful tables to fit in the L2 cache.
> Mapping 20 bits of input to 16 bits of information, for example,
> consumes only 2 MB.
>
> Has this been considered in the "order of magnitude" difference between
> the platforms?
>
> As a simple example, consider predicting the next 6 shifts. One would
> use the clock bits and their 5 neighbors - that's 18 bits of index. The
> table will then give 12 bits indicating the next 6 shift patterns (of
> which there are only 4 combinations, hence 2 bits per shift). This would
> cost only 1<<19 bytes or 0.5 MB.

This table lookup scheme has been used for the CUDA implementation,
computing 3x4 clocking bits from 4+4+4 register bits with a 12-bit lookup
table. A bigger table meant that all cores would access main memory
concurrently: it was faster when run on a single core and far slower when
run on the 216 cores of the GPU. The 12-bit table fits into the on-chip,
software-managed L1 cache.
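To make the 12-bit table concrete, here is a minimal Python sketch of how such a table could be built. The bit layout (4 bits per register, clocking tap first, then the next three bits that shift toward the tap) and all names are my assumptions for illustration, not the actual CUDA code:

```python
# Sketch of a 12-bit clocking lookup table: 4 bits per register in,
# 3 clocking bits per step x 4 steps out.
# Assumed layout: index = r1bits | (r2bits << 4) | (r3bits << 8), where
# bit 0 of each 4-bit field is the current clocking tap and bits 1..3 are
# the values that will reach the tap on the next shifts.

def majority(a, b, c):
    return (a & b) | (a & c) | (b & c)

def build_clock_table():
    table = [0] * (1 << 12)
    for idx in range(1 << 12):
        bits = [(idx >> 0) & 0xF, (idx >> 4) & 0xF, (idx >> 8) & 0xF]
        shifts = [0, 0, 0]  # how many times each register has clocked so far
        out = 0
        for step in range(4):
            taps = [(bits[i] >> shifts[i]) & 1 for i in range(3)]
            maj = majority(*taps)
            clocked = 0
            for i in range(3):
                if taps[i] == maj:  # a register clocks iff it agrees with the majority
                    shifts[i] += 1
                    clocked |= 1 << i
            out |= clocked << (3 * step)  # pack 3 clocking bits per step
        table[idx] = out
    return table
```

Since only 4 clocking combinations can occur per step (at least two registers always agree with the majority), the 3 bits per step could be compressed to 2, as the quoted message points out.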
Then we found out that x86 CPUs are exceptionally good at bit operations,
if you lay out your data correctly. Using 64 128-bit registers we could
compute the majority 128-way parallel using:

    r1r2 = ~(regs[R1BASE + 8] ^ regs[R2BASE + 10]);
    r1r3 = ~(regs[R1BASE + 8] ^ regs[R3BASE + 10]);
    r2r3 = ~(regs[R2BASE + 10] ^ regs[R3BASE + 10]);
    clockr1 = r1r2 | r1r3;
    clockr2 = r1r2 | r2r3;
    clockr3 = r1r3 | r2r3;

This is not 128 times faster, because shifting the registers has linear
complexity in the number of bits in A5/1: 64 loop iterations are required
to implement the shift for R1, R2 and R3.

On CUDA, the bitslicing code runs 3 times faster than the table lookup
code, using a mixture of 64-bit and 32-bit wide SIMD. The SSE bitslice
code is still being completed and has to be integrated into the framework.
The Cell also has 128-bit wide vectors, and the core of its implementation
is identical to the SSE implementation. It also has to be integrated.

The ATI implementation uses 320 128-bit wide cores at 700 MHz (I suppose)
on HD5xxx series cards, versus 240 (32/64)-bit wide cores at 1300 MHz on
NVIDIA GT200 series cards, and both produce about 500 million A5/1 rounds
per second. While the ATI version is available as a separate program, the
NVIDIA version is being beta tested at the moment. The SSE implementation
will be available very soon now.

_______________________________________________
A51 mailing list
[email protected]
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
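As a sanity check on the bitsliced clocking formulas above, the same logic can be emulated in Python, with arbitrary-precision integers standing in for the 128-bit SIMD registers (the tap names are mine; each bit position is one independent A5/1 instance):

```python
# Emulate the bitsliced majority clocking: every bit slot of the wide
# words is one A5/1 instance, and tap1/tap2/tap3 stand in for
# regs[R1BASE + 8], regs[R2BASE + 10], regs[R3BASE + 10].
import random

WIDTH = 128               # SIMD width; arbitrary for this check
MASK = (1 << WIDTH) - 1

random.seed(1)
tap1 = random.getrandbits(WIDTH)
tap2 = random.getrandbits(WIDTH)
tap3 = random.getrandbits(WIDTH)

# Agreement masks: bit is 1 where the two taps are equal.
r1r2 = ~(tap1 ^ tap2) & MASK
r1r3 = ~(tap1 ^ tap3) & MASK
r2r3 = ~(tap2 ^ tap3) & MASK
# A register clocks where it agrees with at least one other register,
# i.e. where it is part of the majority.
clockr1 = r1r2 | r1r3
clockr2 = r1r2 | r2r3
clockr3 = r1r3 | r2r3

# Verify every slot against the scalar majority rule.
for slot in range(WIDTH):
    a = (tap1 >> slot) & 1
    b = (tap2 >> slot) & 1
    c = (tap3 >> slot) & 1
    maj = (a & b) | (a & c) | (b & c)
    assert ((clockr1 >> slot) & 1) == (1 if a == maj else 0)
    assert ((clockr2 >> slot) & 1) == (1 if b == maj else 0)
    assert ((clockr3 >> slot) & 1) == (1 if c == maj else 0)
```

In every slot at least two of the three clock bits are set, which matches the A5/1 rule that at least two registers shift on each step.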
