[A51] CPU implementation (more code)
I cleaned up the ATI brook code, and made a new version of the shared library, that uses only the CPU for generating chains. The code can be found here (linux only ATM): http://traxme.net/a5/a5_cpu.tar.gz There is a small python frontend that will generate real chains, (follow the instructions in my previous post about using the python script) - also the script can be edited to set the number of threads (cores) you wish to use. An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose there is room for assembly optimization here, but that isn't really the point. I am writing this code, to look into efficient table lookup once that tables have been generated. The idea is to spread the lookup part to machines that may not have a GPU. On my machine the code will cause 32 lookups to disk / second, hardly a cause for alarm, so a bog standard hard disk will do. But if the GPU is used for lookup, the rate will be much higher (320/sec) - and I am currently copying sorted tables to a slow USB flash drive, to determine of it can keep up with the pace with respect to lookup / reads. (It takes some hours to copy 1 million files down to this device, so I may change my approach to sorting, to something that is more in line with the 64kb block size commonly found on 8GB flash disks cheers, Frank ___ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
Re: [A51] CPU implementation (more code)
Nice work Frank. Just wondering, did anyone ever look at using SSE2 for faster CPU code (or maybe even tried)? Haven't looked at A5/1 very well, but skipping through A5Cpu.cpp (this is the relevant code, right?) the main loop seems to consist of mostly bitshifts, xors, ors and ands. That should perform alright in SSE. Daniël Frank A. Stevenson schreef: > I cleaned up the ATI brook code, and made a new version of the shared > library, that uses only the CPU for generating chains. The code can be > found here (linux only ATM): > > http://traxme.net/a5/a5_cpu.tar.gz > > There is a small python frontend that will generate real chains, (follow > the instructions in my previous post about using the python script) - > also the script can be edited to set the number of threads (cores) you > wish to use. > > An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose > there is room for assembly optimization here, but that isn't really the > point. I am writing this code, to look into efficient table lookup once > that tables have been generated. The idea is to spread the lookup part > to machines that may not have a GPU. > > On my machine the code will cause 32 lookups to disk / second, hardly a > cause for alarm, so a bog standard hard disk will do. > > But if the GPU is used for lookup, the rate will be much higher > (320/sec) - and I am currently copying sorted tables to a slow USB flash > drive, to determine of it can keep up with the pace with respect to > lookup / reads. (It takes some hours to copy 1 million files down to > this device, so I may change my approach to sorting, to something that > is more in line with the 64kb block size commonly found on 8GB flash disks > > cheers, > Frank > > > > > > > ___ > A51 mailing list > A51@lists.reflextor.com > http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51 > > ___ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
Re: [A51] CPU implementation (more code)
Regarding CPU implementations: I have attached a 64 wide SIMD implementation (single instruction 64 bits of data) that does 8 million a5/1 rounds per second on a single core of a 2ghz core2duo. I am not sure whether -O3 uses SSE automagically, so i assume not. I use 64 64bit wide memory locations (so effectively L1 cache) to store a51 registers vertically, and so i can generate a keystream bit for 64 independent a5/1 chains with a single set of instructions: (r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs)); so r1<18>(regs) points to a 64bit memory location that stores bit 18 of R1 for 64 A5/1 engines. the performance bottleneck is the shifting of R1,R2,R3, since you have to touch each register. the calculation of the keystream shown above uses 2 xor instructions, but since you calculate 64bits, effectively only 2/64 xor per chain is needed. The shift, which is essentially this: for (i = base - length + 1; i < base; ++i) { regs[i] = regs[i + nshift] & clock2 | regs[i + nshift + 1] & clock3; } uses 6 instructions 64 times (effectively 6 instructions per chain). with 8M A51 rounds per second (each round calculating 64bit), i can generate 512 million keystream bits per second, at 2ghz that is only 4 clocks per bit. it does not yet run very well on NVIDIA though, but i expect the playstation with 128bit registers to perform very well using this technique. Maybe someone can port that to SSE or Frank you can try to write an ATI implementation. if it is possible, those ATI cards would skyrocket, since with proper code generation this implementation can be made to store values in registers only and not use shared memory which is scarce on ATI. But if IIRC, ati gpus have 256 registers per thread, and only 64 * 3 are needed. (64 for R1,R2,R3; 64 for the result; 64 for the round function values; and some for the calculations) The handling of the round function is not implemented properly. If you want to calculate values to check against the reference implementation, you should calculate like 1000 A5/1 rounds (so the round function does not change). With advance 0, where the round function value is 0 and value ^ 0 == value, the reference implementation also calculates round function-less chains. > Nice work Frank. > > Just wondering, did anyone ever look at using SSE2 for faster CPU code > (or maybe even tried)? > Haven't looked at A5/1 very well, but skipping through A5Cpu.cpp (this > is the relevant code, right?) the main loop seems to consist of mostly > bitshifts, xors, ors and ands. That should perform alright in SSE. > > Daniël > > Frank A. Stevenson schreef: > > I cleaned up the ATI brook code, and made a new version of the shared > > library, that uses only the CPU for generating chains. The code can be > > found here (linux only ATM): > > > > http://traxme.net/a5/a5_cpu.tar.gz > > > > There is a small python frontend that will generate real chains, (follow > > the instructions in my previous post about using the python script) - > > also the script can be edited to set the number of threads (cores) you > > wish to use. > > > > An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose > > there is room for assembly optimization here, but that isn't really the > > point. I am writing this code, to look into efficient table lookup once > > that tables have been generated. The idea is to spread the lookup part > > to machines that may not have a GPU. > > > > On my machine the code will cause 32 lookups to disk / second, hardly a > > cause for alarm, so a bog standard hard disk will do. > > > > But if the GPU is used for lookup, the rate will be much higher > > (320/sec) - and I am currently copying sorted tables to a slow USB flash > > drive, to determine of it can keep up with the pace with respect to > > lookup / reads. (It takes some hours to copy 1 million files down to > > this device, so I may change my approach to sorting, to something that > > is more in line with the 64kb block size commonly found on 8GB flash disks > > > > cheers, > > Frank > > > > > > > > > > > > > > ___ > > A51 mailing list > > A51@lists.reflextor.com > > http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51 > > > > > ___ > A51 mailing list > A51@lists.reflextor.com > http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51 #include #include #include #include //#include template void vconvert(T * in, T * out, int n, int shift, T mask) { T notmask = ~mask; for (int i = 0, j = n / 2, outi = 0; i < n / 2; ++i, ++j, outi += 2) { out[outi] = in[i] & mask | in[j] << shift & notmask; out[outi + 1] = in[j] & notmask | in[i] >> shift & mask; } } uint64_t masks[6] = { 0x, 0x, 0x00ff00ff, 0x0f0f0f0f, 0x, 0x }; template T * do_vconvert(T * in, T * out, int n, uint64_t * masks) { int shift = 32; for (int i = 0; i < 6; +
Re: [A51] CPU implementation (more code)
sascha wrote: > Regarding CPU implementations: > > I have attached a 64 wide SIMD implementation (single instruction 64 > bits of data) > that does 8 million a5/1 rounds per second on a single core > of a 2ghz core2duo. I am not sure whether -O3 uses SSE automagically, > so i assume not. > I use 64 64bit wide memory locations (so effectively L1 cache) to > store a51 registers vertically, and so i can generate a keystream bit > for 64 independent a5/1 chains with a single set of instructions: > > (r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs)); > > so r1<18>(regs) points to a 64bit memory location that stores bit 18 > of R1 for 64 A5/1 engines. > > the performance bottleneck is the shifting of R1,R2,R3, since you have > to touch each register. > > I like this idea (it would be great to make a practical implementation) ... without giving to much thought about it, it seems that the shifting can be avoided by unrolling the loop, and using the 64 registers as a ring-buffer. Hurray: more templates :-) At the end of 64 iterations, you could then the 15 lsb-registers together, and inspect the result for 0 bits, (distinguising points) - The round functions can be held in another set of 64 bits registers/memory holding and the relevant round function(s) can be updated across the registers. Frank ___ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
Re: [A51] CPU implementation (more code)
Frank A. Stevenson wrote: > I like this idea (it would be great to make a practical implementation) > ... without giving to much thought about it, it seems that the shifting > can be avoided by unrolling the loop, and using the 64 registers as a > ring-buffer. I obviously did not think to hard about this :-( The amount of shifting is a problem, especially with the way ATI handles branches. (shifting only 2 around takes the same amount of time as shifting all 3) - I will need to think of something cleverer. :-) Frank ___ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
Re: [A51] CPU implementation (more code)
Forward to the list from PM > So you now use m128i registers and do instructions on 128 bit wide yes > values? I assume that what you are doing is what is usually called bit > slicing? yes > For both Cell and SSE2, do you use interlacing of instructions to fill > up the pipeline (and usually improve speed by a factor 2 or 3)? (not > sure if that really helps on bit sliced calculations though) i did not pay special attention to the instruction scheduling. The heavy load loop is unrolled and gcc will have plenty of opportunity to reorder the instructions. Also the loop body is executed ~ 650m times per second on a 2ghz core, effectivly using 3 clocks per loop for the 7 instructions (unrolled) on the cell spu it is half as fast despite higher clocks, so i should try to optimize there. (but the SPU is also not fully multiscalar, so should be half as fast) with the SPU being in-order, instruction instruction reordering and scheduling may pay off. This function does 90% of the work: nshift == 0, lenght= 19 or 22 or 23, RT=ssevector(uint64, uint64) template void lsh_reg(RT * regs, RT clock3) { RT clock2 = ~clock3; int i; for (i = base - length + 1; i < base; ++i) { regs[i] = regs[i + nshift] & clock2 | regs[i + nshift + 1] & clock3; } Before loop unrolling: xmm0: clock2 xmm6: ~clock2 720(...): regs[i + nshift] 736(...): regs[i + nshift + 1] L41: .loc 2 110 0 movdqa 720(%ecx,%eax), %xmm1 movdqa 736(%ecx,%eax), %xmm2 L18: .loc 2 111 0 pand%xmm0, %xmm1 pand%xmm6, %xmm2 por %xmm2, %xmm1 movdqa %xmm1, 720(%ecx,%eax) addl$16, %eax .loc 2 110 0 cmpl$288, %eax jne L41 After loop unrolling this pattern repeats: xmm2, xmm6: clock2, ~clock2, 720(), 736() regs[...] movdqa %xmm2, %xmm0 movdqa %xmm6, %xmm1 pand720(%ebx,%eax), %xmm0 pand736(%ebx,%eax), %xmm1 por %xmm1, %xmm0 movdqa %xmm0, 720(%ebx,%eax) leal32(%edx), %eax __ GRATIS für alle WEB.DE-Nutzer: Die maxdome Movie-FLAT! Jetzt freischalten unter http://movieflat.web.de ___ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
Re: [A51] CPU implementation (more code)
> Regarding CPU implementations: Some good news: The SSE version achieves 18 chains/second per core of my 2ghz core2duo. The Cell version runs at 10 chains/second per SPU (3.2ghz). some polishing has to be done before the code can be released. Anyone here paying the rent for roadrunner for 24 hours? ___ A51 mailing list A51@lists.reflextor.com http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51