[A51] CPU implementation (more code)

2009-11-02 Thread Frank A. Stevenson
I cleaned up the ATI brook code, and made a new version of the shared 
library, that uses only the CPU for generating chains. The code can be 
found here (linux only ATM):

http://traxme.net/a5/a5_cpu.tar.gz

There is a small python frontend that will generate real chains, (follow 
the instructions in my previous post about using the python script) - 
also the script can be edited to set the number of threads (cores) you 
wish to use.

An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose 
there is room for assembly optimization here, but that isn't really the 
point. I am writing this code, to look into efficient table lookup once 
that tables have been generated. The idea is to spread the lookup part 
to machines that may not have a GPU.

On my machine the code will cause 32 lookups to disk / second, hardly a 
cause for alarm, so a bog standard hard disk will do.

But if the GPU is used for lookup, the rate will be much higher 
(320/sec) - and I am currently copying sorted tables to a slow USB flash 
drive, to determine of it can keep up with the pace with respect to 
lookup / reads. (It takes some hours to copy 1 million files down to 
this device, so I may change my approach to sorting, to something that 
is more in line with the 64kb block size commonly found on 8GB flash disks

cheers,
  Frank






___
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51


Re: [A51] CPU implementation (more code)

2009-11-02 Thread Daniel Niggebrugge
Nice work Frank.

Just wondering, did anyone ever look at using SSE2 for faster CPU code 
(or maybe even tried)?
Haven't looked at A5/1 very well, but skipping through A5Cpu.cpp (this 
is the relevant code, right?) the main loop seems to consist of mostly 
bitshifts, xors, ors and ands. That should perform alright in SSE.

Daniël

Frank A. Stevenson schreef:
> I cleaned up the ATI brook code, and made a new version of the shared 
> library, that uses only the CPU for generating chains. The code can be 
> found here (linux only ATM):
>
> http://traxme.net/a5/a5_cpu.tar.gz
>
> There is a small python frontend that will generate real chains, (follow 
> the instructions in my previous post about using the python script) - 
> also the script can be edited to set the number of threads (cores) you 
> wish to use.
>
> An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose 
> there is room for assembly optimization here, but that isn't really the 
> point. I am writing this code, to look into efficient table lookup once 
> that tables have been generated. The idea is to spread the lookup part 
> to machines that may not have a GPU.
>
> On my machine the code will cause 32 lookups to disk / second, hardly a 
> cause for alarm, so a bog standard hard disk will do.
>
> But if the GPU is used for lookup, the rate will be much higher 
> (320/sec) - and I am currently copying sorted tables to a slow USB flash 
> drive, to determine of it can keep up with the pace with respect to 
> lookup / reads. (It takes some hours to copy 1 million files down to 
> this device, so I may change my approach to sorting, to something that 
> is more in line with the 64kb block size commonly found on 8GB flash disks
>
> cheers,
>   Frank
>
>
>
>
>
>
> ___
> A51 mailing list
> A51@lists.reflextor.com
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
>
>   
___
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51


Re: [A51] CPU implementation (more code)

2009-11-02 Thread sascha
Regarding CPU implementations:

I have attached a 64 wide SIMD implementation (single instruction 64
bits of data)
that does 8 million a5/1 rounds per second on a single core
of a 2ghz core2duo. I am not sure whether -O3 uses SSE automagically,
so i assume not.
I use 64 64bit wide memory locations (so effectively L1 cache) to
store a51 registers vertically, and so i can generate a keystream bit
for 64 independent a5/1 chains with a single set of instructions:

(r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs));

so r1<18>(regs) points to a 64bit memory location that stores bit 18
of R1 for 64 A5/1 engines.

the performance bottleneck is the shifting of R1,R2,R3, since you have
to touch each register.

the calculation of the keystream shown above uses 2 xor instructions,
but since you calculate 64bits, effectively only 2/64 xor per
chain is needed.
The shift, which is essentially this:
  for (i = base - length + 1; i < base; ++i) {
regs[i] = regs[i + nshift] & clock2 | regs[i + nshift + 1] & clock3;
  }
uses 6 instructions 64 times (effectively 6 instructions per chain).

with 8M A51 rounds per second (each round calculating 64bit), i can generate
512 million keystream bits per second, at 2ghz that is only 4 clocks per
bit.

it does not yet run very well on NVIDIA though, but i expect the playstation
with 128bit registers to perform very well using this technique.

Maybe someone can port that to SSE or Frank you can try to write
an ATI implementation. if it is possible, those ATI cards would skyrocket,
since with proper code generation this implementation can be made to
store values in registers only and not use shared memory which is scarce
on ATI. But if IIRC, ati gpus have 256 registers per thread, and only
64 * 3 are needed. (64 for R1,R2,R3; 64 for the result; 64 for the round
function values; and some for the calculations)

The handling of the round function is not implemented properly.
If you want to calculate values to check against the reference implementation,
you should calculate like 1000 A5/1 rounds (so the round function does not
change).
With advance 0, where the round function value is 0 and value ^ 0 == value,
the reference implementation also calculates round function-less chains.

> Nice work Frank.
> 
> Just wondering, did anyone ever look at using SSE2 for faster CPU code 
> (or maybe even tried)?
> Haven't looked at A5/1 very well, but skipping through A5Cpu.cpp (this 
> is the relevant code, right?) the main loop seems to consist of mostly 
> bitshifts, xors, ors and ands. That should perform alright in SSE.
> 
> Daniël
> 
> Frank A. Stevenson schreef:
> > I cleaned up the ATI brook code, and made a new version of the shared 
> > library, that uses only the CPU for generating chains. The code can be 
> > found here (linux only ATM):
> >
> > http://traxme.net/a5/a5_cpu.tar.gz
> >
> > There is a small python frontend that will generate real chains, (follow 
> > the instructions in my previous post about using the python script) - 
> > also the script can be edited to set the number of threads (cores) you 
> > wish to use.
> >
> > An AMD Phenom x4 @ 3.2 GHz makes around 16 chains / second. I suppose 
> > there is room for assembly optimization here, but that isn't really the 
> > point. I am writing this code, to look into efficient table lookup once 
> > that tables have been generated. The idea is to spread the lookup part 
> > to machines that may not have a GPU.
> >
> > On my machine the code will cause 32 lookups to disk / second, hardly a 
> > cause for alarm, so a bog standard hard disk will do.
> >
> > But if the GPU is used for lookup, the rate will be much higher 
> > (320/sec) - and I am currently copying sorted tables to a slow USB flash 
> > drive, to determine of it can keep up with the pace with respect to 
> > lookup / reads. (It takes some hours to copy 1 million files down to 
> > this device, so I may change my approach to sorting, to something that 
> > is more in line with the 64kb block size commonly found on 8GB flash disks
> >
> > cheers,
> >   Frank
> >
> >
> >
> >
> >
> >
> > ___
> > A51 mailing list
> > A51@lists.reflextor.com
> > http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
> >
> >   
> ___
> A51 mailing list
> A51@lists.reflextor.com
> http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51
#include 
#include 
#include 
#include 
//#include 

template 
void vconvert(T * in, T * out, int n, int shift, T mask) {
  T notmask = ~mask;
  for (int i = 0, j = n / 2, outi = 0; i < n / 2; ++i, ++j, outi += 2) {
out[outi] = in[i] & mask | in[j] << shift & notmask;
out[outi + 1] = in[j] & notmask | in[i] >> shift & mask;
  }
}

uint64_t masks[6] = {
  0x,
  0x,
  0x00ff00ff,
  0x0f0f0f0f,
  0x,
  0x
};

template 
T * do_vconvert(T * in, T * out, int n, uint64_t * masks) {
  int shift = 32;
  for (int i = 0; i < 6; +

Re: [A51] CPU implementation (more code)

2009-11-03 Thread Frank A. Stevenson
sascha wrote:
> Regarding CPU implementations:
>
> I have attached a 64 wide SIMD implementation (single instruction 64
> bits of data)
> that does 8 million a5/1 rounds per second on a single core
> of a 2ghz core2duo. I am not sure whether -O3 uses SSE automagically,
> so i assume not.
> I use 64 64bit wide memory locations (so effectively L1 cache) to
> store a51 registers vertically, and so i can generate a keystream bit
> for 64 independent a5/1 chains with a single set of instructions:
>
> (r1<18>(regs) ^ r2<21>(regs) ^ r3<22>(regs));
>
> so r1<18>(regs) points to a 64bit memory location that stores bit 18
> of R1 for 64 A5/1 engines.
>
> the performance bottleneck is the shifting of R1,R2,R3, since you have
> to touch each register.
>
>   
I like this idea (it would be great to make a practical implementation) 
... without giving to much thought about it, it seems that the shifting 
can be avoided by unrolling the loop, and using the 64 registers as a 
ring-buffer. Hurray: more templates :-)

At the end of 64 iterations, you could then  the 15 
lsb-registers together, and inspect the result for 0 bits, 
(distinguising points) - The round functions can be held in another set 
of 64 bits registers/memory holding  and the 
relevant round function(s) can be updated across the registers.

Frank


___
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51


Re: [A51] CPU implementation (more code)

2009-11-03 Thread Frank A. Stevenson
Frank A. Stevenson wrote:
> I like this idea (it would be great to make a practical implementation) 
> ... without giving to much thought about it, it seems that the shifting 
> can be avoided by unrolling the loop, and using the 64 registers as a 
> ring-buffer.
I obviously did not think to hard about this :-( The amount of shifting
is a problem, especially with the way ATI handles branches. (shifting
only 2 around takes the same amount of time as shifting all 3) - I will
need to think of something cleverer. :-)

Frank


___
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51


Re: [A51] CPU implementation (more code)

2009-11-03 Thread Sascha Krissler
Forward to the list from PM


> So you now use m128i registers and do instructions on 128 bit wide 

yes

> values? I assume that what you are doing is what is usually called bit 
> slicing?

yes

> For both Cell and SSE2, do you use interlacing of instructions to fill 
> up the pipeline (and usually improve speed by a factor 2 or 3)? (not 
> sure if that really helps on bit sliced calculations though)

i did not pay special attention to the instruction scheduling.
The heavy load loop is unrolled and gcc will have plenty of opportunity
to reorder the instructions.
Also the loop body is executed ~ 650m times per second on a 2ghz core,
effectivly using 3 clocks per loop for the 7 instructions (unrolled)

on the cell spu it is half as fast despite higher clocks, so i should try to 
optimize there.
(but the SPU is also not fully multiscalar, so should be half as fast)

with the SPU being in-order, instruction instruction reordering and scheduling 
may pay off.

This function does 90% of the work:
nshift == 0, lenght= 19 or 22 or 23, RT=ssevector(uint64, uint64)

template 
void lsh_reg(RT * regs, RT clock3) {
  RT clock2 = ~clock3;
  int i;
  for (i = base - length + 1; i < base; ++i) {
regs[i] = regs[i + nshift] & clock2 | regs[i + nshift + 1] & clock3;
  }



Before loop unrolling:

xmm0: clock2
xmm6: ~clock2
720(...): regs[i + nshift]
736(...): regs[i + nshift + 1]
 
L41:
.loc 2 110 0
movdqa  720(%ecx,%eax), %xmm1
movdqa  736(%ecx,%eax), %xmm2
L18:
.loc 2 111 0
pand%xmm0, %xmm1
pand%xmm6, %xmm2
por %xmm2, %xmm1
movdqa  %xmm1, 720(%ecx,%eax)
addl$16, %eax
.loc 2 110 0
cmpl$288, %eax
jne L41

After loop unrolling this pattern repeats:
xmm2, xmm6: clock2, ~clock2, 720(), 736() regs[...]

movdqa  %xmm2, %xmm0
movdqa  %xmm6, %xmm1
pand720(%ebx,%eax), %xmm0
pand736(%ebx,%eax), %xmm1
por %xmm1, %xmm0
movdqa  %xmm0, 720(%ebx,%eax)
leal32(%edx), %eax


__
GRATIS für alle WEB.DE-Nutzer: Die maxdome Movie-FLAT!
Jetzt freischalten unter http://movieflat.web.de

___
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51


Re: [A51] CPU implementation (more code)

2009-11-03 Thread sascha
> Regarding CPU implementations:
Some good news:

The SSE version achieves 18 chains/second per core of my 2ghz core2duo.
The Cell version runs at 10 chains/second per SPU (3.2ghz).

some polishing has to be done before the code can be released.
Anyone here paying the rent for roadrunner for 24 hours?

___
A51 mailing list
A51@lists.reflextor.com
http://lists.lists.reflextor.com/cgi-bin/mailman/listinfo/a51