Did you consider the other alternative?  If you work on 32-bit chunks
instead of 64-bit chunks (either load them with lwz, or split them
after loading with ld), you can add them up with a regular non-carrying
add, which isn't serialising like adde; this also allows unrolling the
loop (using several accumulators instead of just one).  Since your
registers are 64-bit, you can sum 16GB of data before ever getting a
carry out.
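
For illustration only, the 32-bit-chunk idea might look roughly like this
in C. This is a sketch, not the kernel's checksum routine or the patch
under discussion: sum32_two_acc() and fold16() are hypothetical names, and
the sketch assumes the length is a multiple of 4 and the total is under
2^32 words (4 bytes each, hence the 16GB figure), so the 64-bit
accumulators cannot overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: sum a buffer as 32-bit words using plain adds
 * into two independent 64-bit accumulators (one level of unrolling).
 * Assumes len is a multiple of 4 and len <= 16GB, so no carry is lost.
 */
static uint64_t sum32_two_acc(const unsigned char *buf, size_t len)
{
	uint64_t acc0 = 0, acc1 = 0;
	size_t i = 0;

	/* Two words per iteration; the two adds do not depend on each other. */
	for (; i + 8 <= len; i += 8) {
		uint32_t w0, w1;

		memcpy(&w0, buf + i, 4);	/* memcpy avoids unaligned access */
		memcpy(&w1, buf + i + 4, 4);
		acc0 += w0;
		acc1 += w1;
	}

	/* At most one 32-bit word is left over. */
	if (i + 4 <= len) {
		uint32_t w;

		memcpy(&w, buf + i, 4);
		acc0 += w;
	}

	return acc0 + acc1;
}

/* Hypothetical helper: fold a 64-bit sum to 16 bits with end-around carry. */
static uint16_t fold16(uint64_t sum)
{
	sum = (sum & 0xffffffffULL) + (sum >> 32);
	sum = (sum & 0xffffffffULL) + (sum >> 32);	/* absorb the carry */
	sum = (sum & 0xffff) + (sum >> 16);
	sum = (sum & 0xffff) + (sum >> 16);		/* absorb the carry */
	return (uint16_t)sum;
}
```

The carries are only folded back in once, at the end, which is what removes
the add-to-add dependency that adde creates through the carry bit. Whether
that is actually faster depends on the machine, as the numbers below show.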

Or maybe the bottleneck here is purely the memory bandwidth?
I think the main bottleneck is the bandwidth/latency of memory.

When I sent the patch out I hadn't thought about eliminating the e from the add by using 32-bit chunks. So I tried it today: I converted the existing function to use plain add instead of adde (it was already working on only 32 bits at a time) and got 1.5% to 15.7% faster on Power5, which is nice, but still way behind the new function in every test case. I then added one level of unrolling to that (using 2 accumulators), and the result ranged from 59% slower to 10% faster on Power5 depending on input. That is quite a bit slower than I would have expected (I would have expected basically even), but that's what got measured. The comment in the existing function says unrolling the loop doesn't help because the bdnz has zero overhead, so I guess the unrolling hurt more than I expected.

In any case I have now thought about it and don't think it will work out.


Signed-off-by: Joel Schopp<[EMAIL PROTECTED]>

You missed a space there.
If at first you don't succeed...

Signed-off-by: Joel Schopp <[EMAIL PROTECTED]>
_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
