Siarhei Siamashka wrote:

By the way, I searched for asm-optimized versions of memcpy for ARM
platforms. I had not done that before because I mistakenly assumed
the glibc memcpy/memset implementations were already optimized as
much as possible.

It appears that there is a fast memcpy implementation in uclibc, and
there are many other implementations around as well. It seems I tried
to reinvent the wheel. Too bad if spending the whole two days of the
weekend turns out to have been a waste of time :( Well, at least I did
not try to steal someone else's code and 'copyright' it.

As I said before, my observations show that it is better to align
writes on 16-byte boundaries, at least on the Nokia 770. The code I
posted is proof-of-concept code, and it shows that it is faster than
the default memset/memcpy on the device. I'm going to compare my code
with the uclibc implementation; if uclibc is in fact faster or has the
same performance, I'll have to apologize for causing this mess and go
away ashamed.
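
To make the alignment idea concrete, here is a minimal C sketch. It is
not the asm code from the test program: the function name
memset_align16_sketch is made up for illustration, and the 16-byte
memcpy() in the body is only a portable stand-in for the multi-register
stores that a real ARM implementation would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical illustration: fill n bytes at dst with c, making sure
 * that every 16-byte block write starts on a 16-byte boundary. */
void *memset_align16_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    unsigned char block[16];
    size_t i;

    assert(dst != NULL || n == 0);

    /* Prepare one 16-byte block filled with the byte value. */
    for (i = 0; i < sizeof block; i++)
        block[i] = (unsigned char)c;

    /* Head: single bytes until p is 16-byte aligned or n runs out. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* Body: whole 16-byte blocks, each starting on a 16-byte boundary.
     * A fixed-size memcpy() keeps this portable C; asm code would use
     * wide store instructions here instead. */
    while (n >= 16) {
        memcpy(p, block, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: the remaining 0-15 bytes. */
    while (n > 0) {
        *p++ = (unsigned char)c;
        n--;
    }

    return dst;
}
```

Whether the aligned body actually helps depends on the CPU and memory
system, so it has to be measured on the target device.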

Added uclibc benchmark to the test program:
--- running correctness tests ---
all the correctness tests passed
--- running performance tests (memory bandwidth benchmark) ---:
memset() memory bandwidth: 122.64MB/s
memset_uclibc() memory bandwidth: 121.93MB/s
memset8() memory bandwidth: 279.62MB/s
memcpy() memory bandwidth (perfectly aligned): 102.30MB/s
memcpy_uclibc() memory bandwidth (perfectly aligned): 110.96MB/s
memcpy16() memory bandwidth (perfectly aligned): 110.96MB/s
memcpy() memory bandwidth (16-bit aligned): 69.44MB/s
memcpy_uclibc() memory bandwidth (16-bit aligned): 49.58MB/s
memcpy16() memory bandwidth (16-bit aligned): 99.86MB/s
--- testing performance for random blocks (size 0-15 bytes) ---
memset time: 0.410
memset8 time: 0.270
--- testing performance for random blocks (size 0-511 bytes) ---
memset time: 2.360
memset8 time: 1.140

So while uclibc also uses the STM instruction to copy large chunks of
memory at once, it does not use 16-byte alignment and performs quite
poorly on poorly aligned data (49.58MB/s against 99.86MB/s for
memcpy16 in the 16-bit aligned case).
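
For anyone who wants to reproduce this kind of comparison, a bandwidth
figure can be obtained roughly as sketched below. This is an assumption
about the method, not the code of the test program: the names copy_fn
and copy_bandwidth are made up, 1MB is taken as 10^6 bytes, and clock()
is used only because it is available everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any function with the memcpy() signature can be plugged in here. */
typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical helper: copy `size` bytes `repeats` times with `fn`
 * and return the throughput in MB/s (1MB = 10^6 bytes here).
 * Returns 0.0 if the run was too short for clock() to measure. */
double copy_bandwidth(copy_fn fn, void *dst, const void *src,
                      size_t size, int repeats)
{
    clock_t start, end;
    double seconds;
    int i;

    assert(fn != NULL && repeats > 0);

    start = clock();
    for (i = 0; i < repeats; i++)
        fn(dst, src, size);
    end = clock();

    seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (seconds <= 0.0)
        return 0.0;

    return (double)size * repeats / seconds / 1e6;
}
```

With a pair of well-aligned buffers, calling
copy_bandwidth(memcpy, dst, src, size, n) would give a "perfectly
aligned" figure, and copy_bandwidth(memcpy, dst + 2, src + 2, size, n)
would give a "16-bit aligned" one. The buffers should be much larger
than the data cache, and the repeat count high enough for the run to
take at least a few seconds.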

It was good that I did not search for other memcpy implementations
first, but tried to make a new one. Beginner's luck, probably :)
Without looking at other implementations, I just tried different
instructions (including the STRD instruction from the new DSP
instruction set), instruction orderings and data block sizes in the
memset32 function, and almost accidentally stumbled upon the
combination which seems to work better.

That's not really an 'invention', as there are not many things that
can be varied within the dozen or so instructions needed for a memset
function. It is strange that this 16-byte alignment trick has not been
used in either uclibc or glibc until now. Another possibility is that
this improvement is specific to the Nokia 770 and nobody else ever
encountered it or needed it. Well, do we really care anyway? ;)

Now I just really badly want to see the benchmark results from some
other CPU, preferably an Intel XScale :)


_______________________________________________
maemo-developers mailing list
maemo-developers@maemo.org
https://maemo.org/mailman/listinfo/maemo-developers
