Hi,

I've been following this thread with half an eye, and thought I'd
throw in an observation.

Nowhere in this thread does the word "cache" appear. In your
benchmarks, are you invalidating the cache between benchmark runs?  If
the cache is cold on the first run (which is always the "slower" libc
version) and hot on subsequent runs, that will distort your
results.

Maybe you've already accounted for this, but I don't think it's been
explicitly mentioned so far.  It could explain why others are not
seeing the same dramatic speedups that you are reporting.
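
To make the concern concrete, here is a minimal sketch of how each
candidate could be timed from the same cache state. It is not taken from
the attached test program: copy_fn, evict_cache(), time_copy() and
SCRATCH_BYTES are hypothetical names, the 1 MiB scratch size is only
assumed to be larger than the target's data cache, and clock() is an
assumed timing source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any routine with the memcpy signature (libc or a replacement). */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/*
 * Touch a scratch buffer that is assumed to be larger than the data
 * cache, so that whatever an earlier run left in the cache is pushed
 * out before the next measurement starts.
 */
static void evict_cache(void)
{
    enum { SCRATCH_BYTES = 1 << 20 };   /* assumption: > data cache size */
    static volatile unsigned char scratch[SCRATCH_BYTES];
    size_t i;

    for (i = 0; i < SCRATCH_BYTES; i += 16)
        scratch[i]++;
}

/*
 * Time `runs` copies of n bytes with fn and return the CPU time used,
 * in seconds. The cache is evicted first so that every candidate
 * starts from the same state, whatever order they are tested in.
 */
static double time_copy(copy_fn fn, void *dst, const void *src,
                        size_t n, int runs)
{
    clock_t start, end;
    int i;

    assert(fn != NULL && runs > 0);

    evict_cache();
    start = clock();
    for (i = 0; i < runs; i++)
        fn(dst, src, n);
    end = clock();

    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_copy(memcpy, dst, src, len, runs) and then the same for
each replacement routine should give every candidate a comparable
starting point. Running the candidates in a different order and checking
that the ranking holds would be another cheap sanity check.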

Cheers,

John

2009/3/24 vince <vi...@bluush.com>:
> Niels,
>
> After a closer review of the code, I found that unaligned copies were a
> lot slower than aligned ones. I've created another version of the routine
> that takes care of that. Attached to this email, you will find a
> simple program that I used to test this code. This program tests
> both aligned and unaligned (src & dst) copies with the 3 different
> implementations (libc memcpy, rev1 armasm memcpy, and rev2 armasm memcpy).
>
> Here is the output of the program running on an ARM9 AT91RM9200 using
> uClibc-0.9.30 and gcc-4.2.4 (armasm is rev1, and armasm2 is rev2):
>
> # ./memtest 500000
> 32bit src/dst Aligned test:
> Testing libc (0x4005a008 <==> 0x40243008 : 500000):
> 2.996949 sec
> Testing armasm (0x4005a008 <==> 0x40243008 : 500000):
> 1.331787 sec
> Testing armasm2 (0x4005a008 <==> 0x40243008 : 500000):
> 1.358246 sec
> The faster routine is armasm
>
> 16bit src/dst Aligned test:
> Testing libc (0x4005a00a <==> 0x4024300a : 500000):
> 2.983215 sec
> Testing armasm (0x4005a00a <==> 0x4024300a : 500000):
> 1.332214 sec
> Testing armasm2 (0x4005a00a <==> 0x4024300a : 500000):
> 1.358978 sec
> The faster routine is armasm
>
> 8bit src/dst Aligned test:
> Testing libc (0x4005a009 <==> 0x40243009 : 500000):
> 2.982209 sec
> Testing armasm (0x4005a009 <==> 0x40243009 : 500000):
> 1.331054 sec
> Testing armasm2 (0x4005a009 <==> 0x40243009 : 500000):
> 1.359162 sec
> The faster routine is armasm
>
> 16bit src Aligned test:
> Testing libc (0x4005a00a <==> 0x40243008 : 500000):
> 2.983734 sec
> Testing armasm (0x4005a00a <==> 0x40243008 : 500000):
> 2.571228 sec
> Testing armasm2 (0x4005a00a <==> 0x40243008 : 500000):
> 1.419556 sec
> The faster routine is armasm2
>
> 8bit src Aligned test:
> Testing libc (0x4005a009 <==> 0x40243008 : 500000):
> 2.984101 sec
> Testing armasm (0x4005a009 <==> 0x40243008 : 500000):
> 2.570343 sec
> Testing armasm2 (0x4005a009 <==> 0x40243008 : 500000):
> 1.419525 sec
> The faster routine is armasm2
>
> 16bit dst Aligned test:
> Testing libc (0x4005a008 <==> 0x4024300a : 500000):
> 2.983948 sec
> Testing armasm (0x4005a008 <==> 0x4024300a : 500000):
> 2.571563 sec
> Testing armasm2 (0x4005a008 <==> 0x4024300a : 500000):
> 1.418671 sec
> The faster routine is armasm2
>
> 8bit dst Aligned test:
> Testing libc (0x4005a008 <==> 0x40243009 : 500000):
> 2.983521 sec
> Testing armasm (0x4005a008 <==> 0x40243009 : 500000):
> 2.571258 sec
> Testing armasm2 (0x4005a008 <==> 0x40243009 : 500000):
> 1.418762 sec
> The faster routine is armasm2
>
>
> As you can see, rev2 works a lot better with unaligned buffers. I will
> update the patch to DirectFB to include this new version of the routine.
>
>
> As for big-endian, this version will ONLY work on little-endian targets,
> so a config directive will need to be set for the build to work on
> big-endian targets. I will include that in the patch.
>
> For now, it would be great if I could get some metrics from people to
> double-check my results.
>
> Regards,
>
> Vince
>
>
>
>
> On Mon, 2009-03-23 at 16:36 +0100, Niels Roest wrote:
>> Hi Vince,
>> I'm happy to include the patch.
>> I just have a few open questions; I hope somebody can clear them up.
>>
>> (1) memcpy is speed tested with (I think) aligned accesses (based on
>> D_MALLOC addresses), but I think we'll see a lot of unaligned memcpys
>> too, and that side of the implementation looks kinda weak. Anyone care
>> to give some figures for unaligned copies? Have a look at
>> direct_find_best_memcpy() in lib/direct/memcpy.c, and fiddle a bit with
>> buf1 and buf2.
>> (2) What happens on a big-endian ARM if I just include the patch? I'm
>> having trouble finding this dependency in the patch. We will need to fix
>> this, or put a show stopper somewhere for big-endian, so the patch
>> doesn't break something.
>>
>> Greets
>> Niels
>>
>> vince wrote:
>> > Hello,
>> >
>> > I've been working on trying to improve the performance of DirectFB 1.3.0
>> > on the ARM platform. The attached patch will replace the default libc
>> > memcpy with a faster implementation. I've tested this patch using an
>> > AT91RM9200, but it should work on other ARM targets.
>> >
>> > Hope this will be useful to others.
>> >
>> > Regards,
>> >
>> > Vince
>> >
>> > ------------------------------------------------------------------------
>> >
>> > _______________________________________________
>> > directfb-dev mailing list
>> > directfb-dev@directfb.org
>> > http://mail.directfb.org/cgi-bin/mailman/listinfo/directfb-dev
>>
>>
>



-- 
John Williams, PhD, B.Eng, B.IT
PetaLogix - Linux Solutions for a Reconfigurable World
w: www.petalogix.com  p: +61-7-30090663  f: +61-7-30090663