John,

I've changed my benchmark to invalidate the cache before every test. My
results are the same. Attached is my test program.


# ./memtest 4096
libc() memcpy bandwidth (align=0, size=4096):
16.99MB/s
armasm() memcpy bandwidth (align=0, size=4096):
40.96MB/s
armasm2() memcpy bandwidth (align=0, size=4096):
40.96MB/s

libc() memcpy bandwidth (align=1, size=4096):
16.99MB/s
armasm() memcpy bandwidth (align=1, size=4096):
20.29MB/s
armasm2() memcpy bandwidth (align=1, size=4096):
37.49MB/s

libc() memcpy bandwidth (align=2, size=4096):
16.99MB/s
armasm() memcpy bandwidth (align=2, size=4096):
20.29MB/s
armasm2() memcpy bandwidth (align=2, size=4096):
37.62MB/s

libc() memcpy bandwidth (align=3, size=4096):
16.99MB/s
armasm() memcpy bandwidth (align=3, size=4096):
20.29MB/s
armasm2() memcpy bandwidth (align=3, size=4096):
37.49MB/s
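
For anyone who wants to repeat this kind of measurement without the attachment, here is a minimal sketch of a cold-cache bandwidth test. It is not the attached memtest program: the names evict_cache, cold_bandwidth and EVICT_BYTES are hypothetical. It also assumes that user space cannot invalidate the data cache directly, so it displaces cached lines by walking a buffer assumed to be much larger than the cache.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Assumption: 4 MiB is far larger than the target's data cache. */
#define EVICT_BYTES ((size_t)4 * 1024 * 1024)

/* Any routine with the memcpy signature can be plugged in here. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Volatile sink so the compiler cannot drop the eviction loop. */
static volatile unsigned char evict_sink;

/* Read and write every byte of a large, initialised buffer so that
 * lines belonging to the test buffers are pushed out of the cache. */
static void evict_cache(unsigned char *evict, size_t len)
{
    unsigned char acc = 0;
    for (size_t i = 0; i < len; i++) {
        evict[i] = (unsigned char)(evict[i] + 1);
        acc ^= evict[i];
    }
    evict_sink = acc;
}

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Returns MB/s (1 MB = 1e6 bytes) averaged over `runs` copies, each
 * preceded by an eviction pass. Returns 0.0 if the timer was too
 * coarse to measure anything. */
static double cold_bandwidth(copy_fn fn, void *dst, const void *src,
                             size_t size, int runs, unsigned char *evict)
{
    double total = 0.0;

    assert(runs > 0);
    for (int r = 0; r < runs; r++) {
        evict_cache(evict, EVICT_BYTES);   /* not included in the timing */
        double t0 = now_sec();
        fn(dst, src, size);
        total += now_sec() - t0;
    }
    if (total <= 0.0)
        return 0.0;
    return ((double)size * (double)runs) / (total * 1e6);
}
```

Calling cold_bandwidth(memcpy, dst + align, src + align, 4096, runs, evict) for align 0 to 3 mirrors the layout of the output above. The evict buffer must be initialised (for example with calloc) and EVICT_BYTES long. A single 4096 byte copy is short compared with typical timer resolution, so a large number of runs is needed for a stable figure.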


Regards,

Vince


On Tue, 2009-03-24 at 23:21 +1000, John Williams wrote:
> Hi,
> 
> I've been watching with half-interest on this thread, and just thought
> I'd throw in a thought I've had.
> 
> Nowhere in this thread does the word "cache" appear - in your
> benchmarks are you invalidating the cache between benchmark runs?  If
> the cache is cold on the first run (which is always the "slower" glibc
> version), and hot on subsequent runs, it will be distorting your
> results.
> 
> Maybe you've factored for this, but I don't think it's been explicitly
> mentioned so far.  It could explain why others are not seeing the same
> dramatic speedups that you are reporting.
> 
> Cheers,
> 
> John
> 
> 2009/3/24 vince <vi...@bluush.com>:
> > Niels,
> >
> > After a closer review of the code, I found that unaligned copies were a
> > lot slower than aligned ones. I've created another version of the routine
> > that will take care of that. Attached to this email, you will find a
> > simple program that I used to test this code. This program tests
> > both aligned and unaligned buffers (src & dst) with the 3 different
> > implementations (libc memcpy, rev1 armasm memcpy, and rev2 armasm memcpy).
> >
> > Here is the output of the program running on an arm9 AT91RM9200 using
> > uClibc-0.9.30 and gcc-4.2.4:
> > armasm is rev1, and armasm2 is rev2
> >
> > # ./memtest 500000
> > 32bit src/dst Aligned test:
> > Testing libc (0x4005a008 <==> 0x40243008 : 500000):
> > 2.996949 sec
> > Testing armasm (0x4005a008 <==> 0x40243008 : 500000):
> > 1.331787 sec
> > Testing armasm2 (0x4005a008 <==> 0x40243008 : 500000):
> > 1.358246 sec
> > The faster routine is armasm
> >
> > 16bit src/dst Aligned test:
> > Testing libc (0x4005a00a <==> 0x4024300a : 500000):
> > 2.983215 sec
> > Testing armasm (0x4005a00a <==> 0x4024300a : 500000):
> > 1.332214 sec
> > Testing armasm2 (0x4005a00a <==> 0x4024300a : 500000):
> > 1.358978 sec
> > The faster routine is armasm
> >
> > 8bit src/dst Aligned test:
> > Testing libc (0x4005a009 <==> 0x40243009 : 500000):
> > 2.982209 sec
> > Testing armasm (0x4005a009 <==> 0x40243009 : 500000):
> > 1.331054 sec
> > Testing armasm2 (0x4005a009 <==> 0x40243009 : 500000):
> > 1.359162 sec
> > The faster routine is armasm
> >
> > 16bit src Aligned test:
> > Testing libc (0x4005a00a <==> 0x40243008 : 500000):
> > 2.983734 sec
> > Testing armasm (0x4005a00a <==> 0x40243008 : 500000):
> > 2.571228 sec
> > Testing armasm2 (0x4005a00a <==> 0x40243008 : 500000):
> > 1.419556 sec
> > The faster routine is armasm2
> >
> > 8bit src Aligned test:
> > Testing libc (0x4005a009 <==> 0x40243008 : 500000):
> > 2.984101 sec
> > Testing armasm (0x4005a009 <==> 0x40243008 : 500000):
> > 2.570343 sec
> > Testing armasm2 (0x4005a009 <==> 0x40243008 : 500000):
> > 1.419525 sec
> > The faster routine is armasm2
> >
> > 16bit dst Aligned test:
> > Testing libc (0x4005a008 <==> 0x4024300a : 500000):
> > 2.983948 sec
> > Testing armasm (0x4005a008 <==> 0x4024300a : 500000):
> > 2.571563 sec
> > Testing armasm2 (0x4005a008 <==> 0x4024300a : 500000):
> > 1.418671 sec
> > The faster routine is armasm2
> >
> > 8bit dst Aligned test:
> > Testing libc (0x4005a008 <==> 0x40243009 : 500000):
> > 2.983521 sec
> > Testing armasm (0x4005a008 <==> 0x40243009 : 500000):
> > 2.571258 sec
> > Testing armasm2 (0x4005a008 <==> 0x40243009 : 500000):
> > 1.418762 sec
> > The faster routine is armasm2
> >
> >
> > As you can see, rev2 works a lot better with unaligned buffers. I will
> > update the patch to DirectFB to include this new version of the routine.
> >
> >
> > As for big-endian, this version will ONLY work on little-endian targets,
> > so a config directive will need to be set for the build to work on
> > big-endian targets. I will include that in the patch.
> >
> > For now, it would be great if I could get some metrics from people to
> > double-check my results.
> >
> > Regards,
> >
> > Vince
> >
> >
> >
> >
> > On Mon, 2009-03-23 at 16:36 +0100, Niels Roest wrote:
> >> Hi Vince,
> >> I'm happy to include the patch.
> >> I just have a few points I'm unclear on; I hope somebody can clear them up.
> >>
> >> (1) memcpy is speed tested with (I think) aligned accesses (based on
> >> D_MALLOC addresses), but I think we'll see a lot of unaligned memcpys
> >> too, and that side of the implementation looks kinda weak. Anyone care
> >> to give some figures for unaligned copies? Have a look at
> >> direct_find_best_memcpy() in lib/direct/memcpy.c, and fidget a bit with
> >> buf1 and buf2.
> >> (2) What happens on a big-endian ARM if I just include the patch? I'm having
> >> trouble finding this dependency in the patch. We will need to fix this, or
> >> put a show stopper somewhere for big-endian, so the patch doesn't break
> >> something.
> >>
> >> Greets
> >> Niels
> >>
> >> vince wrote:
> >> > Hello,
> >> >
> >> > I've been working on improving the performance of DirectFB 1.3.0
> >> > on the ARM platform. The attached patch replaces the default libc
> >> > memcpy with a faster implementation. I've tested this patch using an
> >> > AT91RM9200, but it should work on other ARM targets.
> >> >
> >> > Hope this will be useful to others.
> >> >
> >> > Regards,
> >> >
> >> > Vince
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > _______________________________________________
> >> > directfb-dev mailing list
> >> > directfb-dev@directfb.org
> >> > http://mail.directfb.org/cgi-bin/mailman/listinfo/directfb-dev
> >>
> >>
> >
> >
> >
> 
> 
> 

Attachment: memtest.tar.bz2
Description: application/bzip-compressed-tar

