John,

I've changed my benchmark to invalidate the cache before every test. My results are the same. Attached is my test program.

# ./memtest 4096
libc() memcpy bandwidth (align=0, size=4096): 16.99MB/s
armasm() memcpy bandwidth (align=0, size=4096): 40.96MB/s
armasm2() memcpy bandwidth (align=0, size=4096): 40.96MB/s
libc() memcpy bandwidth (align=1, size=4096): 16.99MB/s
armasm() memcpy bandwidth (align=1, size=4096): 20.29MB/s
armasm2() memcpy bandwidth (align=1, size=4096): 37.49MB/s
libc() memcpy bandwidth (align=2, size=4096): 16.99MB/s
armasm() memcpy bandwidth (align=2, size=4096): 20.29MB/s
armasm2() memcpy bandwidth (align=2, size=4096): 37.62MB/s
libc() memcpy bandwidth (align=3, size=4096): 16.99MB/s
armasm() memcpy bandwidth (align=3, size=4096): 20.29MB/s
armasm2() memcpy bandwidth (align=3, size=4096): 37.49MB/s

Regards,
Vince

On Tue, 2009-03-24 at 23:21 +1000, John Williams wrote:
> Hi,
>
> I've been watching with half-interest on this thread, and just thought
> I'd throw in a thought I've had.
>
> Nowhere in this thread does the word "cache" appear - in your
> benchmarks are you invalidating the cache between benchmark runs? If
> the cache is cold on the first run (which is always the "slower" glibc
> version), and hot on subsequent runs, it will be distorting your
> results.
>
> Maybe you've factored for this, but I don't think it's been explicitly
> mentioned so far. It could explain why others are not seeing the same
> dramatic speedups that you are reporting.
>
> Cheers,
>
> John
>
> 2009/3/24 vince <vi...@bluush.com>:
> > Niels,
> >
> > After a closer review of the code, I found that unaligned copies were a
> > lot slower than aligned ones. I've created another version of the routine
> > that will take care of that. Attached to this email, you will find a
> > simple program that I used to test this code. This program will test
> > both aligned and unaligned (src & dst) copies for the 3 different
> > implementations (libc memcpy, rev1 armasm memcpy, and rev2 armasm memcpy).
> >
> > Here is the output of the program running on an arm9 AT91RM9200 using
> > uClibc-0.9.30 and gcc-4.2.4:
> > armasm is rev1, and armasm2 is rev2
> >
> > # ./memtest 500000
> > 32bit src/dst Aligned test:
> > Testing libc (0x4005a008 <==> 0x40243008 : 500000):
> > 2.996949 sec
> > Testing armasm (0x4005a008 <==> 0x40243008 : 500000):
> > 1.331787 sec
> > Testing armasm2 (0x4005a008 <==> 0x40243008 : 500000):
> > 1.358246 sec
> > The faster routine is armasm
> >
> > 16bit src/dst Aligned test:
> > Testing libc (0x4005a00a <==> 0x4024300a : 500000):
> > 2.983215 sec
> > Testing armasm (0x4005a00a <==> 0x4024300a : 500000):
> > 1.332214 sec
> > Testing armasm2 (0x4005a00a <==> 0x4024300a : 500000):
> > 1.358978 sec
> > The faster routine is armasm
> >
> > 8bit src/dst Aligned test:
> > Testing libc (0x4005a009 <==> 0x40243009 : 500000):
> > 2.982209 sec
> > Testing armasm (0x4005a009 <==> 0x40243009 : 500000):
> > 1.331054 sec
> > Testing armasm2 (0x4005a009 <==> 0x40243009 : 500000):
> > 1.359162 sec
> > The faster routine is armasm
> >
> > 16bit src Aligned test:
> > Testing libc (0x4005a00a <==> 0x40243008 : 500000):
> > 2.983734 sec
> > Testing armasm (0x4005a00a <==> 0x40243008 : 500000):
> > 2.571228 sec
> > Testing armasm2 (0x4005a00a <==> 0x40243008 : 500000):
> > 1.419556 sec
> > The faster routine is armasm2
> >
> > 8bit src Aligned test:
> > Testing libc (0x4005a009 <==> 0x40243008 : 500000):
> > 2.984101 sec
> > Testing armasm (0x4005a009 <==> 0x40243008 : 500000):
> > 2.570343 sec
> > Testing armasm2 (0x4005a009 <==> 0x40243008 : 500000):
> > 1.419525 sec
> > The faster routine is armasm2
> >
> > 16bit dst Aligned test:
> > Testing libc (0x4005a008 <==> 0x4024300a : 500000):
> > 2.983948 sec
> > Testing armasm (0x4005a008 <==> 0x4024300a : 500000):
> > 2.571563 sec
> > Testing armasm2 (0x4005a008 <==> 0x4024300a : 500000):
> > 1.418671 sec
> > The faster routine is armasm2
> >
> > 8bit dst Aligned test:
> > Testing libc (0x4005a008 <==> 0x40243009 : 500000):
> > 2.983521 sec
> > Testing armasm (0x4005a008 <==> 0x40243009 : 500000):
> > 2.571258 sec
> > Testing armasm2 (0x4005a008 <==> 0x40243009 : 500000):
> > 1.418762 sec
> > The faster routine is armasm2
> >
> >
> > As you can see, rev2 works a lot better with unaligned buffers. I will
> > update the patch to DirectFB to include this new version of the routine.
> >
> >
> > As for big-endian: this version will ONLY work on little-endian targets,
> > so a config directive will need to be set for the build to work on
> > big-endian targets. I will include that in the patch.
> >
> > For now, it would be great if I could get some metrics from people to
> > double check my results.
> >
> > Regards,
> >
> > Vince
> >
> > On Mon, 2009-03-23 at 16:36 +0100, Niels Roest wrote:
> >> Hi Vince,
> >> I'm happy to include the patch,
> >> I just have a few unclear points, hope somebody can clear them up..
> >>
> >> (1) memcpy is speed tested with (I think) aligned accesses (based on
> >> D_MALLOC addresses) but I think we'll see a lot of unaligned memcpy's
> >> too, but that side of the implementation looks kinda weak.. Anyone care
> >> to give some figures for unaligned copy? Have a look at
> >> direct_find_best_memcpy() in lib/direct/memcpy.c, and fidget a bit with
> >> buf1 and buf2.
> >> (2) what happens on a big-endian ARM if I just include the patch? Having
> >> trouble finding this dependency in the patch.. Will need to fix this, or
> >> put a show stopper somewhere for big-endian, so the patch doesn't break
> >> something.
> >>
> >> Greets
> >> Niels
> >>
> >> vince wrote:
> >> > Hello,
> >> >
> >> > I've been working on trying to improve the performance of DirectFB 1.3.0
> >> > on the ARM platform. The attached patch will replace the default libc
> >> > memcpy with a faster implementation. I've tested this patch using an
> >> > AT91RM9200, but it should work on other ARM targets.
> >> >
> >> > Hope this will be useful to others.
> >> >
> >> > Regards,
> >> >
> >> > Vince
> >> >
> >> > ------------------------------------------------------------------------
> >> >
> >> > _______________________________________________
> >> > directfb-dev mailing list
> >> > directfb-dev@directfb.org
> >> > http://mail.directfb.org/cgi-bin/mailman/listinfo/directfb-dev

memtest.tar.bz2
Description: application/bzip-compressed-tar
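
The attached archive is not reproduced in this message. For readers who want to repeat the kind of measurement discussed in this thread (evict the cache, then time a copy at a given byte offset), here is a minimal sketch in C. It is not the contents of memtest: the names `evict_cache`, `time_copy`, `bandwidth_mb_per_s` and `copy_fn`, the 1 MB scratch size, the 16-byte stride, and the choice to offset both src and dst by `align` bytes are all assumptions made for illustration. Walking a large buffer only approximates a real cache invalidate, and `clock()` is coarse, so many iterations are needed before the figures mean anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical scratch size, assumed to be much larger than the
 * data cache of the target. */
#define EVICT_BYTES (1u << 20)

/* Any routine with the memcpy signature can be plugged in here. */
typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Touch a large scratch buffer so that lines belonging to src/dst are
 * likely to be evicted. This approximates a cold cache; it is not a
 * real cache invalidate. The 16-byte stride is an assumption. */
void evict_cache(void)
{
    static unsigned char scratch[EVICT_BYTES];
    volatile unsigned char *p = scratch;
    size_t i;

    for (i = 0; i < EVICT_BYTES; i += 16)
        p[i]++;
}

/* Bandwidth in MB/s, taking 1 MB as 1e6 bytes (an assumption).
 * Returns 0.0 when the elapsed time is too small to measure. */
double bandwidth_mb_per_s(size_t bytes, double seconds)
{
    return seconds > 0.0 ? (double)bytes / seconds / 1e6 : 0.0;
}

/* Time `iterations` copies of `size` bytes, with src and dst both offset
 * by `align` bytes from malloc()'s alignment. The cache is evicted
 * before every copy and only the copy itself is timed.
 * Returns CPU seconds, or -1.0 on allocation failure or a bad copy. */
double time_copy(copy_fn fn, size_t size, size_t align, unsigned iterations)
{
    unsigned char *src = malloc(size + align);
    unsigned char *dst = malloc(size + align);
    clock_t total = 0;
    double result = -1.0;
    unsigned i;
    size_t k;

    assert(fn != NULL);
    if (src != NULL && dst != NULL) {
        for (k = 0; k < size + align; k++)
            src[k] = (unsigned char)(k * 31u + 7u);
        memset(dst, 0, size + align);

        for (i = 0; i < iterations; i++) {
            clock_t start;

            evict_cache();
            start = clock();
            fn(dst + align, src + align, size);
            total += clock() - start;
        }

        /* Check that the routine under test actually copied the data. */
        if (memcmp(dst + align, src + align, size) == 0)
            result = (double)total / CLOCKS_PER_SEC;
    }
    free(src);
    free(dst);
    return result;
}
```

With this, something like `bandwidth_mb_per_s(4096 * 1000, time_copy(memcpy, 4096, 1, 1000))` gives a figure of the same form as the `align=1` lines above, though the numbers will depend on the platform and on how well the scratch walk really empties the cache.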