On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert <kaffeemons...@googlemail.com> wrote:
> 2011/5/2 Michael Hope <michael.h...@linaro.org>:
> Hi Michael, Linaro Devs
>
>> I see similar numbers.
>
> Great to hear ;)
> Means i'm not totally on the wrong track

Note that I've sent the results to zlib-dev.  The copy to linaro-dev
was forwarded on rather than cross-posting.

>> I wasn't sure what you were using to benchmark this
>
> A little program which contains the different code versions and a test loop.
> As it is written there:
> "10000 * 160000 bytes" means 10000 calls with a 160000 bytes buffer.
> The different lines are from tests with buffer + offset, len - offset
> to test for different alignments
>
> The buffer is filled with 0xff (the worst input for adler32, because
> it may overflow the internal sums earlier then other input, so all
> internal looping must be for 0xff).
> The time is measured with times().
>
> Oh, and if it's not clear, this is only the adler32 speedup, because
> only adler32 is run in a loop.

Any idea on how much time in a zlib decompress is spent in adler32?

>> so I wrote my own little stub that did the
>> seed=0x0CB4B676 version over data from rand().
>
> Yeah, that's also a valid test. Maybe you want to srand(0) or
> something to get a reliable result.

Yip, I had srand(1234) so the results should be repeatable.

>> It's interesting how the slower A8 does better than the A9.  It's
>> probably due to the A8 having wider access to the L2 cache as running
>> the same test but on 16 k of data so that it fits in the L1 cache
>> gives:
>>
>> Cortex-A8: 5.234 s
>> Cortex-A9: 3.969 s
>>
>> The ratio here is 0.760 which is very similar to the ratio between the
>> clock frequencies.
>>
>
> Yepp, cache connection is important. Most Vector units are, at least
> for this task, very fast and only constrained by internal brain damage
> or cache/memory.
> Look at the Altivec numbers:
> http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.html
> As long as it fits into the cache 6.6 speedup, after that 1.3 speedup.

Makes sense.  The A8 has a direct connection to the 256 k of L2 cache
so the 160k x 10,000 test runs as fast as the 16 k x 100,000 test.
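
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of a stub along the lines described above. It is not either of the test programs from this thread. It assumes a plain byte-at-a-time adler32 in place of zlib's or a vectorised one, and the names adler32_ref, fill_random and bench_adler32 are hypothetical. It follows what was said above: data from rand() after a fixed srand(), a buffer + offset, len - offset call to vary alignment, and timing with times(). Reading 0x0CB4B676 as the initial adler value is also an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/times.h>

#define ADLER_MOD 65521u /* largest prime below 2^16 */

/* Reference Adler-32: one byte at a time, reducing after every byte,
 * so the sums can never overflow whatever the input is. */
static uint32_t adler32_ref(uint32_t adler, const unsigned char *buf, size_t len)
{
    uint32_t s1 = adler & 0xffffu;
    uint32_t s2 = (adler >> 16) & 0xffffu;

    for (size_t i = 0; i < len; i++) {
        s1 = (s1 + buf[i]) % ADLER_MOD;
        s2 = (s2 + s1) % ADLER_MOD;
    }
    return (s2 << 16) | s1;
}

/* Fill buf with pseudo-random bytes; a fixed seed makes runs repeatable. */
static void fill_random(unsigned char *buf, size_t len, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < len; i++)
        buf[i] = (unsigned char)(rand() & 0xff);
}

/* Checksum buf + offset, len - offset `calls` times and return the user
 * CPU time in seconds as reported by times(). Requires offset <= len.
 * Each call is seeded with the previous result so the compiler cannot
 * drop the loop; the final checksum is stored in *out. */
static double bench_adler32(const unsigned char *buf, size_t len, size_t offset,
                            unsigned calls, uint32_t seed, uint32_t *out)
{
    struct tms t0, t1;
    uint32_t sum = seed;

    times(&t0);
    for (unsigned i = 0; i < calls; i++)
        sum = adler32_ref(sum, buf + offset, len - offset);
    times(&t1);

    *out = sum;
    return (double)(t1.tms_utime - t0.tms_utime) / (double)sysconf(_SC_CLK_TCK);
}
```

A run matching the "10000 * 160000 bytes" case would allocate 160000 bytes, call fill_random(buf, 160000, 1234), and then call bench_adler32(buf, 160000, offset, 10000, 0x0CB4B676, &sum) for each offset of interest. To measure an optimised implementation, swap the call to adler32_ref for the function under test.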
-- Michael