Re: zlib NEON improvements

Michael Hope Sun, 01 May 2011 18:00:34 -0700

On Mon, May 2, 2011 at 11:13 AM, Jan Seiffert
<kaffeemons...@googlemail.com> wrote:
> 2011/5/2 Michael Hope <michael.h...@linaro.org>:
> Hi Michael, Linaro Devs
>
>>  I see similar numbers.
>
> Great to hear ;)
> Means I'm not totally on the wrong track


Note that I've sent the results to zlib-dev.  The copy to linaro-dev
was forwarded on rather than cross-posted.

>>  I wasn't sure what you were using to benchmark this
>
> A little program which contains the different code versions and a test loop.
> As it is written there:
> "10000 * 160000 bytes" means 10000 calls with a 160000 bytes buffer.
> The different lines are from tests with buffer + offset, len - offset
> to test for different alignments
>
> The buffer is filled with 0xff (the worst input for adler32, because
> it may overflow the internal sums earlier than other input, so all
> internal looping must be sized for 0xff).
> The time is measured with times().
>
> Oh, and if it's not clear, this is only the adler32 speedup, because
> only adler32 is run in a loop.
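
To illustrate why 0xff is the worst case, here is a rough sketch in Python
(not the benchmark program described above). It assumes the usual Adler-32
layout of two running sums reduced modulo 65521, with the modulo deferred
to the end of each block. The names max_deferred_bytes and adler32_deferred
are made up for this example. The bound it computes should match the NMAX
constant (5552) in zlib's adler32.c.

```python
import zlib

BASE = 65521  # largest prime below 2**16, the Adler-32 modulus


def max_deferred_bytes(limit=2**32 - 1):
    """Largest n such that n bytes of 0xff cannot push the second sum
    past `limit` when both sums start at BASE - 1 (the worst case)."""
    n = 0
    # Keep growing n while n + 1 bytes would still fit.
    while 255 * (n + 1) * (n + 2) // 2 + (n + 2) * (BASE - 1) <= limit:
        n += 1
    return n


def adler32_deferred(data, block):
    """Adler-32 with the modulo applied once per `block` bytes.
    Returns the checksum and the largest intermediate value seen, so the
    caller can check that a 32-bit accumulator would have been enough."""
    a, b = 1, 0
    peak = 0
    for start in range(0, len(data), block):
        for byte in data[start:start + block]:
            a += byte
            b += a
        peak = max(peak, b)  # b only grows inside a block
        a %= BASE
        b %= BASE
    return (b << 16) | a, peak


if __name__ == "__main__":
    nmax = max_deferred_bytes()
    buf = b"\xff" * 160000  # same fill and size as the test above
    checksum, peak = adler32_deferred(buf, nmax)
    print(nmax, hex(checksum), peak <= 2**32 - 1)
```

Any other byte value grows the sums more slowly, so a block length that
is safe for 0xff is safe for all input. That is why the inner loops of a
vector version have to be sized for the 0xff case.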

Any idea on how much time in a zlib decompress is spent in adler32?

>> so I wrote my own little stub that did the
>> seed=0x0CB4B676 version over data from rand().
>
> Yeah, that's also a valid test. Maybe you want to srand(0) or
> something to get a reliable result.

Yip, I had srand(1234) so the results should be repeatable.
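
For anyone wanting to try something similar, a rough Python equivalent of
that kind of stub might look like the sketch below. It is not the stub I
ran: make_buffer and bench_adler32 are names invented for this example,
Python's random module will not produce the same bytes as C's rand(), and
treating 0x0CB4B676 as the starting checksum value is an assumption about
what "seed" means here. time.process_time() stands in for times().

```python
import random
import time
import zlib


def make_buffer(size, seed=1234):
    """Fill a buffer with pseudo-random bytes from a fixed seed so that
    every run sees identical input."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(size))


def bench_adler32(buf, calls, start_value=0x0CB4B676):
    """Run zlib.adler32 over `buf` `calls` times, each time starting from
    `start_value`. Returns the checksum and the CPU time used."""
    t0 = time.process_time()
    value = start_value
    for _ in range(calls):
        value = zlib.adler32(buf, start_value)
    elapsed = time.process_time() - t0
    return value, elapsed


if __name__ == "__main__":
    buf = make_buffer(16 * 1024)  # small enough to fit in a typical L1
    checksum, seconds = bench_adler32(buf, 1000)
    print(hex(checksum), seconds)
```

With a fixed seed the checksum is identical from run to run, which gives a
cheap sanity check that two builds are computing the same thing before
comparing their timings.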

>> It's interesting how the slower A8 does better than the A9.  It's
>> probably due to the A8 having wider access to the L2 cache, as running
>> the same test on 16 k of data, so that it fits in the L1 cache,
>> gives:
>>
>> Cortex-A8: 5.234 s
>> Cortex-A9: 3.969 s
>>
>> The ratio here is 0.760 which is very similar to the ratio between the
>> clock frequencies.
>>
>
> Yepp, cache connection is important. Most Vector units are, at least
> for this task, very fast and only constrained by internal brain damage
> or cache/memory.
> Look at the Altivec numbers:
> http://mail.madler.net/pipermail/zlib-devel_madler.net/2011-April/002544.html
> As long as it fits into the cache 6.6 speedup, after that 1.3 speedup.

Makes sense.  The A8 has a direct connection to the 256 k of L2 cache
so the 160 k x 10,000 test runs as fast as the 16 k x 100,000 test.

-- Michael
