On 22/10/2015 17:17, Pádraig Brady wrote:
>> > Nice trick indeed.  On the other hand, the first 16 bytes are enough to
>> > rule out 99.99% (number out of thin hair) of the non-zero blocks, so
>> > that's where you want to optimize.  Checking them an unsigned long at a
>> > time, or fetching a few unsigned longs and ORing them together would
>> > probably be the best of both worlds, because you then only use the FPU
>> > in the rare case of a zero buffer.
> Note the above does break early if non zero detected in first 16 bytes.

Yes, but it loops unnecessarily if the non-zero byte is the third or fourth.

> Also I suspect the extra conditions involved in using longs
> for just the first 16 bytes would outweigh the benefits?

Only if your machine cannot do unaligned loads.  If it can, you can
align the length instead of the buffer.  memcmp will take care of
aligning the buffer (with some luck it won't have to, e.g. if buf is
0x12340002 and length = 4094).  On x86 unaligned "unsigned long" loads
are basically free as long as they don't cross a cache line.
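To make that concrete, here is an untested sketch of the combined approach: OR a few word loads covering the first 16 bytes (one branch instead of sixteen), then let memcmp compare the rest of the buffer against itself shifted by 16. The memcpy calls are just the portable idiom for unaligned loads; compilers turn them into plain loads on targets where those are cheap. The function name memeqzero is taken from Rusty's post; the details here are my own sketch, not measured code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if the LENGTH bytes at DATA are all zero.
   Sketch only: assumes unaligned loads are cheap (e.g. x86).  */
static bool
memeqzero (const void *data, size_t length)
{
  const unsigned char *p = data;

  if (length >= 16)
    {
      /* Check the first 16 bytes a word at a time, ORing the loads
         together so a non-zero buffer costs a single branch.  The
         loop works for both 4- and 8-byte unsigned long.  */
      unsigned long acc = 0, w;
      for (size_t i = 0; i < 16; i += sizeof w)
        {
          memcpy (&w, p + i, sizeof w);
          acc |= w;
        }
      if (acc)
        return false;

      /* The first 16 bytes are zero; the whole buffer is zero iff
         every byte equals the byte 16 positions before it.  memcmp
         handles alignment (and is typically heavily optimized).  */
      return memcmp (p, p + 16, length - 16) == 0;
    }

  /* Short buffers: a plain byte loop is good enough.  */
  for (size_t i = 0; i < length; i++)
    if (p[i])
      return false;
  return true;
}
```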

> BTW Rusty has a benchmark framework for this as referenced from:
> http://rusty.ozlabs.org/?p=560

I had missed his benchmark framework, so I wrote another one; here it is:
https://gist.githubusercontent.com/bonzini/9a95b0e02d1ceb60af9e/raw/7bc42ddccdb6c42fea3db58e0539d0443d0e6dc6/memeqzero.c

Paolo
