On Fri, Feb 24, 2017 at 10:30 PM, Bruce Momjian <br...@momjian.us> wrote:
> On Fri, Feb 24, 2017 at 10:09:50PM +0200, Ants Aasma wrote:
>> On Fri, Feb 24, 2017 at 9:37 PM, Bruce Momjian <br...@momjian.us> wrote:
>> > Oh, that's why we will hopefully eventually change the page checksum
>> > algorithm to use the special CRC32 instruction, and set a new checksum
>> > version --- got it.  I assume there is currently no compile-time way to
>> > do this.
>>
>> Using CRC32 as implemented now for the WAL would be significantly
>> slower than what we have now due to instruction latency. Even the best
>> theoretical implementation using the CRC32 instruction would still be
>> about the same speed as what we have now. I haven't seen anybody
>> working on swapping out the current algorithm. And I don't really see
>> a reason to, it would introduce a load of headaches for no real gain.
>
> Uh, I am confused.  I thought you said we were leaving some performance
> on the table.  What is that?   I thought CRC32 was SSE4.1.  Why is CRC32
> good for the WAL but bad for the page checksums?  What about the WAL
> page images?

The page checksum algorithm was designed to take advantage of CPUs
that provide vectorized 32-bit integer multiplication. On x86 this was
introduced with the SSE4.1 extensions, which generic builds cannot
assume are present, so by default we can't take advantage of the
design. The code is written so that compiler auto-vectorization works
on it, so compiling with the appropriate flags is all that is needed
to get a version that does use vector instructions. However, to enable
it on generic builds, a runtime switch between different levels of
vectorization support is needed. The lack of that switch is what is
leaving the performance on the table.
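
To make the idea of a runtime switch concrete, here is a minimal
sketch. It is not the PostgreSQL code: the lane count, the seed
values, the FNV-1a-style mixing step and all function names are
assumptions chosen for illustration. It relies on the GCC/clang
`target` function attribute and `__builtin_cpu_supports`, and falls
back to the plain build elsewhere. Whether the inner loop actually
gets vectorized is up to the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define N_LANES   32                 /* independent partial sums (assumed) */
#define MIX_PRIME 16777619u          /* 32-bit FNV prime */
#define ROW_BYTES (N_LANES * sizeof(uint32_t))

/*
 * Shared loop body. Lanes never depend on each other, so the inner
 * loop is a candidate for auto-vectorization when the target has a
 * vector 32-bit multiply.
 */
static inline uint32_t checksum_body(const unsigned char *p, size_t len)
{
    uint32_t sums[N_LANES];
    uint32_t row[N_LANES];
    uint32_t result = 0;
    size_t i;
    int j;

    assert(len % ROW_BYTES == 0);    /* whole rows only */

    for (j = 0; j < N_LANES; j++)
        sums[j] = (uint32_t) j + 1;  /* hypothetical seeds */

    for (i = 0; i < len; i += ROW_BYTES)
    {
        memcpy(row, p + i, ROW_BYTES);
        for (j = 0; j < N_LANES; j++)
        {
            uint32_t tmp = sums[j] ^ row[j];

            sums[j] = (tmp * MIX_PRIME) ^ (tmp >> 17);
        }
    }

    /* Fold the partial sums into a single value. */
    for (j = 0; j < N_LANES; j++)
        result ^= sums[j];
    return result;
}

/* Copy compiled with the build's default instruction set. */
static uint32_t checksum_generic(const void *data, size_t len)
{
    return checksum_body(data, len);
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_SSE41_VARIANT 1
/* Same source, but the compiler may use SSE4.1 instructions here. */
__attribute__((target("sse4.1")))
static uint32_t checksum_sse41(const void *data, size_t len)
{
    return checksum_body(data, len);
}
#endif

/* Runtime switch: pick the variant the running CPU supports. */
uint32_t block_checksum(const void *data, size_t len)
{
#ifdef HAVE_SSE41_VARIANT
    if (__builtin_cpu_supports("sse4.1"))
        return checksum_sse41(data, len);
#endif
    return checksum_generic(data, len);
}
```

Both variants compute the same value; only the instructions the
compiler is allowed to emit differ. A real implementation would
probably resolve the choice once at startup (for example through a
function pointer) instead of testing the CPU on every call.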

The page checksum algorithm we have is extremely fast, memcpy fast.
Even without vectorization it is right up there with Murmurhash3a and
xxHash. With vectorization it's 4x faster. And it works this fast on
most modern CPUs, not only Intel. The downside is that it only works
well for large blocks, and only for a fixed power-of-2 size with the
current implementation. WAL page images have the page hole removed, so
they can't easily take advantage of this.

That said, I haven't really seen either the hardware-accelerated CRC32
calculation or the non-vectorized page checksum take a noticeable
amount of time on real-world workloads. The benchmarks presented in
this thread seem to corroborate this observation.

Regards,
Ants Aasma


