On 11/19/2014 05:58 PM, Abhijit Menon-Sen wrote:
At 2014-11-11 16:56:00 +0530, a...@2ndquadrant.com wrote:

I'm working on this (first speeding up the default calculation using
slice-by-N, then adding support for the SSE4.2 CRC instruction on
top).

I've done the first part in the attached patch, and I'm working on the
second (especially the bits to issue CPUID at startup and decide which
implementation to use).

As a benchmark, I ran pg_xlogdump --stats against 11GB of WAL data (674
segments) generated by running a total of 2M pgbench transactions on a
db initialised with scale factor 25.

That's an interesting choice of workload. That sure is heavy on the CRC calculation, but the speed of pg_xlogdump hardly matters in real life.

With HEAD's CRC code:

bin/pg_xlogdump --stats wal/000000010000000000000001   29.81s user 3.56s system 
77% cpu 43.274 total
bin/pg_xlogdump --stats wal/000000010000000000000001   29.59s user 3.85s system 
75% cpu 44.227 total

With slice-by-4 (a minor variant of the attached patch; the results are
included only for curiosity's sake, but I can post the code if needed):

bin/pg_xlogdump --stats wal/000000010000000000000001   13.52s user 3.82s system 
48% cpu 35.808 total
bin/pg_xlogdump --stats wal/000000010000000000000001   13.34s user 3.96s system 
48% cpu 35.834 total

With slice-by-8 (i.e. the attached patch):

bin/pg_xlogdump --stats wal/000000010000000000000001   7.88s user 3.96s system 
34% cpu 34.414 total
bin/pg_xlogdump --stats wal/000000010000000000000001   7.85s user 4.10s system 
34% cpu 35.001 total

(Note the progressive reduction in user time from ~29s to ~8s.)

Finally, just for comparison, here's what happens when we use the
hardware instruction via gcc's __builtin_ia32_crc32xx intrinsics
(i.e. the additional patch I'm working on):

bin/pg_xlogdump --stats wal/000000010000000000000001   3.33s user 4.79s system 
23% cpu 34.832 total

Impressive!

It would be good to see separate benchmarks on WAL generation, and WAL replay. pg_xlogdump is probably close to WAL replay, but the WAL generation case is somewhat different, as WAL is generated in small pieces, and each piece is accumulated to the CRC separately. The slice-by-X algorithm might be less effective in that scenario. I have no doubt that it's still a lot faster than the byte-at-a-time operation, but would be nice to have numbers on it.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to