Attached is a new version. It's rebased on current git master, including BRIN. I've also fixed the laundry list of small things you reported, as well as a bunch of bugs I uncovered during my own testing.

Alvaro: you still have the BRIN WAL-logging code fresh in your memory, so could you take a look at that part of this patch to check that I didn't break it? And also, how do you like the new way of writing that, compared to what you had to do in HEAD?

More below.

On 11/09/2014 11:47 PM, Andres Freund wrote:
On 2014-11-06 17:32:33 +0200, Heikki Linnakangas wrote:
Replying to some of your comments below. The rest were trivial issues that
I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:
* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it? If we ever actually get
   a record with a large number of blocks touched, this is essentially
   O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find the
right registered_buffer struct, might be expensive? If that ever becomes a
problem, a simple fix would be to start the linear search from the end, and
make sure that when you touch a large number of blocks, you do all the
XLogRegisterBufData() calls right after the corresponding
XLogRegisterBuffer() call.

Yes, that was what I was (mildly) worried about. Since you specified a
high limit of buffers I'm sure someone will come up with a use case for
it ;)

I've also thought about having XLogRegisterBuffer() return a pointer to the
struct, and passing it as argument to XLogRegisterBufData. So the pattern in
WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential
to turn XLogRegisterBufData into a macro or inline function, to eliminate
the function call overhead. I played with that a little bit, but the
difference in performance was so small that it didn't seem worth it. But
passing the registered_buffer pointer, like above, might not be a bad thing
anyway.

Yes, that was roughly what I was thinking of as well. It's not all that
pretty, but it generally does seem like a good idea to me anyway.

I ended up doing something different: The block_id is now used as the index into the registered_buffers array. So looking up the right struct is now just &registered_buffers[block_id]. That obviously gets rid of the linear search. It doesn't seem to make any difference in the quick performance testing I've been doing.
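
To illustrate, the lookup in XLogRegisterBufData() now boils down to roughly this. This is a simplified sketch, not the exact patch code; names like max_registered_buffers and the rdata_len bookkeeping are approximations, and the actual chaining of 'data' is omitted:

void
XLogRegisterBufData(uint8 block_id, char *data, int len)
{
    registered_buffer *regbuf;

    /* block_id doubles as the array index, so there is no linear search */
    if (block_id >= max_registered_buffers)
        elog(ERROR, "invalid block_id %d", block_id);
    regbuf = &registered_buffers[block_id];
    if (!regbuf->in_use)
        elog(ERROR, "no block with id %d registered with WAL insertion", block_id);

    /* append this chunk to the block's rdata chain (chaining of 'data' omitted) */
    regbuf->rdata_len += len;
}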

* There are lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
     XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
     oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a
fair amount of new infrastructure.

True.

The vast majority of WAL records contain one,
maybe two, block references, so it's not that expensive to find the right
one, even if you do it several times.

I'm not convinced. It's not an infrequent thing these days to hear
people being bottlenecked by replay. And grovelling repeatedly through
larger records isn't *that* cheap.

I haven't done anything about this. We might want to do that, but I'd like to get this committed now, with all the changes to the AMs, and then spend some time trying to further optimize the WAL format. I might add the "deparsed WAL record" infrastructure as part of that optimization work.
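
If I do, what I have in mind is roughly a per-block struct that the reader fills in once, when the record is read, so that redo routines just index into an array of these instead of re-parsing the record. Something like this, with all names hypothetical:

typedef struct DecodedBlockRef
{
    bool        in_use;     /* was this block_id referenced by the record? */
    RelFileNode rnode;      /* relation the block belongs to */
    ForkNumber  forknum;
    BlockNumber blkno;
    bool        has_image;  /* does the record carry a full-page image? */
    char       *data;       /* per-block data registered with the record */
    uint16      data_len;
} DecodedBlockRef;

A redo routine could then read blocks[1].blkno directly, rather than calling XLogRecGetBlockTag() every time it needs the tag.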

I ran a quick performance test on WAL replay performance yesterday. I ran
pgbench for 1000000 transactions with WAL archiving enabled, and measured
the time it took to replay the generated WAL. I did that with and without
the patch, and I didn't see any big difference in replay times. I also ran
"perf" on the startup process, and the profiles looked identical. I'll do
more comprehensive testing later, with different index types, but I'm
convinced that there is no performance issue in replay that we'd need to
worry about.

Interesting. What checkpoint_segments/timeout and what scale did you
use? Since that heavily influences the average record size, that's quite
relevant...

Scale factor 5, checkpoint_segments=10, checkpoint_timeout=30s. pg_xlogdump --stats says:

Type                N      (%)    Record size      (%)      FPI size      (%)   Combined size      (%)
----                -      ---    -----------      ---      --------      ---   -------------      ---
XLOG             1434 (  0.01)          48912 (  0.00)      10149668 (  0.42)        10198580 (  0.29)
Transaction   4000767 ( 16.24)      176048028 ( 16.75)             0 (  0.00)       176048028 (  5.04)
Storage           232 (  0.00)          11136 (  0.00)             0 (  0.00)           11136 (  0.00)
CLOG              123 (  0.00)           4428 (  0.00)             0 (  0.00)            4428 (  0.00)
Database            2 (  0.00)             96 (  0.00)             0 (  0.00)              96 (  0.00)
Tablespace          0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
MultiXact           0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
RelMap             14 (  0.00)           7784 (  0.00)             0 (  0.00)            7784 (  0.00)
Standby            94 (  0.00)           6016 (  0.00)             0 (  0.00)            6016 (  0.00)
Heap2         3563327 ( 14.47)      142523126 ( 13.56)    1332675761 ( 54.52)      1475198887 ( 42.20)
Heap         16880705 ( 68.54)      726291821 ( 69.09)     972350846 ( 39.78)      1698642667 ( 48.59)
Btree          182603 (  0.74)        6342386 (  0.60)     129238796 (  5.29)       135581182 (  3.88)
Hash                0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
Gin                 0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
Gist                0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
Sequence            0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
SPGist              0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
BRIN                0 (  0.00)              0 (  0.00)             0 (  0.00)               0 (  0.00)
             --------            -----------            ------------           -------------
Total        24629301            1051283733 [30.07%]     2444415071 [69.93%]     3495698804 [100%]

So full-page writes indeed form most of the volume.

I ran this again with full_page_writes=off. With that, the patched version generated about 3% more WAL than master. Replay seems to be a few percentage points slower, too, which can be explained by the increased WAL volume.

Attached is a shell script I used to create the WAL.


My next steps for this are:

1. Do more correctness testing of WAL replay routines. I've spent some time looking at the coverage report generated by "make coverage-html", after running "make installcheck" on a master-server setup, and crafting test cases to hit the redo routines that are not otherwise covered by our regression tests. That has already uncovered a couple of bugs in the patch that I've now fixed. (That's also how I found the bug I just fixed in 1961b1c1)

2. Reduce the WAL volume. Any increase in WAL volume is painful on its own, and also reduces performance in general. This seems to add 1-5%, depending on the workload. There are some padding bytes in the WAL format that I could use. I've refrained from using them up for now, because I consider that cheating: we could have kept the current WAL format and used the padding bytes to make that even smaller/faster. But if we accept that this patch is worth it, even if it leads to some WAL bloat, we'll want to buy that back if we can, so that users won't see a regression.
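(For reference, the padding I'm talking about sits in the fixed-size record header, roughly like this in the new format; field layout sketched from memory, so take the details with a grain of salt:)

typedef struct XLogRecord
{
    uint32      xl_tot_len;   /* total length of the entire record */
    TransactionId xl_xid;     /* xact id */
    XLogRecPtr  xl_prev;      /* pointer to previous record in log */
    uint8       xl_info;      /* flag bits */
    RmgrId      xl_rmid;      /* resource manager for this record */
    /* 2 unused padding bytes here */
    pg_crc32    xl_crc;       /* CRC of this record */
} XLogRecord;
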

If I can find some ways to reduce the WAL volume, I might still commit this patch more or less as it is first, and commit the optimizations separately later. Depends on what those optimizations look like.

- Heikki

Attachment: wal-format-and-api-changes-7.patch.gz
Description: application/gzip

Attachment: replay-perf-test.sh
Description: application/shellscript
