Re: [HACKERS] WAL format and API changes (9.5)

Heikki Linnakangas Thu, 06 Nov 2014 07:34:36 -0800

Replying to some of your comments below. The rest were trivial issues that I'll just fix.


On 10/30/2014 09:19 PM, Andres Freund wrote:

* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it? If we ever actually get
   a record with a large number of blocks touched this essentially is
   O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find the right registered_buffer struct, might be expensive? If that ever becomes a problem, a simple fix would be to start the linear search from the end, and make sure that when you touch a large number of blocks, you do all the XLogRegisterBufData() calls right after the corresponding XLogRegisterBuffer() call.
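
To illustrate the "search from the end" idea, here is a rough standalone sketch. It is not the patch's code: the struct layout, the fixed-size array and all of the sketch_* names are made up for the example. The helper reports how many entries it examined, so the cost of each lookup is visible.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_BUFFERS 32

/* Hypothetical stand-in for the patch's registered_buffer struct. */
typedef struct
{
    int     block_id;   /* ID the caller passed at registration time */
    size_t  data_len;   /* total bytes of per-buffer data registered */
} sketch_registered_buffer;

static sketch_registered_buffer sketch_buffers[SKETCH_MAX_BUFFERS];
static int sketch_nbuffers = 0;

/* Append a new buffer entry (toy version of XLogRegisterBuffer). */
static void
sketch_register_buffer(int block_id)
{
    assert(sketch_nbuffers < SKETCH_MAX_BUFFERS);
    sketch_buffers[sketch_nbuffers].block_id = block_id;
    sketch_buffers[sketch_nbuffers].data_len = 0;
    sketch_nbuffers++;
}

/*
 * Attach 'len' bytes of data to a registered buffer (toy version of
 * XLogRegisterBufData). The search runs from the newest entry backwards,
 * so data registered right after its buffer is found on the first probe.
 * Returns the number of entries examined.
 */
static int
sketch_register_buf_data(int block_id, size_t len)
{
    int     steps = 0;
    int     i;

    for (i = sketch_nbuffers - 1; i >= 0; i--)
    {
        steps++;
        if (sketch_buffers[i].block_id == block_id)
        {
            sketch_buffers[i].data_len += len;
            return steps;
        }
    }
    assert(!"buffer was never registered");
    return steps;
}
```

With this ordering, a caller that registers each buffer's data immediately after the buffer itself pays one comparison per call, no matter how many buffers the record touches.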

I've also thought about having XLogRegisterBuffer() return the pointer to the struct, and passing it as argument to XLogRegisterBufData. So the pattern in WAL generating code would be like:


registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential to turn XLogRegisterBufData into a macro or inline function, to eliminate the function call overhead. I played with that a little bit, but the difference in performance was so small that it didn't seem worth it. But passing the registered_buffer pointer, like above, might not be a bad thing anyway.
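
A minimal sketch of that handle-passing variant, again as a toy and not the patch: the handle_* names and the struct fields are hypothetical, and the struct is left visible here only so the data-registering function can be inline (a real version would have to decide how much of it to expose).

```c
#include <assert.h>
#include <stddef.h>

#define HANDLE_MAX_BUFFERS 32

/* Hypothetical stand-in for registered_buffer. */
typedef struct handle_buffer
{
    int     block_id;
    int     flags;
    size_t  data_len;   /* total bytes of data attached so far */
} handle_buffer;

static handle_buffer handle_buffers[HANDLE_MAX_BUFFERS];
static int handle_nbuffers = 0;

/* Register a buffer and hand the caller a pointer to its entry. */
static handle_buffer *
handle_register_buffer(int block_id, int flags)
{
    handle_buffer *buf;

    assert(handle_nbuffers < HANDLE_MAX_BUFFERS);
    buf = &handle_buffers[handle_nbuffers++];
    buf->block_id = block_id;
    buf->flags = flags;
    buf->data_len = 0;
    return buf;
}

/*
 * Attach data through the handle. No search is needed, so this is small
 * enough to be an inline function. The toy only tracks the length.
 */
static inline void
handle_register_buf_data(handle_buffer *buf, const char *data, size_t len)
{
    assert(buf != NULL);
    (void) data;
    buf->data_len += len;
}
```

The lookup cost disappears entirely, at the price of every caller having to keep the returned pointer around between the two calls.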

* There's lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
     XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
     oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a fair amount of new infrastructure. The vast majority of WAL records contain one, maybe two, block references, so it's not that expensive to find the right one, even if you do it several times.

I ran a quick performance test on WAL replay performance yesterday. I ran pgbench for 1000000 transactions with WAL archiving enabled, and measured the time it took to replay the generated WAL. I did that with and without the patch, and I didn't see any big difference in replay times. I also ran "perf" on the startup process, and the profiles looked identical. I'll do more comprehensive testing later, with different index types, but I'm convinced that there is no performance issue in replay that we'd need to worry about.

If it mattered, a simple optimization to the above pattern would be to have XLogRecGetBlockTag return true/false, indicating if the block reference existed at all. So you'd do:


if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
    oldblk = newblk;
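
Spelled out as a self-contained toy, that could look roughly like the following. The toy_* types and functions are invented for this example and only mimic the shape of the real accessors; the actual record format is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BLOCK_REFS 4

/* Hypothetical, heavily simplified block reference and record. */
typedef struct
{
    bool        in_use;     /* does the record carry this block ref? */
    uint32_t    blkno;
} toy_block_ref;

typedef struct
{
    toy_block_ref blocks[TOY_MAX_BLOCK_REFS];
} toy_record;

/*
 * Fetch the block number of reference 'block_id'. Returns false, leaving
 * *blkno untouched, if the record has no such block reference.
 */
static bool
toy_get_block_tag(const toy_record *record, int block_id, uint32_t *blkno)
{
    assert(block_id >= 0 && block_id < TOY_MAX_BLOCK_REFS);
    if (!record->blocks[block_id].in_use)
        return false;
    if (blkno != NULL)
        *blkno = record->blocks[block_id].blkno;
    return true;
}

/* The redo-side pattern: one call replaces the "has ref?" + "get" pair. */
static uint32_t
toy_old_block(const toy_record *record, uint32_t newblk)
{
    uint32_t    oldblk;

    if (!toy_get_block_tag(record, 1, &oldblk))
        oldblk = newblk;
    return oldblk;
}
```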

On the other hand, decomposing the WAL record into parts, and passing the decomposed representation to the redo routines would allow us to pack the WAL record format more tightly, as accessing the different parts of the on-disk format wouldn't then need to be particularly fast. For example, I've been thinking that it would be nice to get rid of the alignment padding in XLogRecord, and between the per-buffer data portions. We could copy the data to aligned addresses as part of the decomposition or parsing of the WAL record, so that the redo routines could still assume aligned access.
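
To make the "copy to aligned addresses while decomposing" idea concrete, here is a rough sketch. The packed layout used here (a 2-byte length followed by the payload, repeated with no padding) is purely an assumption for the example and is not the proposed WAL format; the dec_* names are hypothetical as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEC_ALIGN 8
#define DEC_ALIGN_UP(n) (((n) + (DEC_ALIGN - 1)) & ~((size_t) (DEC_ALIGN - 1)))
#define DEC_MAX_CHUNKS 8

/* Decomposed form: each chunk pointer is DEC_ALIGN-aligned. */
typedef struct
{
    int     nchunks;
    char   *data[DEC_MAX_CHUNKS];
    size_t  len[DEC_MAX_CHUNKS];
} dec_record;

/*
 * Walk a tightly packed input and copy every payload into 'scratch' at
 * an aligned offset. 'scratch' itself must be DEC_ALIGN-aligned.
 * Returns 0 on success, -1 if the input is malformed or space runs out.
 */
static int
dec_decompose(const char *packed, size_t packed_len,
              char *scratch, size_t scratch_len, dec_record *out)
{
    size_t  in = 0;
    size_t  off = 0;

    assert(((uintptr_t) scratch % DEC_ALIGN) == 0);
    out->nchunks = 0;
    while (in < packed_len)
    {
        uint16_t    len;

        if (packed_len - in < sizeof(len) || out->nchunks == DEC_MAX_CHUNKS)
            return -1;
        /* The length field may sit at an odd address, so memcpy it out. */
        memcpy(&len, packed + in, sizeof(len));
        in += sizeof(len);
        if (packed_len - in < len || off > scratch_len || scratch_len - off < len)
            return -1;
        memcpy(scratch + off, packed + in, len);
        out->data[out->nchunks] = scratch + off;
        out->len[out->nchunks] = len;
        out->nchunks++;
        in += len;
        off = DEC_ALIGN_UP(off + len);  /* next chunk starts aligned */
    }
    return 0;
}
```

The padding then only exists in the in-memory copy, so the redo routines can keep casting chunk pointers to structs while the stored form stays packed.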

* I wonder if it wouldn't be worthwhile, for the benefit of the FPI
   compression patch, to keep the bkp block data after *all* the
   "headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block headers to the beginning of the WAL record, just after the XLogInsert struct. Somehow it feels weird to have a bunch of header structs sandwiched between the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the XLogRecordBlockData struct, so that would be an easy way to make use of the space currently wasted for alignment padding in XLogRecord.

* I think heap_xlog_update is buggy for wal_level=logical because it
   computes the length of the tuple using
   tuplen = recdataend - recdata;
   But the old primary key/old tuple value might be stored there as
   well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion. That tuplen computation is done on backup block 0's data.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
   to get rid of:
+       /*
+        * The new tuple is normally stored as buffer 0's data. But if
+        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+        * data, after the xl_heap_insert struct.
+        */
+       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+       {
+           data = XLogRecGetData(record) + SizeOfHeapInsert;
+           datalen = record->xl_len - SizeOfHeapInsert;
+       }
+       else
+           data = XLogRecGetBlockData(record, 0, &datalen);
  or have I misunderstood how that works?

Ah, you're right. Actually, the code that writes the WAL record *does* use REGBUF_KEEP_DATA. That was a bug in the redo routine, it should always look into buffer 0's data.


- Heikki



