Replying to some of your comments below. The rest were trivial issues that I'll just fix.

On 10/30/2014 09:19 PM, Andres Freund wrote:
* Is it really a good idea to separate XLogRegisterBufData() from
   XLogRegisterBuffer() the way you've done it ? If we ever actually get
   a record with a large numbers of blocks touched this issentially is
   O(touched_buffers*data_entries).

Are you worried that the linear search in XLogRegisterBufData(), to find the right registered_buffer struct, might be expensive? If that ever becomes a problem, a simple fix would be to start the linear search from the end, and make sure that when you touch a large number of blocks, you do all the XLogRegisterBufData() calls right after the corresponding XLogRegisterBuffer() call.

I've also though about having XLogRegisterBuffer() return the pointer to the struct, and passing it as argument to XLogRegisterBufData. So the pattern in WAL generating code would be like:

registered_buffer *buf0;

buf0 = XLogRegisterBuffer(0, REGBUF_STANDARD);
XLogRegisterBufData(buf0, data, length);

registered_buffer would be opaque to the callers. That would have potential to turn XLogRegisterBufData into a macro or inline function, to eliminate the function call overhead. I played with that a little bit, but the difference in performance was so small that it didn't seem worth it. But passing the registered_buffer pointer, like above, might not be a bad thing anyway.

* There's lots of functions like XLogRecHasBlockRef() that dig through
   the wal record. A common pattern is something like:
if (XLogRecHasBlockRef(record, 1))
     XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk);
else
     oldblk = newblk;

   I think doing that repeatedly is quite a bad idea. We should parse the
   record once and then use it in a sensible format. Not do it in pieces,
   over and over again. It's not like we ignore backup blocks - so doing
   this lazily doesn't make sense to me.
   Especially as ValidXLogRecord() *already* has parsed the whole damn
   thing once.

Hmm. Adding some kind of a parsed XLogRecord representation would need a fair amount of new infrastructure. Vast majority of WAL records contain one, maybe two, block references, so it's not that expensive to find the right one, even if you do it several times.

I ran a quick performance test on WAL replay performance yesterday. I ran pgbench for 1000000 transactions with WAL archiving enabled, and measured the time it took to replay the generated WAL. I did that with and without the patch, and I didn't see any big difference in replay times. I also ran "perf" on the startup process, and the profiles looked identical. I'll do more comprehensive testing later, with different index types, but I'm convinced that there is no performance issue in replay that we'd need to worry about.

If it mattered, a simple optimization to the above pattern would be to have XLogRecGetBlockTag return true/false, indicating if the block reference existed at all. So you'd do:

if (!XLogRecGetBlockTag(record, 1, NULL, NULL, &oldblk))
    oldblk != newblk;


On the other hand, decomposing the WAL record into parts, and passing the decomposed representation to the redo routines would allow us to pack the WAL record format more tightly, as accessing the different parts of the on-disk format wouldn't then need to be particularly fast. For example, I've been thinking that it would be nice to get rid of the alignment padding in XLogRecord, and between the per-buffer data portions. We could copy the data to aligned addresses as part of the decomposition or parsing of the WAL record, so that the redo routines could still assume aligned access.

* I wonder if it wouldn't be worthwile, for the benefit of the FPI
   compression patch, to keep the bkp block data after *all* the
   "headers". That'd make it easier to just compress the data.

Maybe. If we do that, I'd also be inclined to move all the bkp block headers to the beginning of the WAL record, just after the XLogInsert struct. Somehow it feels weird to have a bunch of header structs sandwiched between the rmgr-data and per-buffer data. Also, 4-byte alignment is enough for the XLogRecordBlockData struct, so that would be an easy way to make use of the space currently wasted for alignment padding in XLogRecord.

* I think heap_xlog_update is buggy for wal_level=logical because it
   computes the length of the tuple using
   tuplen = recdataend - recdata;
   But the old primary key/old tuple value might be stored there as
   well. Afaics that code has to continue to use xl_heap_header_len.

No, the old primary key or tuple is stored in the main data portion. That tuplen computation is done on backup block 0's data.

* It looks to me like insert wal logging could just use REGBUF_KEEP_DATA
   to get rid of:
+       /*
+        * The new tuple is normally stored as buffer 0's data. But if
+        * XLOG_HEAP_CONTAINS_NEW_TUPLE flag is set, it's part of the main
+        * data, after the xl_heap_insert struct.
+        */
+       if (xlrec->flags & XLOG_HEAP_CONTAINS_NEW_TUPLE)
+       {
+           data = XLogRecGetData(record) + SizeOfHeapInsert;
+           datalen = record->xl_len - SizeOfHeapInsert;
+       }
+       else
+           data = XLogRecGetBlockData(record, 0, &datalen);
  or have I misunderstood how that works?

Ah, you're right. Actually, the code that writes the WAL record *does* use REGBUF_KEEP_DATA. That was a bug in the redo routine, it should always look into buffer 0's data.

- Heikki



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to