On 07/22/2015 11:18 AM, Simon Riggs wrote:
On 10 July 2015 at 00:06, Tom Lane <t...@sss.pgh.pa.us> wrote:

Andres Freund <and...@anarazel.de> writes:
On 2015-07-06 11:49:54 -0400, Tom Lane wrote:
Rather than reverting cab9a0656c36739f, which would re-introduce a
different performance problem, perhaps we could have COPY create a new
relfilenode when it does this.  That should be safe if the table was
previously empty.

I'm not convinced that cab9a0656c36739f needs to survive in that
form. To me only allowing one COPY to benefit from the wal_level =
minimal optimization has a significantly higher cost than
cab9a0656c36739f.

What evidence have you got to base that value judgement on?

cab9a0656c36739f was based on an actual user complaint, so we have good
evidence that there are people out there who care about the cost of
truncating a table many times in one transaction.  On the other hand,
I know of no evidence that anyone's depending on multiple sequential
COPYs, nor intermixed COPY and INSERT, to be fast.  The original argument
for having this COPY optimization at all was to make restoring pg_dump
scripts in a single transaction fast; and that use-case doesn't care
about anything but a single COPY into a virgin table.


We have to backpatch this fix, so it must be both simple and effective.

Heikki's suggestions may be best, maybe not, but they don't seem
backpatchable.

Tom's suggestion looks good. So does Andres' suggestion. I have coded both.

Thanks. For comparison, I wrote a patch to implement what I had in mind.

When a WAL-skipping COPY begins, we add an entry for that relation in a "pending-fsyncs" hash table. Whenever we perform any action on a heap that would normally be WAL-logged, we check if the relation is in the hash table, and skip WAL-logging if so.

That was a simplified explanation. In reality, when a WAL-skipping COPY begins, we also memorize the current size of the relation. Actions on blocks at or above the old size are not WAL-logged, while actions on lower-numbered blocks are. This ensures that if you did any INSERTs on the table before the COPY, any new actions on blocks already WAL-logged by those INSERTs are WAL-logged too. Conversely, if you perform any INSERTs after the COPY (or during it, from a trigger) and they modify the new pages, those actions are not WAL-logged either. So starting a WAL-skipping COPY splits the relation into two parts: an initial part that is WAL-logged as usual, and a later part that is not. (There is one loose end marked with XXX in the patch: the case where one of the pages involved in a cold UPDATE is below the watermark and the other is above it.)
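
To restate that per-block rule compactly, here is a stand-alone sketch (illustration only -- the struct and helper below are made-up stand-ins for the patch's PendingRelSync entry and smgrIsSyncPending(), not the patch code itself):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;
#define InvalidBlockNumber	((BlockNumber) 0xFFFFFFFF)

/* Illustrative stand-in for the patch's PendingRelSync hash entry */
typedef struct
{
	BlockNumber	sync_above;	/* WAL skipped for blocks >= sync_above */
	bool		truncated;	/* a truncation WAL record was emitted */
} PendingSyncSketch;

/*
 * Does a change to block 'blkno' of a relation need to be WAL-logged?
 * The watermark splits the relation: blocks below 'sync_above' are
 * WAL-logged as usual, blocks at or above it are covered by the
 * at-commit fsync, unless a WAL-logged truncation made skipping unsafe.
 */
static bool
block_needs_wal(const PendingSyncSketch *pending, BlockNumber blkno)
{
	if (pending == NULL)
		return true;	/* no pending sync for this relation */
	if (pending->sync_above == InvalidBlockNumber)
		return true;	/* entry exists, but no skip range registered */
	if (blkno < pending->sync_above)
		return true;	/* block predates the COPY, already WAL-logged */
	if (pending->truncated)
		return true;	/* replaying the truncation would lose the data */
	return false;	/* skip WAL; the at-commit fsync makes it durable */
}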

The actual fsync() has been moved to the end of the transaction, since we now skip WAL-logging of any actions after the COPY as well.

And truncations complicate things further. If we emit a truncation WAL record in the transaction, we record that in the relation's hash table entry as well. All operations on a relation that has been truncated must be WAL-logged as usual, because replaying the truncation record would destroy the data even if we fsync'd it later. But we still optimize for the "BEGIN; CREATE; COPY; TRUNCATE; COPY;" style of pattern: if we truncate a relation that has already been marked for fsync-at-COMMIT, we don't need to WAL-log the truncation either.
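
For illustration again (reusing the made-up PendingSyncSketch from the sketch above, and mirroring the condition this patch adds to RelationTruncate(); the helper name is invented):

/*
 * Does truncating the relation to 'nblocks' blocks need a WAL record?
 * If the relation is marked for fsync-at-commit and the watermark sits
 * at or above the new size -- in the BEGIN; CREATE; COPY; TRUNCATE;
 * COPY; pattern both are zero -- the truncation record is skipped too.
 */
static bool
truncation_needs_wal(const PendingSyncSketch *pending, BlockNumber nblocks)
{
	return pending == NULL ||
		   pending->sync_above == InvalidBlockNumber ||
		   pending->sync_above < nblocks;
}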


This is more invasive than I'd like to backpatch, but I think it's the simplest approach that works, and doesn't disable any of the important optimizations we have.

And what reason is there to think that this would fix all the problems?

I don't think either suggested fix could be claimed to be a great solution,
since there is little principle here, only heuristics. Heikki's solution
would be the only safe way, but it is not backpatchable.

I can't get too excited about a half-fix that leaves you with data corruption in some scenarios.

I wrote a little test script to test all these different scenarios (attached). Both of your patches fail with the script.

- Heikki

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2e3b9d2..9ef688b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2021,12 +2021,6 @@ FreeBulkInsertState(BulkInsertState bistate)
  * The new tuple is stamped with current transaction ID and the specified
  * command ID.
  *
- * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
- * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
- * requires that we arrange that all new tuples go into new pages not
- * containing any tuples from other transactions, and that the relation gets
- * fsync'd before commit.  (See also heap_sync() comments)
- *
  * The HEAP_INSERT_SKIP_FSM option is passed directly to
  * RelationGetBufferForTuple, which see for more info.
  *
@@ -2115,7 +2109,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2315,12 +2309,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	int			ndone;
 	char	   *scratch = NULL;
 	Page		page;
-	bool		needwal;
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
 
-	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
 	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
 												   HEAP_DEFAULT_FILLFACTOR);
 
@@ -2335,7 +2327,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	 * palloc() within a critical section is not safe, so we allocate this
 	 * beforehand.
 	 */
-	if (needwal)
+	if (RelationNeedsWAL(relation))
 		scratch = palloc(BLCKSZ);
 
 	/*
@@ -2401,7 +2393,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		MarkBufferDirty(buffer);
 
 		/* XLOG stuff */
-		if (needwal)
+		if (HeapNeedsWAL(relation, buffer))
 		{
 			XLogRecPtr	recptr;
 			xl_heap_multi_insert *xlrec;
@@ -2888,7 +2880,7 @@ l1:
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_delete xlrec;
 		XLogRecPtr	recptr;
@@ -3755,7 +3747,11 @@ l2:
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	/* FIXME: what if the old page must be WAL-logged, but the new one
+	 * must not?
+	 */
+	if (HeapNeedsWAL(relation, buffer) ||
+		HeapNeedsWAL(relation, newbuf))
 	{
 		XLogRecPtr	recptr;
 
@@ -4626,7 +4622,7 @@ failed:
 	 * (Also, in a PITR log-shipping or 2PC environment, we have to have XLOG
 	 * entries for everything anyway.)
 	 */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, *buffer))
 	{
 		xl_heap_lock xlrec;
 		XLogRecPtr	recptr;
@@ -5258,12 +5254,12 @@ l4:
 		MarkBufferDirty(buf);
 
 		/* XLOG stuff */
-		if (RelationNeedsWAL(rel))
+		if (HeapNeedsWAL(rel, buf))
 		{
 			xl_heap_lock_updated xlrec;
+			Page page = BufferGetPage(buf);
 			XLogRecPtr	recptr;
 			XLogRecData rdata[2];
-			Page		page = BufferGetPage(buf);
 
 			xlrec.target.node = rel->rd_node;
 			xlrec.target.tid = mytup.t_self;
@@ -5405,7 +5401,7 @@ heap_inplace_update(Relation relation, HeapTuple tuple)
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(relation))
+	if (HeapNeedsWAL(relation, buffer))
 	{
 		xl_heap_inplace xlrec;
 		XLogRecPtr	recptr;
@@ -8555,3 +8551,71 @@ heap_sync(Relation rel)
 		heap_close(toastrel, AccessShareLock);
 	}
 }
+
+/*
+ *	heap_register_sync	- register a heap to be synced to disk at commit
+ *
+ * This can be used to skip WAL-logging changes on a relation file that has
+ * been created in the same transaction. After calling this, any changes to
+ * the heap (including TOAST heap if any) in the same transaction will not be
+ * WAL-logged. Instead, the heap contents are flushed to disk at commit,
+ * like heap_sync() does.
+ *
+ * As with heap_sync(), indexes are not touched.
+ */
+void
+heap_register_sync(Relation rel)
+{
+	/* non-WAL-logged tables never need fsync */
+	if (!RelationNeedsWAL(rel))
+		return;
+
+	smgrRegisterPendingSync(rel);
+	if (OidIsValid(rel->rd_rel->reltoastrelid))
+	{
+		Relation	toastrel;
+
+		toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
+		smgrRegisterPendingSync(toastrel);
+		heap_close(toastrel, AccessShareLock);
+	}
+}
+
+/*
+ * Do changes to given heap page need to be WAL-logged?
+ *
+ * This takes into account any previous heap_register_sync() requests.
+ *
+ * Note that it is required to use this before creating any WAL records for
+ * heap pages - it is not merely an optimization. WAL-logging a record
+ * when we have already skipped a previous WAL record for the same page
+ * could lead to failure at WAL replay, as the "before" state expected by
+ * the record might not match what's on disk (this should only be a
+ * problem with full_page_writes=off, though).
+ */
+bool
+HeapNeedsWAL(Relation rel, Buffer buf)
+{
+	/* Temporary and unlogged relations never need WAL */
+	if (!RelationNeedsWAL(rel))
+		return false;
+
+	/*
+	 * If we are going to fsync() the relation at COMMIT, and we have not
+	 * truncated the block away previously, and we have not emitted any WAL
+	 * records for this block yet, we can skip WAL-logging it.
+	 */
+	if (smgrIsSyncPending(rel->rd_node, BufferGetBlockNumber(buf)))
+	{
+		/*
+		 * If a pending fsync() will handle this page, its LSN should be
+		 * invalid. If it's not, we've already emitted a WAL record for this
+		 * block, and all subsequent changes to the block must be WAL-logged
+		 * too.
+		 */
+		Assert(PageGetLSN(BufferGetPage(buf)) == InvalidXLogRecPtr);
+		return false;
+	}
+
+	return true;
+}
diff --git a/src/backend/access/heap/pruneheap.c b/src/backend/access/heap/pruneheap.c
index 06b5488..e342cbb 100644
--- a/src/backend/access/heap/pruneheap.c
+++ b/src/backend/access/heap/pruneheap.c
@@ -250,7 +250,7 @@ heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
 		/*
 		 * Emit a WAL HEAP_CLEAN record showing what we did
 		 */
-		if (RelationNeedsWAL(relation))
+		if (HeapNeedsWAL(relation, buffer))
 		{
 			XLogRecPtr	recptr;
 
diff --git a/src/backend/access/heap/visibilitymap.c b/src/backend/access/heap/visibilitymap.c
index a0c0c7f..7832dee 100644
--- a/src/backend/access/heap/visibilitymap.c
+++ b/src/backend/access/heap/visibilitymap.c
@@ -278,6 +278,7 @@ visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
 		map[mapByte] |= (1 << mapBit);
 		MarkBufferDirty(vmBuf);
 
+		/* XXX: Should we use HeapNeedsWAL here? */
 		if (RelationNeedsWAL(rel))
 		{
 			if (XLogRecPtrIsInvalid(recptr))
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 9f989f8..306c6c1 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1921,6 +1921,9 @@ CommitTransaction(void)
 	/* close large objects before lower-level cleanup */
 	AtEOXact_LargeObject(true);
 
+	/* Flush updates to relations that we didn't WAL-log */
+	smgrDoPendingSyncs(true);
+
 	/*
 	 * Mark serializable transaction as complete for predicate locking
 	 * purposes.  This should be done as late as we can put it and still allow
@@ -2126,6 +2129,9 @@ PrepareTransaction(void)
 	/* close large objects before lower-level cleanup */
 	AtEOXact_LargeObject(true);
 
+	/* Flush updates to relations that we didn't WAL-log */
+	smgrDoPendingSyncs(true);
+
 	/*
 	 * Mark serializable transaction as complete for predicate locking
 	 * purposes.  This should be done as late as we can put it and still allow
@@ -2412,6 +2418,7 @@ AbortTransaction(void)
 	AtAbort_Notify();
 	AtEOXact_RelationMap(false);
 	AtAbort_Twophase();
+	smgrDoPendingSyncs(false);
 
 	/*
 	 * Advertise the fact that we aborted in pg_clog (assuming that we got as
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8e9754c..39306dd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -951,6 +951,15 @@ begin:;
 				}
 				if (dtbuf[i] == InvalidBuffer)
 				{
+					{
+						RelFileNode rnode;
+						ForkNumber forknum;
+						BlockNumber blknum;
+						BufferGetTag(rdt->buffer, &rnode, &forknum, &blknum);
+						if (rnode.relNode >= FirstNormalObjectId && rmid != RM_BTREE_ID)
+							elog(LOG, "WAL-logging update to rel %u block %u (rmid %d info %X)", rnode.relNode, blknum, rmid, info);
+					}
+
 					/* OK, put it in this slot */
 					dtbuf[i] = rdt->buffer;
 					if (doPageWrites && XLogCheckBuffer(rdt, true,
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index c3b2f07..5162904 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -20,6 +20,7 @@
 #include "postgres.h"
 
 #include "access/visibilitymap.h"
+#include "access/transam.h"
 #include "access/xact.h"
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
@@ -27,6 +28,7 @@
 #include "catalog/storage_xlog.h"
 #include "storage/freespace.h"
 #include "storage/smgr.h"
+#include "utils/hsearch.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
@@ -62,6 +64,42 @@ typedef struct PendingRelDelete
 static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
 
 /*
+ * We also track relation files (RelFileNode values) that have been created
+ * in the same transaction, and that have been modified without WAL-logging
+ * the action. When we are about to begin a large operation on the relation,
+ * a PendingRelSync entry is created, and 'sync_above' is set to the current
+ * size of the relation. Any operations on blocks < sync_above need to be
+ * WAL-logged as usual, but for operations on higher blocks, WAL-logging is
+ * skipped. It's important that after WAL-logging has been skipped for a
+ * block, we don't WAL-log any subsequent actions on the same block either.
+ * Replaying the WAL record of the subsequent action might fail otherwise,
+ * as the "before" state of the block might not match, because the earlier
+ * actions were not WAL-logged.
+ *
+ * If a relation is truncated (without creating a new relfilenode), and we
+ * emit a WAL record of the truncation, we cannot skip WAL-logging for that
+ * relation anymore, as replaying the truncation record will destroy all the
+ * data inserted after that. But if we have already decided to skip WAL-logging
+ * changes to a relation, and the relation is truncated, we don't need to
+ * WAL-log the truncation either.
+ *
+ * This mechanism is currently only used by heaps. Indexes are always
+ * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+ * WAL levels we need the WAL for PITR/replication anyway.
+ */
+/* Relations that need to be fsync'd at commit */
+typedef struct PendingRelSync
+{
+	RelFileNode relnode;		/* relation created in same xact */
+	BlockNumber	sync_above;		/* WAL-logging skipped for blocks >= sync_above */
+	bool		truncated;		/* truncation WAL record was written */
+} PendingRelSync;
+
+static HTAB *pendingSyncs = NULL;
+
+static void createPendingSyncsHash(void);
+
+/*
  * RelationCreateStorage
  *		Create physical storage for a relation.
  *
@@ -228,6 +266,8 @@ RelationPreserveStorage(RelFileNode rnode, bool atCommit)
 void
 RelationTruncate(Relation rel, BlockNumber nblocks)
 {
+	PendingRelSync *pending = NULL;
+	bool		found;
 	bool		fsm;
 	bool		vm;
 
@@ -251,6 +291,17 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	if (vm)
 		visibilitymap_truncate(rel, nblocks);
 
+	if (!pendingSyncs)
+		createPendingSyncsHash();
+	pending = (PendingRelSync *) hash_search(pendingSyncs,
+											 (void *) &rel->rd_node,
+											 HASH_ENTER, &found);
+	if (!found)
+	{
+		pending->sync_above = InvalidBlockNumber;
+		pending->truncated = false;
+	}
+
 	/*
 	 * We WAL-log the truncation before actually truncating, which means
 	 * trouble if the truncation fails. If we then crash, the WAL replay
@@ -260,7 +311,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 	 * failure to truncate, that might spell trouble at WAL replay, into a
 	 * certain PANIC.
 	 */
-	if (RelationNeedsWAL(rel))
+	if (RelationNeedsWAL(rel) &&
+		(pending->sync_above == InvalidBlockNumber || pending->sync_above < nblocks))
 	{
 		/*
 		 * Make an XLOG entry reporting the file truncation.
@@ -279,6 +331,9 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 
 		lsn = XLogInsert(RM_SMGR_ID, XLOG_SMGR_TRUNCATE, &rdata);
 
+		if (rel->rd_node.relNode >= FirstNormalObjectId)
+			elog(LOG, "WAL-logged truncation of rel %u to %u blocks", rel->rd_node.relNode, nblocks);
+
 		/*
 		 * Flush, because otherwise the truncation of the main relation might
 		 * hit the disk before the WAL record, and the truncation of the FSM
@@ -288,6 +343,8 @@ RelationTruncate(Relation rel, BlockNumber nblocks)
 		 */
 		if (fsm || vm)
 			XLogFlush(lsn);
+
+		pending->truncated = true;
 	}
 
 	/* Do the real work */
@@ -422,6 +479,142 @@ smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
 	return nrels;
 }
 
+/* create the hash table to track pending at-commit fsyncs */
+static void
+createPendingSyncsHash(void)
+{
+	/* First time through: initialize the hash table */
+	HASHCTL		ctl;
+
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(RelFileNode);
+	ctl.entrysize = sizeof(PendingRelSync);
+	ctl.hash = tag_hash;
+	pendingSyncs = hash_create("pending relation sync table", 5,
+							   &ctl, HASH_ELEM | HASH_FUNCTION);
+}
+
+/*
+ * Remember that the given relation needs to be sync'd at commit, because
+ * we are going to skip WAL-logging subsequent actions on it.
+ */
+void
+smgrRegisterPendingSync(Relation rel)
+{
+	PendingRelSync *pending;
+	bool		found;
+	BlockNumber	nblocks;
+
+	nblocks = RelationGetNumberOfBlocks(rel);
+
+	if (!pendingSyncs)
+		createPendingSyncsHash();
+
+	/* Look up or create an entry */
+	pending = (PendingRelSync *) hash_search(pendingSyncs,
+											 (void *) &rel->rd_node,
+											 HASH_ENTER, &found);
+	if (!found)
+	{
+		pending->truncated = false;
+		pending->sync_above = nblocks;
+
+		if (rel->rd_node.relNode >= FirstNormalObjectId)
+			elog(LOG, "Registering new pending sync for rel %u at block %u", rel->rd_node.relNode, nblocks);
+
+	}
+	else if (pending->sync_above == InvalidBlockNumber)
+	{
+		if (rel->rd_node.relNode >= FirstNormalObjectId)
+			elog(LOG, "Registering pending sync for rel %u at block %u", rel->rd_node.relNode, nblocks);
+		pending->sync_above = nblocks;
+	}
+	else
+		if (rel->rd_node.relNode >= FirstNormalObjectId)
+			elog(LOG, "Not updating pending sync for rel %u at block %u (was %u)", rel->rd_node.relNode, nblocks, pending->sync_above);
+}
+
+/*
+ * Are we going to fsync() this relation at COMMIT, so that we don't need
+ * to WAL-log changes to the given block?
+ */
+bool
+smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno)
+{
+	PendingRelSync *pending;
+	bool		found;
+
+	if (!pendingSyncs)
+		return false;
+
+	pending = (PendingRelSync *) hash_search(pendingSyncs,
+											 (void *) &rnode,
+											 HASH_FIND, &found);
+	if (!found)
+		return false;
+
+	/*
+	 * We have no fsync() pending for this relation, or we have (possibly)
+	 * already emitted WAL records for this block.
+	 */
+	if (pending->sync_above == InvalidBlockNumber ||
+		pending->sync_above > blkno)
+	{
+		if (rnode.relNode >= FirstNormalObjectId)
+			elog(LOG, "Not skipping WAL-logging for rel %u block %u, because sync_above is %u", rnode.relNode, blkno, pending->sync_above);
+		return false;
+	}
+
+	/*
+	 * We have emitted a truncation record for this block.
+	 */
+	if (pending->truncated)
+	{
+		if (rnode.relNode >= FirstNormalObjectId)
+			elog(LOG, "Not skipping WAL-logging for rel %u block %u, because it was truncated", rnode.relNode, blkno);
+		return false;
+	}
+
+	if (rnode.relNode >= FirstNormalObjectId)
+		elog(LOG, "Skipping WAL-logging for rel %u block %u", rnode.relNode, blkno);
+
+	return true;
+}
+
+/*
+ * Sync to disk any relations that we skipped WAL-logging for earlier.
+ */
+void
+smgrDoPendingSyncs(bool isCommit)
+{
+	if (!pendingSyncs)
+		return;
+
+	if (isCommit)
+	{
+		HASH_SEQ_STATUS status;
+		PendingRelSync *pending;
+
+		hash_seq_init(&status, pendingSyncs);
+
+		while ((pending = hash_seq_search(&status)) != NULL)
+		{
+			if (pending->sync_above != InvalidBlockNumber)
+			{
+				FlushRelFileNodeBuffers(pending->relnode, false);
+				/* FlushRelFileNodeBuffers will already have done smgropen() */
+				smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+
+				if (pending->relnode.relNode >= FirstNormalObjectId)
+					elog(LOG, "Syncing rel %u", pending->relnode.relNode);
+			}
+		}
+	}
+
+	hash_destroy(pendingSyncs);
+	pendingSyncs = NULL;
+}
+
 /*
  *	PostPrepare_smgr -- Clean up after a successful PREPARE
  *
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3eba9ef..c6aa608 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -26,6 +26,7 @@
 #include "access/xact.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_type.h"
+#include "catalog/storage.h"
 #include "commands/copy.h"
 #include "commands/defrem.h"
 #include "commands/trigger.h"
@@ -2152,7 +2153,10 @@ CopyFrom(CopyState cstate)
 	{
 		hi_options |= HEAP_INSERT_SKIP_FSM;
 		if (!XLogIsNeeded())
+		{
+			heap_register_sync(cstate->rel);
 			hi_options |= HEAP_INSERT_SKIP_WAL;
+		}
 	}
 
 	/*
@@ -2400,11 +2404,11 @@ CopyFrom(CopyState cstate)
 	FreeExecutorState(estate);
 
 	/*
-	 * If we skipped writing WAL, then we need to sync the heap (but not
-	 * indexes since those use WAL anyway)
+	 * If we skipped writing WAL, then we will sync the heap at the end of
+	 * the transaction. (We used to do it here, but it was later found that,
+	 * to be safe, we must also avoid WAL-logging any subsequent actions on
+	 * the pages we skipped WAL for.) Indexes always use WAL.
 	 */
-	if (hi_options & HEAP_INSERT_SKIP_WAL)
-		heap_sync(cstate->rel);
 
 	return processed;
 }
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 3778d9d..180f2d9 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -932,8 +932,8 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 				heap_execute_freeze_tuple(htup, &frozen[i]);
 			}
 
-			/* Now WAL-log freezing if neccessary */
-			if (RelationNeedsWAL(onerel))
+			/* Now WAL-log freezing if necessary */
+			if (HeapNeedsWAL(onerel, buf))
 			{
 				XLogRecPtr	recptr;
 
@@ -1220,7 +1220,7 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (RelationNeedsWAL(onerel))
+	if (HeapNeedsWAL(onerel, buffer))
 	{
 		XLogRecPtr	recptr;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 18013d5..011bab0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -108,6 +108,7 @@ static volatile BufferDesc *BufferAlloc(SMgrRelation smgr,
 			BufferAccessStrategy strategy,
 			bool *foundPtr);
 static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
+static void FlushRelFileNodeBuffers_internal(SMgrRelation smgr, bool islocal);
 static void AtProcExit_Buffers(int code, Datum arg);
 static int	rnode_comparator(const void *p1, const void *p2);
 
@@ -2426,18 +2427,31 @@ PrintPinnedBufs(void)
 void
 FlushRelationBuffers(Relation rel)
 {
-	int			i;
-	volatile BufferDesc *bufHdr;
-
 	/* Open rel at the smgr level if not already done */
 	RelationOpenSmgr(rel);
 
-	if (RelationUsesLocalBuffers(rel))
+	FlushRelFileNodeBuffers_internal(rel->rd_smgr, RelationUsesLocalBuffers(rel));
+}
+
+void
+FlushRelFileNodeBuffers(RelFileNode rnode, bool islocal)
+{
+	FlushRelFileNodeBuffers_internal(smgropen(rnode, InvalidBackendId), islocal);
+}
+
+static void
+FlushRelFileNodeBuffers_internal(SMgrRelation smgr, bool islocal)
+{
+	RelFileNode rnode = smgr->smgr_rnode.node;
+	int			i;
+	volatile BufferDesc *bufHdr;
+
+	if (islocal)
 	{
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			bufHdr = &LocalBufferDescriptors[i];
-			if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
 				(bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY))
 			{
 				ErrorContextCallback errcallback;
@@ -2453,7 +2467,7 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-				smgrwrite(rel->rd_smgr,
+				smgrwrite(smgr,
 						  bufHdr->tag.forkNum,
 						  bufHdr->tag.blockNum,
 						  localpage,
@@ -2480,16 +2494,16 @@ FlushRelationBuffers(Relation rel)
 		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
 		 * and saves some cycles.
 		 */
-		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
+		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
 			continue;
 
 		LockBufHdr(bufHdr);
-		if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
+		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
 			(bufHdr->flags & BM_VALID) && (bufHdr->flags & BM_DIRTY))
 		{
 			PinBuffer_Locked(bufHdr);
 			LWLockAcquire(bufHdr->content_lock, LW_SHARED);
-			FlushBuffer(bufHdr, rel->rd_smgr);
+			FlushBuffer(bufHdr, smgr);
 			LWLockRelease(bufHdr->content_lock);
 			UnpinBuffer(bufHdr, true);
 		}
@@ -2662,6 +2676,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		XLogRecPtr	lsn = InvalidXLogRecPtr;
 		bool		dirtied = false;
 		bool		delayChkpt = false;
+		RelFileNode rnode;
+		ForkNumber	forknum;
+		BlockNumber blknum;
 
 		/*
 		 * If we need to protect hint bit updates from torn writes, WAL-log a
@@ -2672,7 +2689,9 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 		 * We don't check full_page_writes here because that logic is included
 		 * when we call XLogInsert() since the value changes dynamically.
 		 */
-		if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT))
+		BufferGetTag(buffer, &rnode, &forknum, &blknum);
+		if (XLogHintBitIsNeeded() && (bufHdr->flags & BM_PERMANENT) &&
+			!smgrIsSyncPending(rnode, blknum))
 		{
 			/*
 			 * If we're in recovery we cannot dirty a page because of a hint.
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 493839f..3fdd5a1 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -24,7 +24,7 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL	0x0001
+#define HEAP_INSERT_SKIP_WAL	0x0001 /* obsolete, not used anymore */
 #define HEAP_INSERT_SKIP_FSM	0x0002
 #define HEAP_INSERT_FROZEN		0x0004
 
@@ -161,6 +161,7 @@ extern void simple_heap_update(Relation relation, ItemPointer otid,
 extern void heap_markpos(HeapScanDesc scan);
 extern void heap_restrpos(HeapScanDesc scan);
 
+extern void heap_register_sync(Relation relation);
 extern void heap_sync(Relation relation);
 
 /* in heap/pruneheap.c */
diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h
index 9557486..70c5f75 100644
--- a/src/include/access/heapam_xlog.h
+++ b/src/include/access/heapam_xlog.h
@@ -16,6 +16,7 @@
 
 #include "access/htup.h"
 #include "access/xlog.h"
+#include "catalog/storage.h"
 #include "storage/bufpage.h"
 #include "storage/relfilenode.h"
 #include "utils/relcache.h"
@@ -375,6 +376,8 @@ extern void heap2_redo(XLogRecPtr lsn, XLogRecord *rptr);
 extern void heap2_desc(StringInfo buf, uint8 xl_info, char *rec);
 extern void heap_xlog_logical_rewrite(XLogRecPtr lsn, XLogRecord *r);
 
+extern bool HeapNeedsWAL(Relation rel, Buffer buf);
+
 extern XLogRecPtr log_heap_cleanup_info(RelFileNode rnode,
 					  TransactionId latestRemovedXid);
 extern XLogRecPtr log_heap_clean(Relation reln, Buffer buffer,
diff --git a/src/include/catalog/storage.h b/src/include/catalog/storage.h
index 4b87a36..e56d252 100644
--- a/src/include/catalog/storage.h
+++ b/src/include/catalog/storage.h
@@ -29,6 +29,9 @@ extern void RelationTruncate(Relation rel, BlockNumber nblocks);
  */
 extern void smgrDoPendingDeletes(bool isCommit);
 extern int	smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+extern void smgrRegisterPendingSync(Relation rel);
+extern bool smgrIsSyncPending(RelFileNode rnode, BlockNumber blkno);
+extern void smgrDoPendingSyncs(bool isCommit);
 extern void AtSubCommit_smgr(void);
 extern void AtSubAbort_smgr(void);
 extern void PostPrepare_smgr(void);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 921e4ed..590ab08 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -192,6 +192,7 @@ extern BlockNumber BufferGetBlockNumber(Buffer buffer);
 extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
 								ForkNumber forkNum);
 extern void FlushRelationBuffers(Relation rel);
+extern void FlushRelFileNodeBuffers(RelFileNode rel, bool islocal);
 extern void FlushDatabaseBuffers(Oid dbid);
 extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
 					   ForkNumber forkNum, BlockNumber firstDelBlock);

Attachment: test-wal-minimal.sh
Description: application/shellscript
