Hello,

At Fri, 08 Sep 2017 16:30:01 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
<horiguchi.kyot...@lab.ntt.co.jp> wrote in 
<20170908.163001.53230385.horiguchi.kyot...@lab.ntt.co.jp>
> > >> 2017-04-13 12:11:27.065 JST [85441] t/102_vacuumdb_stages.pl
> > >> STATEMENT:  ANALYZE;
> > >> 2017-04-13 12:12:25.766 JST [85492] LOG:  BufferNeedsWAL: pendingSyncs
> > >> = 0x0, no_pending_sync = 0
> > >> 
> > >> -       lsn = XLogInsert(RM_SMGR_ID,
> > >> -                        XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
> > >> +           rel->no_pending_sync= false;
> > >> +           rel->pending_sync = pending;
> > >> +       }
> > >> 
> > >> It seems to me that those flags and the pending_sync data should be
> > >> kept in the context of backend process and not be part of the Relation
> > >> data...
> > > 
> > > I understand that the context of "backend process" means
> > > storage.c local. I don't mind the context on which the data is,
> > > but I found only there that can get rid of frequent hash
> > > searching. For pending deletions, just appending to a list is
> > > enough and costs almost nothing, on the other hand pendig syncs
> > > are required to be referenced, sometimes very frequently.
> > > 
> > >> +void
> > >> +RecordPendingSync(Relation rel)
> > >> I don't think that I agree that this should be part of relcache.c. The
> > >> syncs are tracked should be tracked out of the relation context.
> > > 
> > > Yeah.. It's in storage.c in the latest patch. (Sorry for the
> > > duplicate name). I think it is a kind of bond between smgr and
> > > relation.
> > > 
> > >> Seeing how invasive this change is, I would also advocate for this
> > >> patch as only being a HEAD-only change, not many people are
> > >> complaining about this optimization of TRUNCATE missing when wal_level
> > >> = minimal, and this needs a very careful review.
> > > 
> > > Agreed.
> > > 
> > >> Should I code something? Or Horiguchi-san, would you take care of it?
> > >> The previous crash I saw has been taken care of, but it's been really
> > >> some time since I looked at this patch...
> > > 
> > > My point is hash-search on every tuple insertion should be evaded
> > > even if it happens rearely. Once it was a bit apart from your
> > > original patch, but in the latest patch the significant part
> > > (pending-sync hash) is revived from the original one.
> > 
> > This patch has followed along since CF 2016-03, do we think we can reach a
> > conclusion in this CF?  It was marked as "Waiting on Author”, based on
> > developments since in this thread, I’ve changed it back to “Needs Review”
> > again.
> 
> I manged to reload its context into my head. It doesn't apply on
> the current master and needs some amendment. I'm going to work on
> this.

Rebased and slightly modified.

Michael's latest patch, on which this patch piggybacks, seems to
work perfectly. The motive for my addition is to avoid the frequent
hash accesses (specifically, I think, once per tuple modification)
that occur while pending syncs exist. The hash contains at least
six entries.
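
Concretely, the addition caches the hash entry, together with a
negative-lookup flag, in RelationData (these fields are from the
rel.h hunk at the end of the attached patch; the comments here are
mine):

	bool				   no_pending_sync;	/* known to have no
											 * pending sync */
	struct PendingRelSync *pending_sync;	/* cached entry of the
											 * pending-sync hash */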

The attached patch emits extra log messages, to be removed in the
final shape, that show how much the addition reduces hash accesses.
As a basis for judging the worthiness of the additional mechanism,
I show an example set of queries below.

In the log messages, "r" is the relation oid, "b" is the buffer
number, and "hash" is the pointer to the backend-global hash table
for pending syncs. "ent" is the hash entry belonging to the
relation, and "neg" is a flag indicating that the existing
pending-sync hash has no entry for the relation.

=# set log_min_messages to debug2;
=# begin;
=# create table test1(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = (nil), ent=(nil), neg = 0
# relid=2608 buf=55, hash has not been created

=# insert into test1 values ('inserted row');
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = (nil), ent=(nil), neg = 0
# relid=24807, first buffer, hash has not been created

=# copy test1 from '/<somewhere>/copy_data.txt';
> DEBUG:  BufferNeedsWAL(r 24807, b 0): hash = 0x171de00, ent=0x171f390, neg = 0
# hash created, pending sync entry linked, no longer needs hash access
# (repeats for the number of buffers)
COPY 200

=# create table test3(a text primary key);
> DEBUG:  BufferNeedsWAL(r 2608, b 55): hash = 0x171de00, ent=(nil), neg = 1
# no pending sync entry for this relation, no longer needs hash access.

=# insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : not found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 1
# This table no longer needs hash access (repeats for the number of tuples)

=#  truncate test3;
=#  insert into test3 (select a from generate_series(0, 99) a);
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=(nil), neg = 0
> DEBUG:  BufferNeedsWAL: accessing hash : found
> DEBUG:  BufferNeedsWAL(r 24816, b 0): hash = 0x171de00, ent=0x171f340, neg = 0
# This table has pending sync but no longer needs hash access,
#  (repeats for the number of tuples)

The hash is still required to handle relcache invalidation. When
ent = (nil) and neg = 0 but hash != (nil), a hash search is
performed and the previous state is restored.
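
Condensed, the decision logic of BufferNeedsWAL() in the attached
patch works as follows (the full function is in the storage.c hunk
below):

	bool	found;

	if (!RelationNeedsWAL(rel))
		return false;

	/* no hash access needed while no pending sync can exist */
	if (!pendingSyncs || rel->no_pending_sync)
		return true;

	if (!rel->pending_sync)
	{
		/* ent=(nil), neg=0, hash!=(nil): one hash search, then cache */
		rel->pending_sync =
			(PendingRelSync *) hash_search(pendingSyncs,
										   (void *) &rel->rd_node,
										   HASH_FIND, &found);
		if (!found)
		{
			rel->no_pending_sync = true;	/* cache negative result */
			return true;
		}
	}

	/* sync_above and truncated_to then decide for the given block */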

This mechanism replaces most of the hash accesses with a simple
pointer dereference. Hash accesses then occur only after a relation
truncation in the current transaction. In other words, none of this
takes effect unless a table truncation, COPY, CREATE TABLE AS,
ALTER TABLE or REFRESH MATERIALIZED VIEW occurs.
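
For reference, the combinations of the log fields above mean the
following:

  hash    ent    neg | state
  -------------------+-------------------------------------------
  (nil)   (nil)   0  | hash not created; nothing is pending
  valid   (nil)   0  | unknown; a single hash search is performed
  valid   (nil)   1  | cached negative result; no hash access
  valid   valid   0  | cached entry; no hash access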


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 34,39 ****
--- 34,61 ----
   *	  the POSTGRES heap access method used for all POSTGRES
   *	  relations.
   *
+  * WAL CONSIDERATIONS
+  *	  All heap operations are normally WAL-logged, but there are a few
+  *	  exceptions. Temporary and unlogged relations never need to be
+  *	  WAL-logged, but we can also skip WAL-logging for a table that was
+  *	  created in the same transaction, if we don't need WAL for PITR or
+  *	  WAL archival purposes (i.e. if wal_level=minimal), and we fsync()
+  *	  the file to disk at COMMIT instead.
+  *
+  *	  The same-relation optimization is not employed automatically on all
+  *	  updates to a table that was created in the same transaction, because
+  *	  for a small number of changes, it's cheaper to just create the WAL
+  *	  records than fsync()ing the whole relation at COMMIT. It is only
+  *	  worthwhile for (presumably) large operations like COPY, CLUSTER,
+  *	  or VACUUM FULL. Use heap_register_sync() to initiate such an
+  *	  operation; it will cause any subsequent updates to the table to skip
+  *	  WAL-logging, if possible, and cause the heap to be synced to disk at
+  *	  COMMIT.
+  *
+  *	  To make that work, all modifications to heap must use
+  *	  HeapNeedsWAL() to check if WAL-logging is needed in this transaction
+  *	  for the given block.
+  *
   *-------------------------------------------------------------------------
   */
  #include "postgres.h"
***************
*** 56,61 ****
--- 78,84 ----
  #include "access/xlogutils.h"
  #include "catalog/catalog.h"
  #include "catalog/namespace.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "port/atomics.h"
***************
*** 2370,2381 **** ReleaseBulkInsertStatePin(BulkInsertState bistate)
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   *
-  * If the HEAP_INSERT_SKIP_WAL option is specified, the new tuple is not
-  * logged in WAL, even for a non-temp relation.  Safe usage of this behavior
-  * requires that we arrange that all new tuples go into new pages not
-  * containing any tuples from other transactions, and that the relation gets
-  * fsync'd before commit.  (See also heap_sync() comments)
-  *
   * The HEAP_INSERT_SKIP_FSM option is passed directly to
   * RelationGetBufferForTuple, which see for more info.
   *
--- 2393,2398 ----
*** a/src/backend/access/heap/pruneheap.c
--- b/src/backend/access/heap/pruneheap.c
***************
*** 20,25 ****
--- 20,26 ----
  #include "access/htup_details.h"
  #include "access/xlog.h"
  #include "catalog/catalog.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/bufmgr.h"
***************
*** 259,265 **** heap_page_prune(Relation relation, Buffer buffer, TransactionId OldestXmin,
  		/*
  		 * Emit a WAL HEAP_CLEAN record showing what we did
  		 */
! 		if (RelationNeedsWAL(relation))
  		{
  			XLogRecPtr	recptr;
  
--- 260,266 ----
  		/*
  		 * Emit a WAL HEAP_CLEAN record showing what we did
  		 */
! 		if (BufferNeedsWAL(relation, buffer))
  		{
  			XLogRecPtr	recptr;
  
*** a/src/backend/access/heap/rewriteheap.c
--- b/src/backend/access/heap/rewriteheap.c
***************
*** 649,657 **** raw_heap_insert(RewriteState state, HeapTuple tup)
  	}
  	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
  		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
! 										 HEAP_INSERT_SKIP_FSM |
! 										 (state->rs_use_wal ?
! 										  0 : HEAP_INSERT_SKIP_WAL));
  	else
  		heaptup = tup;
  
--- 649,655 ----
  	}
  	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
  		heaptup = toast_insert_or_update(state->rs_new_rel, tup, NULL,
! 										 HEAP_INSERT_SKIP_FSM);
  	else
  		heaptup = tup;
  
*** a/src/backend/access/heap/visibilitymap.c
--- b/src/backend/access/heap/visibilitymap.c
***************
*** 88,93 ****
--- 88,94 ----
  #include "access/heapam_xlog.h"
  #include "access/visibilitymap.h"
  #include "access/xlog.h"
+ #include "catalog/storage.h"
  #include "miscadmin.h"
  #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
***************
*** 307,313 **** visibilitymap_set(Relation rel, BlockNumber heapBlk, Buffer heapBuf,
  		map[mapByte] |= (flags << mapOffset);
  		MarkBufferDirty(vmBuf);
  
! 		if (RelationNeedsWAL(rel))
  		{
  			if (XLogRecPtrIsInvalid(recptr))
  			{
--- 308,314 ----
  		map[mapByte] |= (flags << mapOffset);
  		MarkBufferDirty(vmBuf);
  
! 		if (BufferNeedsWAL(rel, heapBuf))
  		{
  			if (XLogRecPtrIsInvalid(recptr))
  			{
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 2007,2012 **** CommitTransaction(void)
--- 2007,2015 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
+ 	/* Flush updates to relations for which we skipped WAL-logging */
+ 	smgrDoPendingSyncs(true);
+ 
  	/*
  	 * Mark serializable transaction as complete for predicate locking
  	 * purposes.  This should be done as late as we can put it and still allow
***************
*** 2235,2240 **** PrepareTransaction(void)
--- 2238,2246 ----
  	/* close large objects before lower-level cleanup */
  	AtEOXact_LargeObject(true);
  
+ 	/* Flush updates to relations for which we skipped WAL-logging */
+ 	smgrDoPendingSyncs(true);
+ 
  	/*
  	 * Mark serializable transaction as complete for predicate locking
  	 * purposes.  This should be done as late as we can put it and still allow
***************
*** 2548,2553 **** AbortTransaction(void)
--- 2554,2560 ----
  	AtAbort_Notify();
  	AtEOXact_RelationMap(false);
  	AtAbort_Twophase();
+ 	smgrDoPendingSyncs(false);	/* abandon pending syncs */
  
  	/*
  	 * Advertise the fact that we aborted in pg_xact (assuming that we got as
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 29,34 ****
--- 29,35 ----
  #include "catalog/storage_xlog.h"
  #include "storage/freespace.h"
  #include "storage/smgr.h"
+ #include "utils/hsearch.h"
  #include "utils/memutils.h"
  #include "utils/rel.h"
  
***************
*** 64,69 **** typedef struct PendingRelDelete
--- 65,113 ----
  static PendingRelDelete *pendingDeletes = NULL; /* head of linked list */
  
  /*
+  * We also track relation files (RelFileNode values) that have been created
+  * in the same transaction, and that have been modified without WAL-logging
+  * the action (an optimization possible with wal_level=minimal). When we are
+  * about to skip WAL-logging, a PendingRelSync entry is created, and
+  * 'sync_above' is set to the current size of the relation. Any operations
+  * on blocks < sync_above need to be WAL-logged as usual, but for operations
+  * on higher blocks, WAL-logging is skipped.
+  *
+  * NB: after WAL-logging has been skipped for a block, we must not WAL-log
+  * any subsequent actions on the same block either. Replaying the WAL record
+  * of the subsequent action might fail otherwise, as the "before" state of
+  * the block might not match, as the earlier actions were not WAL-logged.
+  * Likewise, after we have WAL-logged an operation for a block, we must
+  * WAL-log any subsequent operations on the same page as well. Replaying
+  * a possible full-page-image from the earlier WAL record would otherwise
+  * revert the page to the old state, even if we sync the relation at end
+  * of transaction.
+  *
+  * If a relation is truncated (without creating a new relfilenode), and we
+  * emit a WAL record of the truncation, we can't skip WAL-logging for any
+  * of the truncated blocks anymore, as replaying the truncation record will
+  * destroy all the data inserted after that. But if we have already decided
+  * to skip WAL-logging changes to a relation, and the relation is truncated,
+  * we don't need to WAL-log the truncation either.
+  *
+  * This mechanism is currently only used by heaps. Indexes are always
+  * WAL-logged. Also, this only applies for wal_level=minimal; with higher
+  * WAL levels we need the WAL for PITR/replication anyway.
+  */
+ typedef struct PendingRelSync
+ {
+ 	RelFileNode relnode;		/* relation created in same xact */
+ 	BlockNumber sync_above;		/* WAL-logging skipped for blocks >=
+ 								 * sync_above */
+ 	BlockNumber truncated_to;	/* truncation WAL record was written */
+ }	PendingRelSync;
+ 
+ /* Relations that need to be fsync'd at commit */
+ static HTAB *pendingSyncs = NULL;
+ 
+ static void createPendingSyncsHash(void);
+ 
+ /*
   * RelationCreateStorage
   *		Create physical storage for a relation.
   *
***************
*** 226,231 **** RelationPreserveStorage(RelFileNode rnode, bool atCommit)
--- 270,277 ----
  void
  RelationTruncate(Relation rel, BlockNumber nblocks)
  {
+ 	PendingRelSync *pending = NULL;
+ 	bool		found;
  	bool		fsm;
  	bool		vm;
  
***************
*** 260,296 **** RelationTruncate(Relation rel, BlockNumber nblocks)
  	 */
  	if (RelationNeedsWAL(rel))
  	{
! 		/*
! 		 * Make an XLOG entry reporting the file truncation.
! 		 */
! 		XLogRecPtr	lsn;
! 		xl_smgr_truncate xlrec;
! 
! 		xlrec.blkno = nblocks;
! 		xlrec.rnode = rel->rd_node;
! 		xlrec.flags = SMGR_TRUNCATE_ALL;
! 
! 		XLogBeginInsert();
! 		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
! 
! 		lsn = XLogInsert(RM_SMGR_ID,
! 						 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
! 
! 		/*
! 		 * Flush, because otherwise the truncation of the main relation might
! 		 * hit the disk before the WAL record, and the truncation of the FSM
! 		 * or visibility map. If we crashed during that window, we'd be left
! 		 * with a truncated heap, but the FSM or visibility map would still
! 		 * contain entries for the non-existent heap pages.
! 		 */
! 		if (fsm || vm)
! 			XLogFlush(lsn);
  	}
  
  	/* Do the real work */
  	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
  }
  
  /*
   *	smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
   *
--- 306,386 ----
  	 */
  	if (RelationNeedsWAL(rel))
  	{
! 		/* no_pending_sync is ignored since a new entry is created here */
! 		if (!rel->pending_sync)
! 		{
! 			if (!pendingSyncs)
! 				createPendingSyncsHash();
! 			elog(DEBUG2, "RelationTruncate: accessing hash");
! 			pending = (PendingRelSync *) hash_search(pendingSyncs,
! 												 (void *) &rel->rd_node,
! 												 HASH_ENTER, &found);
! 			if (!found)
! 			{
! 				pending->sync_above = InvalidBlockNumber;
! 				pending->truncated_to = InvalidBlockNumber;
! 			}
! 
! 			rel->no_pending_sync = false;
! 			rel->pending_sync = pending;
! 		}
! 
! 		if (rel->pending_sync->sync_above == InvalidBlockNumber ||
! 			rel->pending_sync->sync_above < nblocks)
! 		{
! 			/*
! 			 * Make an XLOG entry reporting the file truncation.
! 			 */
! 			XLogRecPtr		lsn;
! 			xl_smgr_truncate xlrec;
! 
! 			xlrec.blkno = nblocks;
! 			xlrec.rnode = rel->rd_node;
! 
! 			XLogBeginInsert();
! 			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
! 
! 			lsn = XLogInsert(RM_SMGR_ID,
! 							 XLOG_SMGR_TRUNCATE | XLR_SPECIAL_REL_UPDATE);
! 
! 			elog(DEBUG2, "WAL-logged truncation of rel %u/%u/%u to %u blocks",
! 				 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
! 				 nblocks);
! 
! 			/*
! 			 * Flush, because otherwise the truncation of the main relation
! 			 * might hit the disk before the WAL record, and the truncation of
! 			 * the FSM or visibility map. If we crashed during that window,
! 			 * we'd be left with a truncated heap, but the FSM or visibility
! 			 * map would still contain entries for the non-existent heap
! 			 * pages.
! 			 */
! 			if (fsm || vm)
! 				XLogFlush(lsn);
! 
! 			rel->pending_sync->truncated_to = nblocks;
! 		}
  	}
  
  	/* Do the real work */
  	smgrtruncate(rel->rd_smgr, MAIN_FORKNUM, nblocks);
  }
  
+ /* create the hash table to track pending at-commit fsyncs */
+ static void
+ createPendingSyncsHash(void)
+ {
+ 	/* First time through: initialize the hash table */
+ 	HASHCTL		ctl;
+ 
+ 	MemSet(&ctl, 0, sizeof(ctl));
+ 	ctl.keysize = sizeof(RelFileNode);
+ 	ctl.entrysize = sizeof(PendingRelSync);
+ 	ctl.hash = tag_hash;
+ 	pendingSyncs = hash_create("pending relation sync table", 5,
+ 							   &ctl, HASH_ELEM | HASH_FUNCTION);
+ }
+ 
  /*
   *	smgrDoPendingDeletes() -- Take care of relation deletes at end of xact.
   *
***************
*** 369,374 **** smgrDoPendingDeletes(bool isCommit)
--- 459,482 ----
  }
  
  /*
+  * RelationRemovePendingSync() -- remove pendingSync entry for a relation
+  */
+ void
+ RelationRemovePendingSync(Relation rel)
+ {
+ 	bool found;
+ 
+ 	rel->pending_sync = NULL;
+ 	rel->no_pending_sync = true;
+ 	if (pendingSyncs)
+ 	{
+ 		elog(DEBUG2, "RelationRemovePendingSync: accessing hash");
+ 		hash_search(pendingSyncs, (void *) &rel->rd_node, HASH_REMOVE, &found);
+ 	}
+ }
+ 
+ 
+ /*
   * smgrGetPendingDeletes() -- Get a list of non-temp relations to be deleted.
   *
   * The return value is the number of relations scheduled for termination.
***************
*** 419,424 **** smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr)
--- 527,696 ----
  	return nrels;
  }
  
+ 
+ /*
+  * Remember that the given relation needs to be sync'd at commit, because we
+  * are going to skip WAL-logging subsequent actions to it.
+  */
+ void
+ RecordPendingSync(Relation rel)
+ {
+ 	bool found = true;
+ 	BlockNumber nblocks;
+ 
+ 	Assert(RelationNeedsWAL(rel));
+ 
+ 	/* ignore no_pending_sync since a new entry is created here */
+ 	if (!rel->pending_sync)
+ 	{
+ 		if (!pendingSyncs)
+ 			createPendingSyncsHash();
+ 
+ 		/* Look up or create an entry */
+ 		rel->no_pending_sync = false;
+ 		elog(DEBUG2, "RecordPendingSync: accessing hash");
+ 		rel->pending_sync =
+ 			(PendingRelSync *) hash_search(pendingSyncs,
+ 										   (void *) &rel->rd_node,
+ 										   HASH_ENTER, &found);
+ 	}
+ 
+ 	nblocks = RelationGetNumberOfBlocks(rel);
+ 	if (!found)
+ 	{
+ 		rel->pending_sync->truncated_to = InvalidBlockNumber;
+ 		rel->pending_sync->sync_above = nblocks;
+ 
+ 		elog(DEBUG2,
+ 			 "registering new pending sync for rel %u/%u/%u at block %u",
+ 			 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ 			 nblocks);
+ 
+ 	}
+ 	else if (rel->pending_sync->sync_above == InvalidBlockNumber)
+ 	{
+ 		elog(DEBUG2, "registering pending sync for rel %u/%u/%u at block %u",
+ 			 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ 			 nblocks);
+ 		rel->pending_sync->sync_above = nblocks;
+ 	}
+ 	else
+ 		elog(DEBUG2,
+ 			 "pending sync for rel %u/%u/%u was already registered at block %u (new %u)",
+ 			 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ 			 rel->pending_sync->sync_above, nblocks);
+ }
+ 
+ /*
+  * Do changes to given heap page need to be WAL-logged?
+  *
+  * This takes into account any previous RecordPendingSync() requests.
+  *
+  * Note that it is required to check this before creating any WAL records for
+  * heap pages - it is not merely an optimization! WAL-logging a record, when
+  * we have already skipped a previous WAL record for the same page could lead
+  * to failure at WAL replay, as the "before" state expected by the record
+  * might not match what's on disk. Also, if the heap was truncated earlier, we
+  * must WAL-log any changes to the once-truncated blocks, because replaying
+  * the truncation record will destroy them.
+  */
+ bool
+ BufferNeedsWAL(Relation rel, Buffer buf)
+ {
+ 	BlockNumber blkno = InvalidBlockNumber;
+ 
+ 	if (!RelationNeedsWAL(rel))
+ 		return false;
+ 
+ 	elog(DEBUG2, "BufferNeedsWAL(r %d, b %d): hash = %p, ent=%p, neg = %d", rel->rd_id, BufferGetBlockNumber(buf), pendingSyncs, rel->pending_sync, rel->no_pending_sync);
+ 	/* no further work if we know there is no pending sync */
+ 	if (!pendingSyncs || rel->no_pending_sync)
+ 		return true;
+ 
+ 	/* do the real work */
+ 	if (!rel->pending_sync)
+ 	{
+ 		bool found = false;
+ 
+ 		/*
+ 		 * Cache the entry in rel.  This relies on the fact that a hash
+ 		 * entry never moves.
+ 		 */
+ 		rel->pending_sync =
+ 			(PendingRelSync *) hash_search(pendingSyncs,
+ 										   (void *) &rel->rd_node,
+ 										   HASH_FIND, &found);
+ 		elog(DEBUG2, "BufferNeedsWAL: accessing hash : %s", found ? "found" : "not found");
+ 		if (!found)
+ 		{
+ 			/* no entry exists; no need to access the hash any longer */
+ 			rel->no_pending_sync = true;
+ 			return true;
+ 		}
+ 	}
+ 
+ 	blkno = BufferGetBlockNumber(buf);
+ 	if (rel->pending_sync->sync_above == InvalidBlockNumber ||
+ 		rel->pending_sync->sync_above > blkno)
+ 	{
+ 		elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because sync_above is %u",
+ 			 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ 			 blkno, rel->pending_sync->sync_above);
+ 		return true;
+ 	}
+ 
+ 	/*
+ 	 * We have emitted a truncation record for this block.
+ 	 */
+ 	if (rel->pending_sync->truncated_to != InvalidBlockNumber &&
+ 		rel->pending_sync->truncated_to <= blkno)
+ 	{
+ 		elog(DEBUG2, "not skipping WAL-logging for rel %u/%u/%u block %u, because it was truncated earlier in the same xact",
+ 			 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ 			 blkno);
+ 		return true;
+ 	}
+ 
+ 	elog(DEBUG2, "skipping WAL-logging for rel %u/%u/%u block %u",
+ 		 rel->rd_node.spcNode, rel->rd_node.dbNode, rel->rd_node.relNode,
+ 		 blkno);
+ 
+ 	return false;
+ }
+ 
+ /*
+  * Sync to disk any relations that we skipped WAL-logging for earlier.
+  */
+ void
+ smgrDoPendingSyncs(bool isCommit)
+ {
+ 	if (!pendingSyncs)
+ 		return;
+ 
+ 	if (isCommit)
+ 	{
+ 		HASH_SEQ_STATUS status;
+ 		PendingRelSync *pending;
+ 
+ 		hash_seq_init(&status, pendingSyncs);
+ 
+ 		while ((pending = hash_seq_search(&status)) != NULL)
+ 		{
+ 			if (pending->sync_above != InvalidBlockNumber)
+ 			{
+ 				FlushRelationBuffersWithoutRelCache(pending->relnode, false);
+ 				smgrimmedsync(smgropen(pending->relnode, InvalidBackendId), MAIN_FORKNUM);
+ 
+ 				elog(DEBUG2, "syncing rel %u/%u/%u", pending->relnode.spcNode,
+ 					 pending->relnode.dbNode, pending->relnode.relNode);
+ 			}
+ 		}
+ 	}
+ 
+ 	hash_destroy(pendingSyncs);
+ 	pendingSyncs = NULL;
+ }
+ 
  /*
   *	PostPrepare_smgr -- Clean up after a successful PREPARE
   *
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 2347,2354 **** CopyFrom(CopyState cstate)
  	 *	- data is being written to relfilenode created in this transaction
  	 * then we can skip writing WAL.  It's safe because if the transaction
  	 * doesn't commit, we'll discard the table (or the new relfilenode file).
! 	 * If it does commit, we'll have done the heap_sync at the bottom of this
! 	 * routine first.
  	 *
  	 * As mentioned in comments in utils/rel.h, the in-same-transaction test
  	 * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
--- 2347,2353 ----
  	 *	- data is being written to relfilenode created in this transaction
  	 * then we can skip writing WAL.  It's safe because if the transaction
  	 * doesn't commit, we'll discard the table (or the new relfilenode file).
! 	 * If it does commit, the heap will be synced at commit.
  	 *
  	 * As mentioned in comments in utils/rel.h, the in-same-transaction test
  	 * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
***************
*** 2380,2386 **** CopyFrom(CopyState cstate)
  	{
  		hi_options |= HEAP_INSERT_SKIP_FSM;
  		if (!XLogIsNeeded())
! 			hi_options |= HEAP_INSERT_SKIP_WAL;
  	}
  
  	/*
--- 2379,2385 ----
  	{
  		hi_options |= HEAP_INSERT_SKIP_FSM;
  		if (!XLogIsNeeded())
! 			heap_register_sync(cstate->rel);
  	}
  
  	/*
***************
*** 2862,2872 **** CopyFrom(CopyState cstate)
  	FreeExecutorState(estate);
  
  	/*
! 	 * If we skipped writing WAL, then we need to sync the heap (but not
! 	 * indexes since those use WAL anyway)
  	 */
- 	if (hi_options & HEAP_INSERT_SKIP_WAL)
- 		heap_sync(cstate->rel);
  
  	return processed;
  }
--- 2861,2871 ----
  	FreeExecutorState(estate);
  
  	/*
! 	 * If we skipped writing WAL, then we will sync the heap at the end of
! 	 * the transaction.  (We used to do it here, but it was later found
! 	 * that, to be safe, we must also avoid WAL-logging any subsequent
! 	 * actions on the pages we skipped WAL for.)  Indexes always use WAL.
  	 */
  
  	return processed;
  }
*** a/src/backend/commands/createas.c
--- b/src/backend/commands/createas.c
***************
*** 567,574 **** intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  	 * We can skip WAL-logging the insertions, unless PITR or streaming
  	 * replication is in use. We can skip the FSM in any case.
  	 */
! 	myState->hi_options = HEAP_INSERT_SKIP_FSM |
! 		(XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
  	myState->bistate = GetBulkInsertState();
  
  	/* Not using WAL requires smgr_targblock be initially invalid */
--- 567,575 ----
  	 * We can skip WAL-logging the insertions, unless PITR or streaming
  	 * replication is in use. We can skip the FSM in any case.
  	 */
! 	if (!XLogIsNeeded())
! 		heap_register_sync(intoRelationDesc);
! 	myState->hi_options = HEAP_INSERT_SKIP_FSM;
  	myState->bistate = GetBulkInsertState();
  
  	/* Not using WAL requires smgr_targblock be initially invalid */
***************
*** 617,625 **** intorel_shutdown(DestReceiver *self)
  
  	FreeBulkInsertState(myState->bistate);
  
! 	/* If we skipped using WAL, must heap_sync before commit */
! 	if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
! 		heap_sync(myState->rel);
  
  	/* close rel, but keep lock until commit */
  	heap_close(myState->rel, NoLock);
--- 618,624 ----
  
  	FreeBulkInsertState(myState->bistate);
  
! 	/* If we skipped using WAL, we will sync the relation at commit */
  
  	/* close rel, but keep lock until commit */
  	heap_close(myState->rel, NoLock);
*** a/src/backend/commands/matview.c
--- b/src/backend/commands/matview.c
***************
*** 477,483 **** transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
  	 */
  	myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
  	if (!XLogIsNeeded())
! 		myState->hi_options |= HEAP_INSERT_SKIP_WAL;
  	myState->bistate = GetBulkInsertState();
  
  	/* Not using WAL requires smgr_targblock be initially invalid */
--- 477,483 ----
  	 */
  	myState->hi_options = HEAP_INSERT_SKIP_FSM | HEAP_INSERT_FROZEN;
  	if (!XLogIsNeeded())
! 		heap_register_sync(transientrel);
  	myState->bistate = GetBulkInsertState();
  
  	/* Not using WAL requires smgr_targblock be initially invalid */
***************
*** 520,528 **** transientrel_shutdown(DestReceiver *self)
  
  	FreeBulkInsertState(myState->bistate);
  
! 	/* If we skipped using WAL, must heap_sync before commit */
! 	if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
! 		heap_sync(myState->transientrel);
  
  	/* close transientrel, but keep lock until commit */
  	heap_close(myState->transientrel, NoLock);
--- 520,526 ----
  
  	FreeBulkInsertState(myState->bistate);
  
! 	/* If we skipped using WAL, we will sync the relation at commit */
  
  	/* close transientrel, but keep lock until commit */
  	heap_close(myState->transientrel, NoLock);
*** a/src/backend/commands/tablecmds.c
--- b/src/backend/commands/tablecmds.c
***************
*** 4357,4364 **** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
  		bistate = GetBulkInsertState();
  
  		hi_options = HEAP_INSERT_SKIP_FSM;
  		if (!XLogIsNeeded())
! 			hi_options |= HEAP_INSERT_SKIP_WAL;
  	}
  	else
  	{
--- 4357,4365 ----
  		bistate = GetBulkInsertState();
  
  		hi_options = HEAP_INSERT_SKIP_FSM;
+ 
  		if (!XLogIsNeeded())
! 			heap_register_sync(newrel);
  	}
  	else
  	{
***************
*** 4624,4631 **** ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
  		FreeBulkInsertState(bistate);
  
  		/* If we skipped writing WAL, then we need to sync the heap. */
- 		if (hi_options & HEAP_INSERT_SKIP_WAL)
- 			heap_sync(newrel);
  
  		heap_close(newrel, NoLock);
  	}
--- 4625,4630 ----
*** a/src/backend/commands/vacuumlazy.c
--- b/src/backend/commands/vacuumlazy.c
***************
*** 891,897 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
  				 * page has been previously WAL-logged, and if not, do that
  				 * now.
  				 */
! 				if (RelationNeedsWAL(onerel) &&
  					PageGetLSN(page) == InvalidXLogRecPtr)
  					log_newpage_buffer(buf, true);
  
--- 891,897 ----
  				 * page has been previously WAL-logged, and if not, do that
  				 * now.
  				 */
! 				if (BufferNeedsWAL(onerel, buf) &&
  					PageGetLSN(page) == InvalidXLogRecPtr)
  					log_newpage_buffer(buf, true);
  
***************
*** 1118,1124 **** lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
  			}
  
  			/* Now WAL-log freezing if necessary */
! 			if (RelationNeedsWAL(onerel))
  			{
  				XLogRecPtr	recptr;
  
--- 1118,1124 ----
  			}
  
  			/* Now WAL-log freezing if necessary */
! 			if (BufferNeedsWAL(onerel, buf))
  			{
  				XLogRecPtr	recptr;
  
***************
*** 1476,1482 **** lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
  	MarkBufferDirty(buffer);
  
  	/* XLOG stuff */
! 	if (RelationNeedsWAL(onerel))
  	{
  		XLogRecPtr	recptr;
  
--- 1476,1482 ----
  	MarkBufferDirty(buffer);
  
  	/* XLOG stuff */
! 	if (BufferNeedsWAL(onerel, buffer))
  	{
  		XLogRecPtr	recptr;
  
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 451,456 **** static BufferDesc *BufferAlloc(SMgrRelation smgr,
--- 451,457 ----
  			BufferAccessStrategy strategy,
  			bool *foundPtr);
  static void FlushBuffer(BufferDesc *buf, SMgrRelation reln);
+ static void FlushRelationBuffers_common(SMgrRelation smgr, bool islocal);
  static void AtProcExit_Buffers(int code, Datum arg);
  static void CheckForBufferLeaks(void);
  static int	rnode_comparator(const void *p1, const void *p2);
***************
*** 3147,3166 **** PrintPinnedBufs(void)
  void
  FlushRelationBuffers(Relation rel)
  {
- 	int			i;
- 	BufferDesc *bufHdr;
- 
  	/* Open rel at the smgr level if not already done */
  	RelationOpenSmgr(rel);
  
! 	if (RelationUsesLocalBuffers(rel))
  	{
  		for (i = 0; i < NLocBuffer; i++)
  		{
  			uint32		buf_state;
  
  			bufHdr = GetLocalBufferDescriptor(i);
! 			if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
  				((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
  				 (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
  			{
--- 3148,3188 ----
  void
  FlushRelationBuffers(Relation rel)
  {
  	/* Open rel at the smgr level if not already done */
  	RelationOpenSmgr(rel);
  
! 	FlushRelationBuffers_common(rel->rd_smgr, RelationUsesLocalBuffers(rel));
! }
! 
! /*
!  * Like FlushRelationBuffers(), but the relation is specified by a
!  * RelFileNode
!  */
! void
! FlushRelationBuffersWithoutRelCache(RelFileNode rnode, bool islocal)
! {
! 	FlushRelationBuffers_common(smgropen(rnode, InvalidBackendId), islocal);
! }
! 
! /*
!  * Code shared between functions FlushRelationBuffers() and
!  * FlushRelationBuffersWithoutRelCache().
!  */
! static void
! FlushRelationBuffers_common(SMgrRelation smgr, bool islocal)
! {
! 	RelFileNode rnode = smgr->smgr_rnode.node;
! 	int			i;
! 	BufferDesc *bufHdr;
! 
! 	if (islocal)
  	{
  		for (i = 0; i < NLocBuffer; i++)
  		{
  			uint32		buf_state;
  
  			bufHdr = GetLocalBufferDescriptor(i);
! 			if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
  				((buf_state = pg_atomic_read_u32(&bufHdr->state)) &
  				 (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
  			{
***************
*** 3177,3183 **** FlushRelationBuffers(Relation rel)
  
  				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
  
! 				smgrwrite(rel->rd_smgr,
  						  bufHdr->tag.forkNum,
  						  bufHdr->tag.blockNum,
  						  localpage,
--- 3199,3205 ----
  
  				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
  
! 				smgrwrite(smgr,
  						  bufHdr->tag.forkNum,
  						  bufHdr->tag.blockNum,
  						  localpage,
***************
*** 3207,3224 **** FlushRelationBuffers(Relation rel)
  		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
  		 * and saves some cycles.
  		 */
! 		if (!RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node))
  			continue;
  
  		ReservePrivateRefCountEntry();
  
  		buf_state = LockBufHdr(bufHdr);
! 		if (RelFileNodeEquals(bufHdr->tag.rnode, rel->rd_node) &&
  			(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
  		{
  			PinBuffer_Locked(bufHdr);
  			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
! 			FlushBuffer(bufHdr, rel->rd_smgr);
  			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
  			UnpinBuffer(bufHdr, true);
  		}
--- 3229,3246 ----
  		 * As in DropRelFileNodeBuffers, an unlocked precheck should be safe
  		 * and saves some cycles.
  		 */
! 		if (!RelFileNodeEquals(bufHdr->tag.rnode, rnode))
  			continue;
  
  		ReservePrivateRefCountEntry();
  
  		buf_state = LockBufHdr(bufHdr);
! 		if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
  			(buf_state & (BM_VALID | BM_DIRTY)) == (BM_VALID | BM_DIRTY))
  		{
  			PinBuffer_Locked(bufHdr);
  			LWLockAcquire(BufferDescriptorGetContentLock(bufHdr), LW_SHARED);
! 			FlushBuffer(bufHdr, smgr);
  			LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
  			UnpinBuffer(bufHdr, true);
  		}
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
***************
*** 72,77 ****
--- 72,78 ----
  #include "optimizer/var.h"
  #include "rewrite/rewriteDefine.h"
  #include "rewrite/rowsecurity.h"
+ #include "storage/bufmgr.h"
  #include "storage/lmgr.h"
  #include "storage/smgr.h"
  #include "utils/array.h"
***************
*** 418,423 **** AllocateRelationDesc(Form_pg_class relp)
--- 419,428 ----
  	/* which we mark as a reference-counted tupdesc */
  	relation->rd_att->tdrefcount = 1;
  
+ 	/* We don't know yet whether a pending sync exists for this relation */
+ 	relation->pending_sync = NULL;
+ 	relation->no_pending_sync = false;
+ 
  	MemoryContextSwitchTo(oldcxt);
  
  	return relation;
***************
*** 2040,2045 **** formrdesc(const char *relationName, Oid relationReltype,
--- 2045,2054 ----
  		relation->rd_rel->relhasindex = true;
  	}
  
+ 	/* We don't know yet whether a pending sync exists for this relation */
+ 	relation->pending_sync = NULL;
+ 	relation->no_pending_sync = false;
+ 
  	/*
  	 * add new reldesc to relcache
  	 */
***************
*** 3364,3369 **** RelationBuildLocalRelation(const char *relname,
--- 3373,3382 ----
  	else
  		rel->rd_rel->relfilenode = relfilenode;
  
+ 	/* newly built relation has no pending sync */
+ 	rel->no_pending_sync = true;
+ 	rel->pending_sync = NULL;
+ 
  	RelationInitLockInfo(rel);	/* see lmgr.c */
  
  	RelationInitPhysicalAddr(rel);
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 25,34 ****
  
  
  /* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_WAL	0x0001
! #define HEAP_INSERT_SKIP_FSM	0x0002
! #define HEAP_INSERT_FROZEN		0x0004
! #define HEAP_INSERT_SPECULATIVE 0x0008
  
  typedef struct BulkInsertStateData *BulkInsertState;
  
--- 25,33 ----
  
  
  /* "options" flag bits for heap_insert */
! #define HEAP_INSERT_SKIP_FSM	0x0001
! #define HEAP_INSERT_FROZEN		0x0002
! #define HEAP_INSERT_SPECULATIVE 0x0004
  
  typedef struct BulkInsertStateData *BulkInsertState;
  
***************
*** 179,184 **** extern void simple_heap_delete(Relation relation, ItemPointer tid);
--- 178,184 ----
  extern void simple_heap_update(Relation relation, ItemPointer otid,
  				   HeapTuple tup);
  
+ extern void heap_register_sync(Relation relation);
  extern void heap_sync(Relation relation);
  extern void heap_update_snapshot(HeapScanDesc scan, Snapshot snapshot);
  
*** a/src/include/catalog/storage.h
--- b/src/include/catalog/storage.h
***************
*** 22,34 **** extern void RelationCreateStorage(RelFileNode rnode, char relpersistence);
  extern void RelationDropStorage(Relation rel);
  extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
! 
  /*
   * These functions used to be in storage/smgr/smgr.c, which explains the
   * naming
   */
  extern void smgrDoPendingDeletes(bool isCommit);
  extern int	smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
  extern void AtSubCommit_smgr(void);
  extern void AtSubAbort_smgr(void);
  extern void PostPrepare_smgr(void);
--- 22,37 ----
  extern void RelationDropStorage(Relation rel);
  extern void RelationPreserveStorage(RelFileNode rnode, bool atCommit);
  extern void RelationTruncate(Relation rel, BlockNumber nblocks);
! extern void RelationRemovePendingSync(Relation rel);
  /*
   * These functions used to be in storage/smgr/smgr.c, which explains the
   * naming
   */
  extern void smgrDoPendingDeletes(bool isCommit);
  extern int	smgrGetPendingDeletes(bool forCommit, RelFileNode **ptr);
+ extern void smgrDoPendingSyncs(bool isCommit);
+ extern void RecordPendingSync(Relation rel);
+ extern bool BufferNeedsWAL(Relation rel, Buffer buf);
  extern void AtSubCommit_smgr(void);
  extern void AtSubAbort_smgr(void);
  extern void PostPrepare_smgr(void);
*** a/src/include/storage/bufmgr.h
--- b/src/include/storage/bufmgr.h
***************
*** 190,195 **** extern BlockNumber RelationGetNumberOfBlocksInFork(Relation relation,
--- 190,197 ----
  								ForkNumber forkNum);
  extern void FlushOneBuffer(Buffer buffer);
  extern void FlushRelationBuffers(Relation rel);
+ extern void FlushRelationBuffersWithoutRelCache(RelFileNode rnode,
+ 									bool islocal);
  extern void FlushDatabaseBuffers(Oid dbid);
  extern void DropRelFileNodeBuffers(RelFileNodeBackend rnode,
  					   ForkNumber forkNum, BlockNumber firstDelBlock);
*** a/src/include/utils/rel.h
--- b/src/include/utils/rel.h
***************
*** 216,221 **** typedef struct RelationData
--- 216,229 ----
  
  	/* use "struct" here to avoid needing to include pgstat.h: */
  	struct PgStat_TableStatus *pgstat_info; /* statistics collection area */
+ 
+ 	/*
+ 	 * no_pending_sync is true if this relation is known to have no pending
+ 	 * sync.  Otherwise, if pending_sync is NULL, a hash search is required
+ 	 * to check for a registered sync.
+ 	 */
+ 	bool				   no_pending_sync;
+ 	struct PendingRelSync *pending_sync;
  } RelationData;
  
  