On Fri, Dec 4, 2015 at 5:10 PM, Andres Freund <and...@anarazel.de> wrote:
> On 2015-12-04 17:00:13 +0900, Michael Paquier wrote:
>> Andres Freund wrote:
>> >>  extern void InitXLogInsert(void);
>> >> diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
>> >> index ad1eb4b..91445f1 100644
>> >> --- a/src/include/catalog/pg_control.h
>> >> +++ b/src/include/catalog/pg_control.h
>> >> @@ -73,6 +73,7 @@ typedef struct CheckPoint
>> >>  #define XLOG_END_OF_RECOVERY                 0x90
>> >>  #define XLOG_FPI_FOR_HINT                            0xA0
>> >>  #define XLOG_FPI                                             0xB0
>> >> +#define XLOG_FPI_FOR_SYNC                            0xC0
>> >
>> >
>> > I'm not a big fan of the XLOG_FPI_FOR_SYNC name. Syncing is a bit too
>> > ambiguous for my taste. How about either naming it XLOG_FPI_FLUSH or
>> > instead adding actual record data and a 'flags' field in there? I
>> > slightly prefer the latter - XLOG_FPI and XLOG_FPI_FOR_HINT really are
>> > different, XLOG_FPI_FOR_SYNC not so much.
>>
>> Let's go for XLOG_FPI_FLUSH.
>
> I think the other way is a bit better, because we can add new flags
> without changing the WAL format.

Hm. On the contrary, I think it would make more sense to have a flag
for FOR_HINT as well. Honestly, those are really the same operation,
and FOR_HINT is only there for statistics purposes.
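
For illustration only, the flags approach could look roughly like the
sketch below. This is standalone C, not the attached patch: the
FPI_FLAG_* names, fpi_record_data and both helpers are hypothetical and
only show why a new bit does not require a new record type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits; these names do not come from the patch. */
#define FPI_FLAG_FOR_HINT	0x01	/* image logged because of a hint bit update */
#define FPI_FLAG_FLUSH		0x02	/* write the page out right after replay */
#define FPI_ALL_FLAGS		(FPI_FLAG_FOR_HINT | FPI_FLAG_FLUSH)

/* Hypothetical record payload: one byte of flags. */
typedef struct fpi_record_data
{
	uint8_t		flags;
} fpi_record_data;

/* Insert side: build the payload from the caller's intent. */
static fpi_record_data
fpi_make(bool for_hint, bool flush)
{
	fpi_record_data rec = {0};

	if (for_hint)
		rec.flags |= FPI_FLAG_FOR_HINT;
	if (flush)
		rec.flags |= FPI_FLAG_FLUSH;
	return rec;
}

/* Redo side: only the flush bit changes how the record is replayed. */
static bool
fpi_needs_flush(const fpi_record_data *rec)
{
	/* Bits outside the known set would mean a format this reader predates. */
	assert((rec->flags & ~FPI_ALL_FLAGS) == 0);
	return (rec->flags & FPI_FLAG_FLUSH) != 0;
}
```

With this layout the hint-bit case is just fpi_make(true, false), and a
later flag only needs another bit in the same byte.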

>> diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
>> index 99337b0..b646101 100644
>> --- a/src/backend/access/brin/brin.c
>> +++ b/src/backend/access/brin/brin.c
>> @@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
>>       brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
>>                                          BRIN_CURRENT_VERSION);
>>       MarkBufferDirty(metabuf);
>> -     log_newpage_buffer(metabuf, false);
>> +
>> +     /*
>> +      * For unlogged relations, this page should be immediately flushed
>> +      * to disk after being replayed. This is necessary to ensure that the
>> +      * initial on-disk state of unlogged relations is preserved when
>> +      * they get reset at the end of recovery.
>> +      */
>> +     log_newpage_buffer(metabuf, false,
>> +             index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
>>       END_CRIT_SECTION();
>
> Maybe write the last sentence as '... as the on disk files are copied
> directly at the end of recovery.'?

Check.

>> @@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state)
>>                                               MAIN_FORKNUM,
>>                                               state->rs_blockno,
>>                                               state->rs_buffer,
>> -                                             true);
>> +                                             true,
>> +                                             false);
>>               RelationOpenSmgr(state->rs_new_rel);
>>
>>               PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
>> @@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
>>                                                       MAIN_FORKNUM,
>>                                                       state->rs_blockno,
>>                                                       page,
>> -                                                     true);
>> +                                                     true,
>> +                                                     false);
>
> Did you verify that that's ok when an unlogged table is clustered/vacuum
> full'ed?

Yep.

>> @@ -181,6 +183,9 @@ xlog_identify(uint8 info)
>>               case XLOG_FPI_FOR_HINT:
>>                       id = "FPI_FOR_HINT";
>>                       break;
>> +             case XLOG_FPI_FLUSH:
>> +                     id = "FPI_FOR_SYNC";
>> +                     break;
>>       }
>
> Old string.

Yeah, that's now completely removed.

>> --- a/src/backend/access/transam/xloginsert.c
>> +++ b/src/backend/access/transam/xloginsert.c
>> @@ -932,10 +932,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
>>   * If the page follows the standard page layout, with a PageHeader and unused
>>   * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
>>   * the unused space to be left out from the WAL record, making it smaller.
>> + *
>> + * If 'is_flush' is set to TRUE, relation will be requested to flush
>> + * immediately its data at replay after replaying this full page image.
>>   */
>
> s/is_flush/flush_immed/? And maybe say that it 'will be flushed to the
> OS immediately after replaying the record'?

s/OS/stable storage?
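
To make the OS versus stable storage distinction concrete, here is a
minimal POSIX sketch. It is not PostgreSQL code and says nothing about
what the storage manager does internally; write_page_sketch and its
parameters are made up for illustration. Without fsync() the data has
only been handed to the kernel; with it, the call returns once the
kernel reports the data as written to the storage device.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'len' bytes at 'offset' in 'path'.
 * If 'durable' is true, also fsync() the file before returning.
 * Returns 0 on success, -1 on failure.
 */
static int
write_page_sketch(const char *path, const void *buf, size_t len,
				  off_t offset, bool durable)
{
	int			fd;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_WRONLY | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* After pwrite() the data is in the OS cache, not necessarily on disk. */
	if (pwrite(fd, buf, len, offset) != (ssize_t) len)
	{
		close(fd);
		return -1;
	}

	/* fsync() asks the kernel to push the file's data to the device. */
	if (durable && fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}

	return close(fd);
}
```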
-- 
Michael
From a25b938d39fe735cf2c4c46a5c54db762510220c Mon Sep 17 00:00:00 2001
From: Michael Paquier <mich...@otacoo.com>
Date: Fri, 4 Dec 2015 16:58:23 +0900
Subject: [PATCH] Ensure consistent on-disk state of UNLOGGED indexes at
 recovery

Indexes of unlogged relations need a consistent initial state on disk at
replay time, so that their replayed pages are found on disk when the end of
recovery resets those relations. This commit extends the XLOG_FPI record
with a set of flags that control flushing of the replayed page, and merges
XLOG_FPI_FOR_HINT into it at the same time.

All index types are affected by this bug when their persistence is unlogged,
with varying severity. Most cases cause errors on promoted standbys when
trying to INSERT new tuples into the parent relations. The worst problem
found was with GIN indexes, where inserting a new tuple caused the system
to remain stuck on a semaphore lock, making it unresponsive.
---
 src/backend/access/brin/brin.c           | 10 +++++++-
 src/backend/access/brin/brin_pageops.c   |  2 +-
 src/backend/access/gin/gininsert.c       | 14 ++++++++---
 src/backend/access/gist/gist.c           |  3 ++-
 src/backend/access/heap/rewriteheap.c    |  6 +++--
 src/backend/access/nbtree/nbtree.c       |  2 +-
 src/backend/access/nbtree/nbtsort.c      |  3 ++-
 src/backend/access/rmgrdesc/xlogdesc.c   | 12 ++++++----
 src/backend/access/spgist/spginsert.c    |  6 ++---
 src/backend/access/transam/xlog.c        | 41 +++++++++++++++++++++++++-------
 src/backend/access/transam/xloginsert.c  | 35 +++++++++++++++++++--------
 src/backend/commands/tablecmds.c         | 15 ++++++++----
 src/backend/commands/vacuumlazy.c        |  2 +-
 src/backend/replication/logical/decode.c |  1 -
 src/include/access/xlog_internal.h       |  8 +++++++
 src/include/access/xloginsert.h          |  6 +++--
 src/include/catalog/pg_control.h         |  3 +--
 17 files changed, 122 insertions(+), 47 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 99337b0..fff48ab 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -682,7 +682,15 @@ brinbuildempty(PG_FUNCTION_ARGS)
 	brin_metapage_init(BufferGetPage(metabuf), BrinGetPagesPerRange(index),
 					   BRIN_CURRENT_VERSION);
 	MarkBufferDirty(metabuf);
-	log_newpage_buffer(metabuf, false);
+
+	/*
+	 * For unlogged relations, this page should be immediately flushed
+	 * to disk after being replayed. This is necessary to ensure that the
+	 * initial on-disk state of unlogged relations is preserved as the
+	 * on-disk files are copied directly at the end of recovery.
+	 */
+	log_newpage_buffer(metabuf, false,
+		index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
 	END_CRIT_SECTION();
 
 	UnlockReleaseBuffer(metabuf);
diff --git a/src/backend/access/brin/brin_pageops.c b/src/backend/access/brin/brin_pageops.c
index f876f62..572fe20 100644
--- a/src/backend/access/brin/brin_pageops.c
+++ b/src/backend/access/brin/brin_pageops.c
@@ -865,7 +865,7 @@ brin_initialize_empty_new_buffer(Relation idxrel, Buffer buffer)
 	page = BufferGetPage(buffer);
 	brin_page_init(page, BRIN_PAGETYPE_REGULAR);
 	MarkBufferDirty(buffer);
-	log_newpage_buffer(buffer, true);
+	log_newpage_buffer(buffer, true, false);
 	END_CRIT_SECTION();
 
 	/*
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 49e9185..17c168a 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -450,14 +450,22 @@ ginbuildempty(PG_FUNCTION_ARGS)
 		ReadBufferExtended(index, INIT_FORKNUM, P_NEW, RBM_NORMAL, NULL);
 	LockBuffer(RootBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/* Initialize and xlog metabuffer and root buffer. */
+	/*
+	 * Initialize and xlog metabuffer and root buffer. For unlogged
+	 * relations, those pages need to be immediately flushed to disk
+	 * after being replayed. This is necessary to ensure that the
+	 * initial on-disk state of unlogged relations is preserved as the
+	 * on-disk files are copied directly at the end of recovery.
+	 */
 	START_CRIT_SECTION();
 	GinInitMetabuffer(MetaBuffer);
 	MarkBufferDirty(MetaBuffer);
-	log_newpage_buffer(MetaBuffer, false);
+	log_newpage_buffer(MetaBuffer, false,
+		index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
 	GinInitBuffer(RootBuffer, GIN_LEAF);
 	MarkBufferDirty(RootBuffer);
-	log_newpage_buffer(RootBuffer, false);
+	log_newpage_buffer(RootBuffer, false,
+		index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
 	END_CRIT_SECTION();
 
 	/* Unlock and release the buffers. */
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 53bccf6..6a20031 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -84,7 +84,8 @@ gistbuildempty(PG_FUNCTION_ARGS)
 	START_CRIT_SECTION();
 	GISTInitBuffer(buffer, F_LEAF);
 	MarkBufferDirty(buffer);
-	log_newpage_buffer(buffer, true);
+	log_newpage_buffer(buffer, true,
+		index->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED);
 	END_CRIT_SECTION();
 
 	/* Unlock and release the buffer */
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 6a6fc3b..e9a9a8f 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -336,7 +336,8 @@ end_heap_rewrite(RewriteState state)
 						MAIN_FORKNUM,
 						state->rs_blockno,
 						state->rs_buffer,
-						true);
+						true,
+						false);
 		RelationOpenSmgr(state->rs_new_rel);
 
 		PageSetChecksumInplace(state->rs_buffer, state->rs_blockno);
@@ -685,7 +686,8 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 							MAIN_FORKNUM,
 							state->rs_blockno,
 							page,
-							true);
+							true,
+							false);
 
 			/*
 			 * Now write the page. We say isTemp = true even if it's not a
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..d211a98 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -206,7 +206,7 @@ btbuildempty(PG_FUNCTION_ARGS)
 			  (char *) metapage, true);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-					BTREE_METAPAGE, metapage, false);
+					BTREE_METAPAGE, metapage, false, true);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index f95f67a..faf611c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -277,7 +277,8 @@ _bt_blwritepage(BTWriteState *wstate, Page page, BlockNumber blkno)
 	if (wstate->btws_use_wal)
 	{
 		/* We use the heap NEWPAGE record type for this */
-		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page, true);
+		log_newpage(&wstate->index->rd_node, MAIN_FORKNUM, blkno, page,
+					true, false);
 	}
 
 	/*
diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 83cc9e8..8f40fe6 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -77,9 +77,14 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 
 		appendStringInfoString(buf, xlrec->rp_name);
 	}
-	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	else if (info == XLOG_FPI)
 	{
-		/* no further information to print */
+		xl_restore_fpi xlrec;
+
+		memcpy(&xlrec, rec, sizeof(xl_restore_fpi));
+		appendStringInfo(buf, "hint bits=%s flush=%s",
+						 xlrec.for_hint_bits ? "true" : "false",
+						 xlrec.is_flush ? "true" : "false");
 	}
 	else if (info == XLOG_BACKUP_END)
 	{
@@ -178,9 +183,6 @@ xlog_identify(uint8 info)
 		case XLOG_FPI:
 			id = "FPI";
 			break;
-		case XLOG_FPI_FOR_HINT:
-			id = "FPI_FOR_HINT";
-			break;
 	}
 
 	return id;
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index bceee8d..0758bfd 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -173,7 +173,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
 			  (char *) page, true);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-					SPGIST_METAPAGE_BLKNO, page, false);
+					SPGIST_METAPAGE_BLKNO, page, false, true);
 
 	/* Likewise for the root page. */
 	SpGistInitPage(page, SPGIST_LEAF);
@@ -183,7 +183,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
 			  (char *) page, true);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-					SPGIST_ROOT_BLKNO, page, true);
+					SPGIST_ROOT_BLKNO, page, true, true);
 
 	/* Likewise for the null-tuples root page. */
 	SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
@@ -193,7 +193,7 @@ spgbuildempty(PG_FUNCTION_ARGS)
 			  (char *) page, true);
 	if (XLogIsNeeded())
 		log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-					SPGIST_NULL_BLKNO, page, true);
+					SPGIST_NULL_BLKNO, page, true, true);
 
 	/*
 	 * An immediate sync is required even if we xlog'd the pages, because the
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 86debf4..a599c2c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9187,8 +9187,8 @@ xlog_redo(XLogReaderState *record)
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 	XLogRecPtr	lsn = record->EndRecPtr;
 
-	/* in XLOG rmgr, backup blocks are only used by XLOG_FPI records */
-	Assert(info == XLOG_FPI || info == XLOG_FPI_FOR_HINT ||
+	/* in XLOG rmgr, backup blocks are only used by XLOG_FPI records */
+	Assert(info == XLOG_FPI ||
 		   !XLogRecHasAnyBlockRefs(record));
 
 	if (info == XLOG_NEXTOID)
@@ -9378,9 +9378,12 @@ xlog_redo(XLogReaderState *record)
 	{
 		/* nothing to do here */
 	}
-	else if (info == XLOG_FPI || info == XLOG_FPI_FOR_HINT)
+	else if (info == XLOG_FPI)
 	{
-		Buffer		buffer;
+		Buffer			buffer;
+		xl_restore_fpi	xlrec;
+
+		memcpy(&xlrec, XLogRecGetData(record), sizeof(xl_restore_fpi));
 
 		/*
 		 * Full-page image (FPI) records contain nothing else but a backup
@@ -9391,14 +9394,34 @@ xlog_redo(XLogReaderState *record)
 		 * resource manager needs to generate conflicts, it has to define a
 		 * separate WAL record type and redo routine.
 		 *
-		 * XLOG_FPI_FOR_HINT records are generated when a page needs to be
-		 * WAL- logged because of a hint bit update. They are only generated
-		 * when checksums are enabled. There is no difference in handling
-		 * XLOG_FPI and XLOG_FPI_FOR_HINT records, they use a different info
-		 * code just to distinguish them for statistics purposes.
+		 * Records flagged with 'for_hint_bits' are generated when a page needs
+		 * to be WAL-logged because of a hint bit update. They are only
+		 * generated when checksums are enabled. Redo handles records the same
+		 * way whether or not this flag is set; it exists only for statistics
+		 * purposes.
+		 *
+		 * Records flagged with 'is_flush' indicate that the page immediately
+		 * needs to be written to disk, not just to shared buffers. This is
+		 * important if the on-disk state is to be authoritative, rather than
+		 * the state in shared buffers, e.g. because on-disk files may later be
+		 * copied directly.
 		 */
 		if (XLogReadBufferForRedo(record, 0, &buffer) != BLK_RESTORED)
 			elog(ERROR, "unexpected XLogReadBufferForRedo result when restoring backup block");
+
+		if (xlrec.is_flush)
+		{
+			RelFileNode	rnode;
+			ForkNumber	forknum;
+			BlockNumber	blkno;
+			SMgrRelation srel;
+
+			(void) XLogRecGetBlockTag(record, 0, &rnode, &forknum, &blkno);
+			srel = smgropen(rnode, InvalidBackendId);
+			smgrwrite(srel, forknum, blkno, BufferGetPage(buffer), false);
+			smgrclose(srel);
+		}
+
 		UnlockReleaseBuffer(buffer);
 	}
 	else if (info == XLOG_BACKUP_END)
diff --git a/src/backend/access/transam/xloginsert.c b/src/backend/access/transam/xloginsert.c
index 925255f..6ef257d 100644
--- a/src/backend/access/transam/xloginsert.c
+++ b/src/backend/access/transam/xloginsert.c
@@ -884,9 +884,13 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 		int			flags;
 		char		copied_buffer[BLCKSZ];
 		char	   *origdata = (char *) BufferGetBlock(buffer);
-		RelFileNode rnode;
-		ForkNumber	forkno;
-		BlockNumber blkno;
+		RelFileNode		rnode;
+		ForkNumber		forkno;
+		BlockNumber		blkno;
+		xl_restore_fpi	xlrec;
+
+		xlrec.for_hint_bits = true;
+		xlrec.is_flush = false;
 
 		/*
 		 * Copy buffer so we don't have to worry about concurrent hint bit or
@@ -907,7 +911,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 			memcpy(copied_buffer, origdata, BLCKSZ);
 
 		XLogBeginInsert();
-
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 		flags = REGBUF_FORCE_IMAGE;
 		if (buffer_std)
 			flags |= REGBUF_STANDARD;
@@ -915,7 +919,7 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
 		BufferGetTag(buffer, &rnode, &forkno, &blkno);
 		XLogRegisterBlock(0, &rnode, forkno, blkno, copied_buffer, flags);
 
-		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI_FOR_HINT);
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
 	}
 
 	return recptr;
@@ -932,19 +936,27 @@ XLogSaveBufferForHint(Buffer buffer, bool buffer_std)
  * If the page follows the standard page layout, with a PageHeader and unused
  * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
  * the unused space to be left out from the WAL record, making it smaller.
+ *
+ * If 'is_flush' is set to TRUE, the page will be flushed to stable storage
+ * immediately after replaying the record.
  */
 XLogRecPtr
 log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
-			Page page, bool page_std)
+			Page page, bool page_std, bool is_flush)
 {
-	int			flags;
-	XLogRecPtr	recptr;
+	int				flags;
+	XLogRecPtr		recptr;
+	xl_restore_fpi	xlrec;
 
 	flags = REGBUF_FORCE_IMAGE;
 	if (page_std)
 		flags |= REGBUF_STANDARD;
 
+	xlrec.for_hint_bits = false;
+	xlrec.is_flush = is_flush;
+
 	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, sizeof(xlrec));
 	XLogRegisterBlock(0, rnode, forkNum, blkno, page, flags);
 	recptr = XLogInsert(RM_XLOG_ID, XLOG_FPI);
 
@@ -969,9 +981,12 @@ log_newpage(RelFileNode *rnode, ForkNumber forkNum, BlockNumber blkno,
  * If the page follows the standard page layout, with a PageHeader and unused
  * space between pd_lower and pd_upper, set 'page_std' to TRUE. That allows
  * the unused space to be left out from the WAL record, making it smaller.
+ *
+ * If 'is_flush' is set to TRUE, the page will be flushed to stable storage
+ * immediately after replaying the record.
  */
 XLogRecPtr
-log_newpage_buffer(Buffer buffer, bool page_std)
+log_newpage_buffer(Buffer buffer, bool page_std, bool is_flush)
 {
 	Page		page = BufferGetPage(buffer);
 	RelFileNode rnode;
@@ -983,7 +998,7 @@ log_newpage_buffer(Buffer buffer, bool page_std)
 
 	BufferGetTag(buffer, &rnode, &forkNum, &blkno);
 
-	return log_newpage(&rnode, forkNum, blkno, page, page_std);
+	return log_newpage(&rnode, forkNum, blkno, page, page_std, is_flush);
 }
 
 /*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ddde72..dbe7ed9 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -9892,9 +9892,14 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 
 	/*
 	 * We need to log the copied data in WAL iff WAL archiving/streaming is
-	 * enabled AND it's a permanent relation.
+	 * enabled AND it's a permanent relation. Unlogged relations need to have
+	 * their INIT_FORKNUM logged as well, and flushed at replay to ensure a
+	 * consistent on-disk state when reset at the end of recovery.
 	 */
-	use_wal = XLogIsNeeded() && relpersistence == RELPERSISTENCE_PERMANENT;
+	use_wal = XLogIsNeeded() &&
+		(relpersistence == RELPERSISTENCE_PERMANENT ||
+		 (relpersistence == RELPERSISTENCE_UNLOGGED &&
+		  forkNum == INIT_FORKNUM));
 
 	nblocks = smgrnblocks(src, forkNum);
 
@@ -9917,10 +9922,12 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
 		/*
 		 * WAL-log the copied page. Unfortunately we don't know what kind of a
 		 * page this is, so we have to log the full page including any unused
-		 * space.
+		 * space. For the same reason, pages part of INIT_FORKNUM are always
+		 * forcibly flushed at replay.
 		 */
 		if (use_wal)
-			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page, false);
+			log_newpage(&dst->smgr_rnode.node, forkNum, blkno, page,
+						false, forkNum == INIT_FORKNUM);
 
 		PageSetChecksumInplace(page, blkno);
 
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index 2429889..b0e3901 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -736,7 +736,7 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 				 */
 				if (RelationNeedsWAL(onerel) &&
 					PageGetLSN(page) == InvalidXLogRecPtr)
-					log_newpage_buffer(buf, true);
+					log_newpage_buffer(buf, true, false);
 
 				PageSetAllVisible(page);
 				visibilitymap_set(onerel, blkno, buf, InvalidXLogRecPtr,
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 9f60687..07447ec 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -172,7 +172,6 @@ DecodeXLogOp(LogicalDecodingContext *ctx, XLogRecordBuffer *buf)
 		case XLOG_PARAMETER_CHANGE:
 		case XLOG_RESTORE_POINT:
 		case XLOG_FPW_CHANGE:
-		case XLOG_FPI_FOR_HINT:
 		case XLOG_FPI:
 			break;
 		default:
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 86b532d..c2ebeb5 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -226,6 +226,14 @@ typedef struct xl_restore_point
 	char		rp_name[MAXFNAMELEN];
 } xl_restore_point;
 
+/* log of full-page write */
+typedef struct xl_restore_fpi
+{
+	bool	for_hint_bits;	/* image logged because of hint bit update */
+	bool	is_flush;		/* image to be flushed immediately to disk
+							 * after replay */
+} xl_restore_fpi;
+
 /* End of recovery mark, when we don't do an END_OF_RECOVERY checkpoint */
 typedef struct xl_end_of_recovery
 {
diff --git a/src/include/access/xloginsert.h b/src/include/access/xloginsert.h
index 31b45ba..491caa5 100644
--- a/src/include/access/xloginsert.h
+++ b/src/include/access/xloginsert.h
@@ -53,8 +53,10 @@ extern void XLogResetInsertion(void);
 extern bool XLogCheckBufferNeedsBackup(Buffer buffer);
 
 extern XLogRecPtr log_newpage(RelFileNode *rnode, ForkNumber forkNum,
-			BlockNumber blk, char *page, bool page_std);
-extern XLogRecPtr log_newpage_buffer(Buffer buffer, bool page_std);
+				  BlockNumber blk, char *page, bool page_std,
+				  bool is_flush);
+extern XLogRecPtr log_newpage_buffer(Buffer buffer, bool page_std,
+				  bool is_flush);
 extern XLogRecPtr XLogSaveBufferForHint(Buffer buffer, bool buffer_std);
 
 extern void InitXLogInsert(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ad1eb4b..c6690be 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -71,8 +71,7 @@ typedef struct CheckPoint
 #define XLOG_RESTORE_POINT				0x70
 #define XLOG_FPW_CHANGE					0x80
 #define XLOG_END_OF_RECOVERY			0x90
-#define XLOG_FPI_FOR_HINT				0xA0
-#define XLOG_FPI						0xB0
+#define XLOG_FPI						0xA0
 
 
 /*
-- 
2.6.3
