At 2014-08-20 11:07:44 +0300, hlinnakan...@vmware.com wrote:
>
> I don't think the new GetBufferWithoutRelcache function is in line
> with the existing ReadBuffer API. I think it would be better to add a
> new ReadBufferMode - RBM_CACHE_ONLY? - that only returns the buffer if
> it's already in cache, and InvalidBuffer otherwise.

Hi Heikki.

I liked the idea of having a new ReadBufferMode of RBM_CACHE_ONLY, but I
wasn't happy with the result when I tried to modify the code to suit. It
didn't make the code easier for me to follow.

(I'll say straight up that it can be done the way you said, and if the
consensus is that it would be an improvement, I'll do it that way. I'm
just not convinced about it, hence this mail.)

For example:

> static Buffer
> ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
>                   BlockNumber blockNum, ReadBufferMode mode,
>                   BufferAccessStrategy strategy, bool *hit)
> {
>     volatile BufferDesc *bufHdr;
>     Block       bufBlock;
>     bool        found;
>     bool        isExtend;
>     bool        isLocalBuf = SmgrIsTemp(smgr);
> 
>     *hit = false;
> 
>     /* Make sure we will have room to remember the buffer pin */
>     ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
> 
>     isExtend = (blockNum == P_NEW);

isLocalBuf and isExtend will both be false for the RBM_CACHE_ONLY case,
which is good for us. But that probably needs to be mentioned in a
comment here.
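
Something like this, perhaps (wording is just a sketch):

    /*
     * Note: the only caller that passes RBM_CACHE_ONLY is recovery, which
     * never uses temp relations and never passes P_NEW, so isLocalBuf and
     * isExtend are both false in that case.
     */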

>     TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
>                                        smgr->smgr_rnode.node.spcNode,
>                                        smgr->smgr_rnode.node.dbNode,
>                                        smgr->smgr_rnode.node.relNode,
>                                        smgr->smgr_rnode.backend,
>                                        isExtend);

We know we're not going to be doing any IO in the RBM_CACHE_ONLY case, but
that's probably OK; I don't understand the TRACE_* machinery well. Either
way, we need to either skip this probe or also fire the corresponding
TRACE_*_DONE later.
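
Skipping it could be as simple as this (just a sketch, assuming we stay in
ReadBuffer_common rather than returning early):

    if (mode != RBM_CACHE_ONLY)
        TRACE_POSTGRESQL_BUFFER_READ_START(forkNum, blockNum,
                                           smgr->smgr_rnode.node.spcNode,
                                           smgr->smgr_rnode.node.dbNode,
                                           smgr->smgr_rnode.node.relNode,
                                           smgr->smgr_rnode.backend,
                                           isExtend);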

>     if (isLocalBuf)
>     {
>         bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
>         if (found)
>             pgBufferUsage.local_blks_hit++;
>         else
>             pgBufferUsage.local_blks_read++;
>     }
>     else
>     {
>         /*
>          * lookup the buffer.  IO_IN_PROGRESS is set if the requested block is
>          * not currently in memory.
>          */
>         bufHdr = BufferAlloc(smgr, relpersistence, forkNum, blockNum,
>                              strategy, &found);
>         if (found)
>             pgBufferUsage.shared_blks_hit++;
>         else
>             pgBufferUsage.shared_blks_read++;
>     }

The nicest option I could think of here was to copy the shared_buffers
lookup from BufferAlloc into a new BufferAlloc_shared function, and add
a new branch for RBM_CACHE_ONLY. That would look like:

    if (isLocalBuf)
        …
    else if (mode == RBM_CACHE_ONLY)
    {
        /*
         * Check whether the buffer is already cached; if it's not, return
         * InvalidBuffer, since a block that isn't in shared_buffers cannot
         * be pinned by anyone.
         */
        bufHdr = BufferAlloc_shared(…, &found);
        if (!found)
            return InvalidBuffer;
        LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
        return BufferDescriptorGetBuffer(bufHdr);
    }
    else
    {
        /*
         * lookup the buffer …
    }

…or, if we didn't do the lock-and-return immediately and instead carried on
with the rest of the code, we would fall through to this:

>     /* At this point we do NOT hold any locks. */
> 
>     /* if it was already in the buffer pool, we're done */
>     if (found)
>     {
>         if (!isExtend)
>         {
>             /* Just need to update stats before we exit */
>             *hit = true;
>             VacuumPageHit++;
> 
>             if (VacuumCostActive)
>                 VacuumCostBalance += VacuumCostPageHit;
> 
>             TRACE_POSTGRESQL_BUFFER_READ_DONE(forkNum, blockNum,
>                                               smgr->smgr_rnode.node.spcNode,
>                                               smgr->smgr_rnode.node.dbNode,
>                                               smgr->smgr_rnode.node.relNode,
>                                               smgr->smgr_rnode.backend,
>                                               isExtend,
>                                               found);
> 
>             /*
>              * In RBM_ZERO_AND_LOCK mode the caller expects the page to be
>              * locked on return.
>              */
>             if (!isLocalBuf)
>             {
>                 if (mode == RBM_ZERO_AND_LOCK)
>                     LWLockAcquire(bufHdr->content_lock, LW_EXCLUSIVE);
>                 else if (mode == RBM_ZERO_AND_CLEANUP_LOCK)
>                     LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
>             }
> 
>             return BufferDescriptorGetBuffer(bufHdr);
>         }

Some of this isn't applicable to RBM_CACHE_ONLY, so we could either skip
it until we reach the RBM_ZERO_AND_CLEANUP_LOCK part, or add a new
branch before the «if (!isExtend)» one.
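
The latter might look roughly like this (again only a sketch, mirroring the
earlier one):

    /* if it was already in the buffer pool, we're done */
    if (found)
    {
        if (mode == RBM_CACHE_ONLY)
        {
            /*
             * Nothing below applies to RBM_CACHE_ONLY; just take the
             * cleanup lock and hand back the pinned buffer.
             */
            LockBufferForCleanup(BufferDescriptorGetBuffer(bufHdr));
            return BufferDescriptorGetBuffer(bufHdr);
        }

        if (!isExtend)
        {
            …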

In summary, we either add one new branch that handles everything related to
RBM_CACHE_ONLY, side-stepping the parts that happen not to apply along the
way (isLocalBuf, isExtend, TRACE_*), or we scatter a bunch of RBM_CACHE_ONLY
tests throughout ReadBuffer_common.

But splitting BufferAlloc into BufferAlloc_shared and the rest of
BufferAlloc doesn't look especially nice either. If you look at the
shared_buffers lookup from BufferAlloc:

>     /* see if the block is in the buffer pool already */
>     LWLockAcquire(newPartitionLock, LW_SHARED);
>     buf_id = BufTableLookup(&newTag, newHash);
>     if (buf_id >= 0)
>     {
>         /*
>          * Found it.  Now, pin the buffer so no one can steal it from the
>          * buffer pool, and check to see if the correct data has been loaded
>          * into the buffer.
>          */
>         buf = GetBufferDescriptor(buf_id);
> 
>         valid = PinBuffer(buf, strategy);
> 
>         /* Can release the mapping lock as soon as we've pinned it */
>         LWLockRelease(newPartitionLock);
> 
>         *foundPtr = TRUE;
> 
>         if (!valid)
>         {
>             /*
>              * We can only get here if (a) someone else is still reading in
>              * the page, or (b) a previous read attempt failed.  We have to
>              * wait for any active read attempt to finish, and then set up our
>              * own read attempt if the page is still not BM_VALID.
>              * StartBufferIO does it all.
>              */
>             if (StartBufferIO(buf, true))
>             {
>                 /*
>                  * If we get here, previous attempts to read the buffer must
>                  * have failed ... but we shall bravely try again.
>                  */
>                 *foundPtr = FALSE;
>             }
>         }
> 
>         return buf;
>     }
> 
>     /*
>      * Didn't find it in the buffer pool.  We'll have to initialize a new
>      * buffer.  Remember to unlock the mapping lock while doing the work.
>      */
>     LWLockRelease(newPartitionLock);
> 
>     /* Loop here in case we have to try another victim buffer */
>     for (;;)
>     {

There are two options here: I split out a BufferAlloc_shared function, and
either make BufferAlloc call it or not.

If I make BufferAlloc call it, the LWLockAcquire/Release can move into the
_shared function without confusion, but I'll either have to return PinBuffer's
BM_VALID result through a validPtr out-parameter, or repeat that test in the
caller before calling StartBufferIO. The INIT_BUFFERTAG and BufTableHashCode()
calls would also have to appear in both functions, since their results are
needed for the BufTableInsert() later in BufferAlloc.
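
For concreteness, the validPtr variant would give the split-out lookup a
signature something like this (names are only placeholders):

    static volatile BufferDesc *
    BufferAlloc_shared(SMgrRelation smgr, ForkNumber forkNum,
                       BlockNumber blockNum, BufferAccessStrategy strategy,
                       bool *foundPtr, bool *validPtr);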

On the other hand, if I don't make BufferAlloc call the _shared
function, then we end up with a new few-line special-case function plus
either a self-contained new branch in ReadBuffer_common, or a bunch of
RBM_CACHE_ONLY tests. That doesn't really seem an improvement over what
the original patch did, i.e. introduce a single special-case function to
handle a single special-case access.

I'm open to suggestions.

-- Abhijit

P.S. For reference, I've attached the patch that I posted earlier,
rebased to apply to master.
commit 4cbbbd242b7ee6333e1d1a09794aa1cd9d39035f
Author: Abhijit Menon-Sen <a...@2ndquadrant.com>
Date:   Tue Jun 23 13:38:01 2015 +0530

    Introduce XLogLockBlockRangeForCleanup()
    
    When replaying B-Tree vacuum records, we need to confirm that a range of
    blocks is unpinned, but there was no way to ask the buffer manager whether
    a block was pinned without also making it read in pages that were not
    already cached.

    This patch exposes a new interface that returns a block's buffer if it is
    already in shared_buffers (and InvalidBuffer otherwise), and uses it in a
    new function that takes a cleanup lock on each block in a range without
    reading anything in from disk.

diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2f85867..e49a9ba 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -416,33 +416,10 @@ btree_xlog_vacuum(XLogReaderState *record)
 	{
 		RelFileNode thisrnode;
 		BlockNumber thisblkno;
-		BlockNumber blkno;
 
 		XLogRecGetBlockTag(record, 0, &thisrnode, NULL, &thisblkno);
-
-		for (blkno = xlrec->lastBlockVacuumed + 1; blkno < thisblkno; blkno++)
-		{
-			/*
-			 * We use RBM_NORMAL_NO_LOG mode because it's not an error
-			 * condition to see all-zero pages.  The original btvacuumpage
-			 * scan would have skipped over all-zero pages, noting them in FSM
-			 * but not bothering to initialize them just yet; so we mustn't
-			 * throw an error here.  (We could skip acquiring the cleanup lock
-			 * if PageIsNew, but it's probably not worth the cycles to test.)
-			 *
-			 * XXX we don't actually need to read the block, we just need to
-			 * confirm it is unpinned. If we had a special call into the
-			 * buffer manager we could optimise this so that if the block is
-			 * not in shared_buffers we confirm it as unpinned.
-			 */
-			buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
-											RBM_NORMAL_NO_LOG);
-			if (BufferIsValid(buffer))
-			{
-				LockBufferForCleanup(buffer);
-				UnlockReleaseBuffer(buffer);
-			}
-		}
+		XLogLockBlockRangeForCleanup(thisrnode, MAIN_FORKNUM,
+									 xlrec->lastBlockVacuumed + 1, thisblkno);
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index fa98b82..4023c7e 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -401,10 +401,6 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
  * In RBM_ZERO_* modes, if the page doesn't exist, the relation is extended
  * with all-zeroes pages up to the given block number.
  *
- * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
- * exist, and we don't check for all-zeroes.  Thus, no log entry is made
- * to imply that the page should be dropped or truncated later.
- *
  * NB: A redo function should normally not call this directly. To get a page
  * to modify, use XLogReplayBuffer instead. It is important that all pages
  * modified by a WAL record are registered in the WAL records, or they will be
@@ -449,8 +445,6 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 			log_invalid_page(rnode, forknum, blkno, false);
 			return InvalidBuffer;
 		}
-		if (mode == RBM_NORMAL_NO_LOG)
-			return InvalidBuffer;
 		/* OK to extend the file */
 		/* we do this in recovery only - no rel-extension lock needed */
 		Assert(InRecovery);
@@ -500,6 +494,62 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 }
 
 /*
+ * XLogLockBlockRangeForCleanup is used in Hot Standby mode to emulate
+ * the locking effects imposed by VACUUM for nbtrees.
+ */
+void
+XLogLockBlockRangeForCleanup(RelFileNode rnode, ForkNumber forkNum,
+							 BlockNumber startBlkno, BlockNumber uptoBlkno)
+{
+	BlockNumber blkno;
+	BlockNumber lastblock;
+	BlockNumber endingBlkno;
+	Buffer      buffer;
+	SMgrRelation smgr;
+
+	Assert(startBlkno != P_NEW);
+	Assert(uptoBlkno != P_NEW);
+
+	/* Open the relation at smgr level */
+	smgr = smgropen(rnode, InvalidBackendId);
+
+	/*
+	 * Create the target file if it doesn't already exist.  This lets us cope
+	 * if the replay sequence contains writes to a relation that is later
+	 * deleted.  (The original coding of this routine would instead suppress
+	 * the writes, but that seems like it risks losing valuable data if the
+	 * filesystem loses an inode during a crash.  Better to write the data
+	 * until we are actually told to delete the file.)
+	 */
+	smgrcreate(smgr, forkNum, true);
+
+	lastblock = smgrnblocks(smgr, forkNum);
+
+	endingBlkno = uptoBlkno;
+	if (lastblock < endingBlkno)
+		endingBlkno = lastblock;
+
+	for (blkno = startBlkno; blkno < endingBlkno; blkno++)
+	{
+		/*
+		 * All we need to do here is prove that we can lock each block
+		 * with a cleanup lock. It's not an error to see all-zero pages
+		 * here because the original btvacuumpage would not have thrown
+		 * an error either.
+		 *
+		 * We don't actually need to read the block, we just need to
+		 * confirm it is unpinned.
+		 */
+		buffer = GetBufferWithoutRelcache(rnode, forkNum, blkno);
+		if (BufferIsValid(buffer))
+		{
+			LockBufferForCleanup(buffer);
+			UnlockReleaseBuffer(buffer);
+		}
+	}
+}
+
+/*
  * Struct actually returned by XLogFakeRelcacheEntry, though the declared
  * return type is Relation.
  */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc973b5..24c409d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -529,8 +529,6 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * RBM_ZERO_AND_CLEANUP_LOCK is the same as RBM_ZERO_AND_LOCK, but acquires
  * a cleanup-strength lock on the page.
  *
- * RBM_NORMAL_NO_LOG mode is treated the same as RBM_NORMAL here.
- *
  * If strategy is not NULL, a nondefault buffer access strategy is used.
  * See buffer/README for details.
  */
@@ -591,6 +589,62 @@ ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 							 mode, strategy, &hit);
 }
 
+/*
+ * GetBufferWithoutRelcache returns Buffer iff available in shared_buffers,
+ * otherwise returns InvalidBuffer. Buffer is pinned, if available.
+ *
+ * Special purpose routine only executed during recovery, so uses a cut-down
+ * execution path rather than complicate ReadBuffer/AllocBuffer.
+ */
+Buffer
+GetBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
+						 BlockNumber blockNum)
+{
+	BufferTag	bufTag;			/* identity of requested block */
+	uint32		bufHash;		/* hash value for newTag */
+	LWLock	   *bufPartitionLock;		/* buffer partition lock for it */
+	int			buf_id;
+	volatile BufferDesc *buf;
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+	bool		valid = false;
+
+	Assert(InRecovery);
+
+	/* Make sure we will have room to remember the buffer pin */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(bufTag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	bufHash = BufTableHashCode(&bufTag);
+	bufPartitionLock = BufMappingPartitionLock(bufHash);
+
+	/* see if the block is in the buffer pool already */
+	LWLockAcquire(bufPartitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&bufTag, bufHash);
+	if (buf_id >= 0)
+	{
+		/*
+		 * Found it. Now, try to pin the buffer if it's valid, but leave
+		 * its usage count alone.
+		 */
+		buf = &BufferDescriptors[buf_id];
+
+		LockBufHdr(buf);
+		valid = (buf->flags & BM_VALID) != 0;
+		if (valid)
+			PinBuffer_Locked(buf);
+		else
+			UnlockBufHdr(buf);
+	}
+	LWLockRelease(bufPartitionLock);
+
+	if (valid)
+		return BufferDescriptorGetBuffer(buf);
+
+	return InvalidBuffer;
+}
 
 /*
  * ReadBuffer_common -- common logic for all ReadBuffer variants
diff --git a/src/include/access/xlogutils.h b/src/include/access/xlogutils.h
index 8cf51c7..b4cd762 100644
--- a/src/include/access/xlogutils.h
+++ b/src/include/access/xlogutils.h
@@ -43,7 +43,9 @@ extern XLogRedoAction XLogReadBufferForRedoExtended(XLogReaderState *record,
 
 extern Buffer XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 					   BlockNumber blkno, ReadBufferMode mode);
-
+extern void XLogLockBlockRangeForCleanup(RelFileNode rnode, ForkNumber forkNum,
+										 BlockNumber startBlkno,
+										 BlockNumber uptoBlkno);
 extern Relation CreateFakeRelcacheEntry(RelFileNode rnode);
 extern void FreeFakeRelcacheEntry(Relation fakerel);
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..b0bba41 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -40,9 +40,7 @@ typedef enum
 								 * initialize. Also locks the page. */
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
-	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
-	RBM_NORMAL_NO_LOG			/* Don't log page as invalid during WAL
-								 * replay; otherwise same as RBM_NORMAL */
+	RBM_ZERO_ON_ERROR			/* Read, but return an all-zeros page on error */
 } ReadBufferMode;
 
 /* in globals.c ... this duplicates miscadmin.h */
@@ -153,6 +151,8 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
 						  ReadBufferMode mode, BufferAccessStrategy strategy);
+extern Buffer GetBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
+									   BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);