Re: [HACKERS] double-buffering page writes

2008-10-23 Thread Alvaro Herrera
ITAGAKI Takahiro wrote:

 I have some comments about the double-buffering:

Since posting this patch I have realized that this implementation is
bogus.  I'm now playing with WAL-logging hint bits though.  As to your
questions:

 - Are there any performance degradation because of addtional memcpy?
   8kB of memcpy seems not to be free.

Of course, it is not free.  However it comes with the benefit that we
can release the io_in_progress lock earlier for the block -- we lock,
copy, unlock; whereas the old code did lock, write(), unlock.  Avoding a
system call in the locked area could be a win.  Whether this is a net
benefit is something that I have not measured.


 - Is it ok to allocale dblbuf[BLCKSZ] as local variable?
   It might be unaligned. AFAICS we avoid such usages in other places.

I thought about that too.  I admit I am not sure if this really works
portably; however I don't want to add a palloc() to that routine.

 - It is the best if we can delay double-buffering until locks are
   conflicted actually. But we might need to allocale shadow buffers
   from shared buffers instead of local memory.

The point of double-buffering is that the potential writer (a process
doing concurrent hint-bit setting) is not going to grab any locks.

 - Are there any other modules that can share in the benefits of
   double-buffering? For example, we could avoid avoid waiting for
   LockBufferForCleanup(). It is cool if the double-buffering can
   be used for multiple purposes.

Not sure on this.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double-buffering page writes

2008-10-23 Thread Heikki Linnakangas

Alvaro Herrera wrote:

ITAGAKI Takahiro wrote:


I have some comments about the double-buffering:


Since posting this patch I have realized that this implementation is
bogus.  I'm now playing with WAL-logging hint bits though.  


Yeah, the torn page + hint bit updates problem is the tough question.


- Is it ok to allocale dblbuf[BLCKSZ] as local variable?
  It might be unaligned. AFAICS we avoid such usages in other places.


I thought about that too.  I admit I am not sure if this really works
portably; however I don't want to add a palloc() to that routine.


It should work, AFAIK, but unaligned memcpy()s and write()s can be a 
significantly slower. There can be only one write() happening at any 
time, so you could just palloc() a single 8k buffer in TopMemoryContext 
in backend startup, and always use that.



- Are there any other modules that can share in the benefits of
  double-buffering? For example, we could avoid avoid waiting for
  LockBufferForCleanup(). It is cool if the double-buffering can
  be used for multiple purposes.


Not sure on this.


You'd need to keep both versions of the buffer simultaneously in the 
buffer cache for that. When we talked about the various designs for HOT, 
that was one of the ideas I had to enable more aggressive pruning: if 
you can't immediately get a vacuum lock, allocate a new buffer in the 
buffer cache for the same block, copy the page to the new buffer, and do 
the pruning, including moving tuples around, there. Any new ReadBuffer 
calls would return the new page version, but old readers would keep 
referencing the old one. The intrusive part of that approach, in 
addition to the obvious changes required in the buffer manager to keep 
around multiple copies of the same block, is that all modifications must 
be done on the new version, so anyone who needs to lock the page for 
modification would need to switch to the new page version at the 
LockBuffer call.


As discussed in the other thread with Simon, we also use vacuum locks in 
b-tree for waiting out index scans, so avoiding the waiting there would 
be just wrong.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double-buffering page writes

2008-10-23 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Alvaro Herrera wrote:
 I thought about that too.  I admit I am not sure if this really works
 portably; however I don't want to add a palloc() to that routine.

 It should work, AFAIK, but unaligned memcpy()s and write()s can be a 
 significantly slower. There can be only one write() happening at any 
 time, so you could just palloc() a single 8k buffer in TopMemoryContext 
 in backend startup, and always use that.

Some time ago, we arranged for shared buffers to be aligned on *more*
than maxalign boundaries (cf BUFFERALIGN) because on at least some
platforms this makes a very significant difference in the speed of
copying to/from kernel space.  If you are going to double-buffer it
is going to be important to have the intermediate buffer just as well
aligned.  A local char array won't be acceptable, and even for a
palloc'd one you'll need to take some extra steps (like wasting
ALIGNOF_BUFFER extra bytes so you can align the pointer palloc gives).

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] double-buffering page writes

2008-10-22 Thread ITAGAKI Takahiro

Alvaro Herrera [EMAIL PROTECTED] wrote:

 I'm trying to see if it makes sense to do the double-buffering of page
 writes before going further ahead with CRC checking.  I came up with the
 attached patch; it does the double-buffering inconditionally, because as
 it was said, it allows releasing the io_in_progress lock (and resetting
 BM_IO_IN_PROGRESS) early.

I have some comments about the double-buffering:

- Are there any performance degradation because of addtional memcpy?
  8kB of memcpy seems not to be free.

- Is it ok to allocale dblbuf[BLCKSZ] as local variable?
  It might be unaligned. AFAICS we avoid such usages in other places.

- It is the best if we can delay double-buffering until locks are
  conflicted actually. But we might need to allocale shadow buffers
  from shared buffers instead of local memory.

- Are there any other modules that can share in the benefits of
  double-buffering? For example, we could avoid avoid waiting for
  LockBufferForCleanup(). It is cool if the double-buffering can
  be used for multiple purposes.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] double-buffering page writes

2008-10-21 Thread Alvaro Herrera
Hi,

I'm trying to see if it makes sense to do the double-buffering of page
writes before going further ahead with CRC checking.  I came up with the
attached patch; it does the double-buffering inconditionally, because as
it was said, it allows releasing the io_in_progress lock (and resetting
BM_IO_IN_PROGRESS) early.

So far I have not managed to convince me that this is a correct change
to make; the io_in_progress bits are pretty convoluted -- for example, I
wonder how does releasing the buffer early (before actually sending
the write to the kernel and marking it not dirty) interact with
checkpoint and a possible full-page image.

Basically the change is to take the unsetting of BM_DIRTY out of
TerminateBufferIO and into its own routine; and in FlushBuffer, release
io_in_progress just after copying the buffer contents elsewhere, and
mark the buffer not dirty after actually doing the write.

Thoughts?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /home/alvherre/Code/cvs/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.239
diff -c -p -r1.239 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	20 Oct 2008 21:11:15 -	1.239
--- src/backend/storage/buffer/bufmgr.c	21 Oct 2008 15:30:38 -
*** static void BufferSync(int flags);
*** 84,91 
  static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
  static void WaitIO(volatile BufferDesc *buf);
  static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
! static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
!   int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
  static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
  			BlockNumber blockNum,
--- 84,91 
  static int	SyncOneBuffer(int buf_id, bool skip_recently_used);
  static void WaitIO(volatile BufferDesc *buf);
  static bool StartBufferIO(volatile BufferDesc *buf, bool forInput);
! static void TerminateBufferIO(volatile BufferDesc *buf, int set_flag_bits);
! static void MarkBufferNotDirty(volatile BufferDesc *buf);
  static void buffer_write_error_callback(void *arg);
  static volatile BufferDesc *BufferAlloc(SMgrRelation smgr, ForkNumber forkNum,
  			BlockNumber blockNum,
*** ReadBuffer_common(SMgrRelation smgr, boo
*** 395,401 
  	else
  	{
  		/* Set BM_VALID, terminate IO, and wake up any waiters */
! 		TerminateBufferIO(bufHdr, false, BM_VALID);
  	}
  
  	if (VacuumCostActive)
--- 395,401 
  	else
  	{
  		/* Set BM_VALID, terminate IO, and wake up any waiters */
! 		TerminateBufferIO(bufHdr, BM_VALID);
  	}
  
  	if (VacuumCostActive)
*** FlushBuffer(volatile BufferDesc *buf, SM
*** 1792,1797 
--- 1792,1798 
  {
  	XLogRecPtr	recptr;
  	ErrorContextCallback errcontext;
+ 	char		dblbuf[BLCKSZ];
  
  	/*
  	 * Acquire the buffer's io_in_progress lock.  If StartBufferIO returns
*** FlushBuffer(volatile BufferDesc *buf, SM
*** 1834,1856 
  	buf-flags = ~BM_JUST_DIRTIED;
  	UnlockBufHdr(buf);
  
  	smgrwrite(reln,
  			  buf-tag.forkNum,
  			  buf-tag.blockNum,
! 			  (char *) BufHdrGetBlock(buf),
  			  false);
  
  	BufferFlushCount++;
  
  	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(reln-smgr_rnode.spcNode,
  		 reln-smgr_rnode.dbNode, reln-smgr_rnode.relNode);
  
- 	/*
- 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
- 	 * end the io_in_progress state.
- 	 */
- 	TerminateBufferIO(buf, true, 0);
- 
  	/* Pop the error context stack */
  	error_context_stack = errcontext.previous;
  }
--- 1835,1863 
  	buf-flags = ~BM_JUST_DIRTIED;
  	UnlockBufHdr(buf);
  
+ 	/*
+ 	 * We make a copy of the buffer to write.  This allows us to release the
+ 	 * io_in_progress lock early, before actually doing the write.
+ 	 */
+ 	memcpy(dblbuf, BufHdrGetBlock(buf), BLCKSZ);
+ 
+ 	/* End the io_in_progress state. */
+ 	TerminateBufferIO(buf, 0);
+ 
  	smgrwrite(reln,
  			  buf-tag.forkNum,
  			  buf-tag.blockNum,
! 			  (char *) dblbuf,
  			  false);
  
+ 	/* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) */
+ 	MarkBufferNotDirty(buf);
+ 
  	BufferFlushCount++;
  
  	TRACE_POSTGRESQL_BUFFER_FLUSH_DONE(reln-smgr_rnode.spcNode,
  		 reln-smgr_rnode.dbNode, reln-smgr_rnode.relNode);
  
  	/* Pop the error context stack */
  	error_context_stack = errcontext.previous;
  }
*** StartBufferIO(volatile BufferDesc *buf, 
*** 2578,2595 
   *	We hold the buffer's io_in_progress lock
   *	The buffer is Pinned
   *
-  * If clear_dirty is TRUE and BM_JUST_DIRTIED is not set, we clear the
-  * buffer's BM_DIRTY flag.  This is appropriate when terminating a
-  * successful write.  The check on BM_JUST_DIRTIED is necessary to avoid
-  * marking the buffer clean if it was