Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-05-02 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 What we could have is the semantics of Return a buffer, with either 
 correct contents or completely zeroed out. It would act just like 
 ReadBuffer if the buffer was already in memory, and zero out the page 
 otherwise. That's a bit strange semantics to have, but is simple to 
 implement and works for the use-cases we've been talking about.

 Patch implementing that attached. I named the function ReadOrZeroBuffer.

Applied.  BTW, I realized that there is a potential issue created by
this, which is that the smgr level might see a write for a page that it
never saw a read for.  I don't think there are any bad consequences of
this ATM, but it is skating around the edges of some bugs we've had
previously with relation extension.  In particular ReadOrZeroBuffer
avoids the error that would normally occur if one tries to read a page
that's beyond the logical EOF; and if the page is subsequently modified
and written, md.c is likely to get confused/unhappy, particularly if the
page is beyond the next segment boundary.  This isn't a problem in
XLogReadBuffer's usage because it carefully checks the EOF position
before trying to use ReadOrZeroBuffer, but it's a limitation other
callers will need to think about.

regards, tom lane

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-28 Thread Simon Riggs
On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:
 Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
  As regards the zero_damaged_pages question, I raised that some time ago
  but we didn't arrive at an explicit answer. All I would say is we can't
  allow invalid pages in the buffer manager at any time, whatever options
  we have requested, otherwise other code will fail almost immediately.
  
  Yeah --- the proposed new bufmgr routine should probably explicitly zero
  the content of the buffer.  It doesn't really matter in the context of
  WAL recovery, since there can't be any concurrent access to the buffer,
  but it'd make it safe to use in non-WAL contexts (I think there are
  other places where we know we are going to init the page and so a
  physical read is a waste of time).  
 
 To implement that correctly, I think we'd need to take the content lock 
 to clear the buffer if it's already found in the cache. It doesn't seem 
 right to me for the buffer manager to do that, in the worst case it 
 could lead to deadlocks if that function was ever used while holding 
 another buffer locked.
 
 What we could have is the semantics of Return a buffer, with either 
 correct contents or completely zeroed out. It would act just like 
 ReadBuffer if the buffer was already in memory, and zero out the page 
 otherwise. That's a bit strange semantics to have, but is simple to 
 implement and works for the use-cases we've been talking about.

Sounds good.

 Patch implementing that attached. I named the function ReadOrZeroBuffer.

We already have an API quirk similar to this: relation extension. It
seems strange to have two different kinds of special case API that are
used alongside each other in XLogReadBuffer()

Currently if we extend by a block we say
buffer = ReadBuffer(reln, P_NEW);

Why not just add another option, so where you use ReadOrZeroBuffer we
just say
buffer = ReadBuffer(reln, P_INIT);

which we then check for on entry by saying
isInit = (blockNum == P_INIT);
just as we already do for P_NEW

That way you can do the code like this
if (isExtend || isInit)
{
/* new or initialised buffers are zero-filled */
MemSet((char *) bufBlock, 0, BLCKSZ);
if (isExtend)
smgrextend(reln-rd_smgr, blockNum, 
   (char *) bufBlock,
   reln-rd_istemp);
}

That way we don't have to have ReadBuffer_common etc..

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-28 Thread Heikki Linnakangas

Simon Riggs wrote:

On Fri, 2007-04-27 at 12:22 +0100, Heikki Linnakangas wrote:

Patch implementing that attached. I named the function ReadOrZeroBuffer.


We already have an API quirk similar to this: relation extension. It
seems strange to have two different kinds of special case API that are
used alongside each other in XLogReadBuffer()

Currently if we extend by a block we say
buffer = ReadBuffer(reln, P_NEW);

Why not just add another option, so where you use ReadOrZeroBuffer we
just say
buffer = ReadBuffer(reln, P_INIT);


Because ReadOrZeroBuffer needs the block number as an argument.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-28 Thread Heikki Linnakangas
I was actually thinking that we could slip this in 8.3. It's a simple, 
well-understood patch, which fixes a little data integrity quirk as well 
as gives a nice recovery speed up.


Bruce Momjian wrote:

I assume this is 8.4 material.

---

Heikki Linnakangas wrote:

Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

As regards the zero_damaged_pages question, I raised that some time ago
but we didn't arrive at an explicit answer. All I would say is we can't
allow invalid pages in the buffer manager at any time, whatever options
we have requested, otherwise other code will fail almost immediately.

Yeah --- the proposed new bufmgr routine should probably explicitly zero
the content of the buffer.  It doesn't really matter in the context of
WAL recovery, since there can't be any concurrent access to the buffer,
but it'd make it safe to use in non-WAL contexts (I think there are
other places where we know we are going to init the page and so a
physical read is a waste of time).  
To implement that correctly, I think we'd need to take the content lock 
to clear the buffer if it's already found in the cache. It doesn't seem 
right to me for the buffer manager to do that, in the worst case it 
could lead to deadlocks if that function was ever used while holding 
another buffer locked.


What we could have is the semantics of Return a buffer, with either 
correct contents or completely zeroed out. It would act just like 
ReadBuffer if the buffer was already in memory, and zero out the page 
otherwise. That's a bit strange semantics to have, but is simple to 
implement and works for the use-cases we've been talking about.


Patch implementing that attached. I named the function ReadOrZeroBuffer.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com




---(end of broadcast)---
TIP 6: explain analyze is your friend





--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

   http://www.postgresql.org/about/donate


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-28 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 I was actually thinking that we could slip this in 8.3. It's a simple, 
 well-understood patch, which fixes a little data integrity quirk as well 
 as gives a nice recovery speed up.

Yeah.  It's arguably a bug fix, in fact, since it eliminates the issue
that the recovery behavior is wrong if full-page-writes had been off
when the WAL log was made.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-27 Thread Heikki Linnakangas

Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

As regards the zero_damaged_pages question, I raised that some time ago
but we didn't arrive at an explicit answer. All I would say is we can't
allow invalid pages in the buffer manager at any time, whatever options
we have requested, otherwise other code will fail almost immediately.


Yeah --- the proposed new bufmgr routine should probably explicitly zero
the content of the buffer.  It doesn't really matter in the context of
WAL recovery, since there can't be any concurrent access to the buffer,
but it'd make it safe to use in non-WAL contexts (I think there are
other places where we know we are going to init the page and so a
physical read is a waste of time).  


To implement that correctly, I think we'd need to take the content lock 
to clear the buffer if it's already found in the cache. It doesn't seem 
right to me for the buffer manager to do that, in the worst case it 
could lead to deadlocks if that function was ever used while holding 
another buffer locked.


What we could have is the semantics of Return a buffer, with either 
correct contents or completely zeroed out. It would act just like 
ReadBuffer if the buffer was already in memory, and zero out the page 
otherwise. That's a bit strange semantics to have, but is simple to 
implement and works for the use-cases we've been talking about.


Patch implementing that attached. I named the function ReadOrZeroBuffer.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlogutils.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c	5 Jan 2007 22:19:24 -	1.49
--- src/backend/access/transam/xlogutils.c	27 Apr 2007 10:44:37 -
***
*** 226,232 
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
! 		buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
--- 226,235 
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
! 		if (init)
! 			buffer = ReadOrZeroBuffer(reln, blkno);
! 		else
! 			buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 Mar 2007 18:34:55 -	1.216
--- src/backend/storage/buffer/bufmgr.c	27 Apr 2007 10:49:08 -
***
*** 17,22 
--- 17,26 
   *		and pin it so that no one can destroy it while this process
   *		is using it.
   *
+  * ReadOrZeroBuffer() -- like ReadBuffer, but clears out the contents of the 
+  *		page if it's not in cache already. Used in recovery when the page is
+  *		completely overwritten from WAL.
+  *
   * ReleaseBuffer() -- unpin a buffer
   *
   * MarkBufferDirty() -- mark a pinned buffer's contents as dirty.
***
*** 97,102 
--- 101,107 
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
    int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
  			bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***
*** 121,126 
--- 126,152 
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+ 	return ReadBuffer_common(reln, blockNum, false);
+ }
+ 
+ /*
+  * ReadOrZeroBuffer -- like ReadBuffer, but if the page isn't in buffer
+  *		cache already, it's filled with zeros instead of reading it from
+  *		disk. The caller is expected to rewrite it with real data,
+  *		regardless of the current contents.
+  */
+ Buffer
+ ReadOrZeroBuffer(Relation reln, BlockNumber blockNum)
+ {
+ 	return ReadBuffer_common(reln, blockNum, true);
+ }
+ 
+ /*
+  * ReadBuffer_common -- common logic of ReadBuffer and ReadZeroedBuffer
+  */
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool zeroPage)
+ {
  	volatile BufferDesc *bufHdr;
  	Block		bufBlock;
  	bool		found;
***
*** 253,269 
  	}
  	else
  	{
  		smgrread(reln-rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * During WAL recovery, the first access to any data page should
! 			 * overwrite the whole page from the WAL; so a clobbered page
! 			 * header is not reason to fail.  Hence, when InRecovery we may
! 			 * always act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || InRecovery)
  			{
  ereport(WARNING,
  		(errcode(ERRCODE_DATA_CORRUPTED),
--- 279,304 
  	

Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-27 Thread Alvaro Herrera
Heikki Linnakangas wrote:

 What we could have is the semantics of Return a buffer, with either 
 correct contents or completely zeroed out. It would act just like 
 ReadBuffer if the buffer was already in memory, and zero out the page 
 otherwise. That's a bit strange semantics to have, but is simple to 
 implement and works for the use-cases we've been talking about.

Huh, why does that work in the case where the recovery code reads a
page, then evicts it because of memory pressure, and later needs to read
it again?

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-27 Thread Heikki Linnakangas

Alvaro Herrera wrote:

Heikki Linnakangas wrote:

What we could have is the semantics of Return a buffer, with either 
correct contents or completely zeroed out. It would act just like 
ReadBuffer if the buffer was already in memory, and zero out the page 
otherwise. That's a bit strange semantics to have, but is simple to 
implement and works for the use-cases we've been talking about.


Huh, why does that work in the case where the recovery code reads a
page, then evicts it because of memory pressure, and later needs to read
it again?


I don't understand the problem. You only use ReadOrZeroBuffer when 
you're going to replace the contents entirely, and don't care about the 
old contents. If you want to read something in, you use ReadBuffer.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

  http://www.postgresql.org/docs/faq


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-27 Thread Alvaro Herrera
Heikki Linnakangas wrote:
 Alvaro Herrera wrote:
 Heikki Linnakangas wrote:
 
 What we could have is the semantics of Return a buffer, with either 
 correct contents or completely zeroed out. It would act just like 
 ReadBuffer if the buffer was already in memory, and zero out the page 
 otherwise. That's a bit strange semantics to have, but is simple to 
 implement and works for the use-cases we've been talking about.
 
 Huh, why does that work in the case where the recovery code reads a
 page, then evicts it because of memory pressure, and later needs to read
 it again?
 
 I don't understand the problem. You only use ReadOrZeroBuffer when 
 you're going to replace the contents entirely, and don't care about the 
 old contents. If you want to read something in, you use ReadBuffer.

Oh, the recovery code selects which one to call based on the init
param, which is on the first hunk of the diff :-)  I forgot that, I was
just thinking in your description above.

-- 
Alvaro Herrerahttp://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-27 Thread Bruce Momjian

I assume this is 8.4 material.

---

Heikki Linnakangas wrote:
 Tom Lane wrote:
  Simon Riggs [EMAIL PROTECTED] writes:
  As regards the zero_damaged_pages question, I raised that some time ago
  but we didn't arrive at an explicit answer. All I would say is we can't
  allow invalid pages in the buffer manager at any time, whatever options
  we have requested, otherwise other code will fail almost immediately.
  
  Yeah --- the proposed new bufmgr routine should probably explicitly zero
  the content of the buffer.  It doesn't really matter in the context of
  WAL recovery, since there can't be any concurrent access to the buffer,
  but it'd make it safe to use in non-WAL contexts (I think there are
  other places where we know we are going to init the page and so a
  physical read is a waste of time).  
 
 To implement that correctly, I think we'd need to take the content lock 
 to clear the buffer if it's already found in the cache. It doesn't seem 
 right to me for the buffer manager to do that, in the worst case it 
 could lead to deadlocks if that function was ever used while holding 
 another buffer locked.
 
 What we could have is the semantics of Return a buffer, with either 
 correct contents or completely zeroed out. It would act just like 
 ReadBuffer if the buffer was already in memory, and zero out the page 
 otherwise. That's a bit strange semantics to have, but is simple to 
 implement and works for the use-cases we've been talking about.
 
 Patch implementing that attached. I named the function ReadOrZeroBuffer.
 
 -- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


 
 ---(end of broadcast)---
 TIP 6: explain analyze is your friend

-- 
  Bruce Momjian  [EMAIL PROTECTED]  http://momjian.us
  EnterpriseDB   http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-27 Thread Simon Riggs
On Fri, 2007-04-27 at 10:37 -0400, Bruce Momjian wrote:
 I assume this is 8.4 material.

I think its a small enough, performance-only change to allow it at this
time. It will provide considerable additional benefit for Warm Standby
servers.

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-26 Thread Jim Nasby

On Apr 25, 2007, at 2:48 PM, Heikki Linnakangas wrote:
In recovery, with full_pages_writes=on, we read in each page only  
to overwrite the contents with a full page image. That's a waste of  
time, and can have a surprisingly large effect on recovery time.


As a quick test on my laptop, I initialized a DBT-2 test with 5  
warehouses, and let it run for 2 minutes without think-times to  
generate some WAL. Then I did a kill -9 postmaster, and took a  
copy of the data directory to use for testing recovery.


With CVS HEAD, the recovery took ~ 2 minutes. With the attached  
patch, it took 5 seconds. (yes, I used the same not-yet-recovered  
data directory in both tests, and cleared the os cache with echo 1  
 /proc/sys/vm/drop_caches).


I was surprised how big a difference it makes, but when you think  
about it it's logical. Without the patch, it's doing roughly the  
same I/O as the test itself, reading in pages, modifying them, and  
writing them back. With the patch, all the reads are done  
sequentially from the WAL, and then written back in a batch at the  
end of the WAL replay which is a lot more efficient.


It's interesting that (with the patch) full_page_writes can  
*shorten* your recovery time. I've always thought it to have a  
purely negative effect on performance.


I'll leave it up to the jury if this tiny little change is  
appropriate after feature freeze...


While working on this, this comment in ReadBuffer caught my eye:


/*
 * During WAL recovery, the first access to any data page should
 * overwrite the whole page from the WAL; so a clobbered page
 * header is not reason to fail.  Hence, when InRecovery we may
 * always act as though zero_damaged_pages is ON.
 */
if (zero_damaged_pages || InRecovery)
{


But that assumption only holds if full_page_writes is enabled,  
right? I changed that in the attached patch as well, but if it  
isn't accepted that part of it should still be applied, I think.


So what happens if a backend is running with full_page_writes = off,  
someone edits postgresql.conf to turns it on and forgets to reload/ 
restart, and then we crash? You'll come up in recovery mode thinking  
that f_p_w was turned on, when in fact it wasn't.


ISTM that we need to somehow log what the status of full_page_writes  
is, if it's going to affect how recovery works.

--
Jim Nasby[EMAIL PROTECTED]
EnterpriseDB  http://enterprisedb.com  512.569.9461 (cell)



---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-26 Thread Zeugswetter Andreas ADI SD

 So what happens if a backend is running with full_page_writes 
 = off, someone edits postgresql.conf to turns it on and 
 forgets to reload/ restart, and then we crash? You'll come up 
 in recovery mode thinking that f_p_w was turned on, when in 
 fact it wasn't.
 
 ISTM that we need to somehow log what the status of 
 full_page_writes is, if it's going to affect how recovery works.

Optimally recovery should do this when confronted with a full page image
only. The full page is in the same WAL record that first touches a page,
so this should not need to depend on a setting.

Andreas

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-26 Thread Tom Lane
Jim Nasby [EMAIL PROTECTED] writes:
 So what happens if a backend is running with full_page_writes = off,  
 someone edits postgresql.conf to turns it on and forgets to reload/ 
 restart, and then we crash? You'll come up in recovery mode thinking  
 that f_p_w was turned on, when in fact it wasn't.

One of the advantages of the proposed patch is that it avoids having to
make any assumptions like that.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster


[HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Heikki Linnakangas
In recovery, with full_pages_writes=on, we read in each page only to 
overwrite the contents with a full page image. That's a waste of time, 
and can have a surprisingly large effect on recovery time.


As a quick test on my laptop, I initialized a DBT-2 test with 5 
warehouses, and let it run for 2 minutes without think-times to generate 
some WAL. Then I did a kill -9 postmaster, and took a copy of the data 
directory to use for testing recovery.


With CVS HEAD, the recovery took ~ 2 minutes. With the attached patch, 
it took 5 seconds. (yes, I used the same not-yet-recovered data 
directory in both tests, and cleared the os cache with echo 1  
/proc/sys/vm/drop_caches).


I was surprised how big a difference it makes, but when you think about 
it it's logical. Without the patch, it's doing roughly the same I/O as 
the test itself, reading in pages, modifying them, and writing them 
back. With the patch, all the reads are done sequentially from the WAL, 
and then written back in a batch at the end of the WAL replay which is a 
lot more efficient.


It's interesting that (with the patch) full_page_writes can *shorten* 
your recovery time. I've always thought it to have a purely negative 
effect on performance.


I'll leave it up to the jury if this tiny little change is appropriate 
after feature freeze...


While working on this, this comment in ReadBuffer caught my eye:


/*
 * During WAL recovery, the first access to any data page should
 * overwrite the whole page from the WAL; so a clobbered page
 * header is not reason to fail.  Hence, when InRecovery we may
 * always act as though zero_damaged_pages is ON.
 */
if (zero_damaged_pages || InRecovery)
{


But that assumption only holds if full_page_writes is enabled, right? I 
changed that in the attached patch as well, but if it isn't accepted 
that part of it should still be applied, I think.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlogutils.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c	5 Jan 2007 22:19:24 -	1.49
--- src/backend/access/transam/xlogutils.c	25 Apr 2007 11:40:09 -
***
*** 226,232 
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
! 		buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
--- 226,235 
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
! 		if(init)
! 			buffer = ZapBuffer(reln, blkno);
! 		else
! 			buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 Mar 2007 18:34:55 -	1.216
--- src/backend/storage/buffer/bufmgr.c	25 Apr 2007 11:44:27 -
***
*** 97,102 
--- 97,103 
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
    int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
  			bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***
*** 121,126 
--- 122,148 
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+ 	return ReadBuffer_common(reln, blockNum, false);
+ }
+ 
+ /*
+  * ZapBuffer -- like ReadBuffer, but doesn't read the contents of the page
+  *		from disk. The caller is expected to completely rewrite the page,
+  *		regardless of the current contents. This should only be used in
+  *		recovery where there's no concurrent readers that might see the
+  *		contents of the page before the caller rewrites it.
+  */
+ Buffer
+ ZapBuffer(Relation reln, BlockNumber blockNum)
+ {
+ 	Assert(InRecovery);
+ 
+ 	return ReadBuffer_common(reln, blockNum, true);
+ }
+ 
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only)
+ {
  	volatile BufferDesc *bufHdr;
  	Block		bufBlock;
  	bool		found;
***
*** 253,269 
  	}
  	else
  	{
  		smgrread(reln-rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * During WAL recovery, the first access to any data page should
! 			 * overwrite the whole page from the WAL; so a clobbered page
! 			 * header is not reason to fail.  Hence, when InRecovery we may
! 			 * always act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || InRecovery)
  			{
  

Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Heikki Linnakangas

Heikki Linnakangas wrote:

While working on this, this comment in ReadBuffer caught my eye:


/*
 * During WAL recovery, the first access to any data page should
 * overwrite the whole page from the WAL; so a clobbered page
 * header is not reason to fail.  Hence, when InRecovery we may
 * always act as though zero_damaged_pages is ON.
 */
if (zero_damaged_pages || InRecovery)
{


But that assumption only holds if full_page_writes is enabled, right? I 
changed that in the attached patch as well, but if it isn't accepted 
that part of it should still be applied, I think.


On second thought, my fix still isn't 100% right because one could turn 
full_page_writes on before starting replay.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Gregory Stark
Heikki Linnakangas [EMAIL PROTECTED] writes:

 While working on this, this comment in ReadBuffer caught my eye:

  /*
   * During WAL recovery, the first access to any data page should
   * overwrite the whole page from the WAL; so a clobbered page
   * header is not reason to fail.  Hence, when InRecovery we may
   * always act as though zero_damaged_pages is ON.
   */
  if (zero_damaged_pages || InRecovery)
  {

 But that assumption only holds if full_page_writes is enabled, right? I 
 changed
 that in the attached patch as well, but if it isn't accepted that part of it
 should still be applied, I think.

Well it's only true if full_page_writes was on when the WAL was written. Which
isn't necessarily the same as saying it's enabled during recovery...

As long as there's a backup block in the log we can use it to clobber pages in
the heap -- which is what your patch effectively does anyways. If we're
replaying a log entry where there isn't a backup block and we find a damaged
page then we're in trouble. Either the damaged page was in a previous backup
block or it's the recovery itself that's damaging it. 

In the latter case it would be pretty useful to abort the recovery so the user
doesn't lose his backup and has a chance to recovery properly (possibly after
reporting and fixing the bug).

So in short I think with your patch this piece of code no longer has a role.
Either your patch kicks in and we never even look at the damaged page at all,
or we should be treating it as corrupt data and just check zero_damaged_pages
alone and not do anything special in recovery.

-- 
  Gregory Stark
  EnterpriseDB  http://www.enterprisedb.com


---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Heikki Linnakangas

Gregory Stark wrote:

So in short I think with your patch this piece of code no longer has a role.
Either your patch kicks in and we never even look at the damaged page at all,
or we should be treating it as corrupt data and just check zero_damaged_pages
alone and not do anything special in recovery.


Good point. Adjusted patch attached. I added some comments as well.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
Index: src/backend/access/transam/xlogutils.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/access/transam/xlogutils.c,v
retrieving revision 1.49
diff -c -r1.49 xlogutils.c
*** src/backend/access/transam/xlogutils.c	5 Jan 2007 22:19:24 -	1.49
--- src/backend/access/transam/xlogutils.c	25 Apr 2007 14:27:05 -
***
*** 226,232 
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
! 		buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
--- 226,235 
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
! 		if (init)
! 			buffer = ZapBuffer(reln, blkno);
! 		else
! 			buffer = ReadBuffer(reln, blkno);
  	}
  	else
  	{
Index: src/backend/storage/buffer/bufmgr.c
===
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.216
diff -c -r1.216 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	30 Mar 2007 18:34:55 -	1.216
--- src/backend/storage/buffer/bufmgr.c	25 Apr 2007 14:36:15 -
***
*** 17,22 
--- 17,26 
   *		and pin it so that no one can destroy it while this process
   *		is using it.
   *
+  * ZapBuffer() -- like ReadBuffer, but destroys the contents of the 
+  *		page. Used in recovery when the page is completely overwritten 
+  *		from WAL.
+  *
   * ReleaseBuffer() -- unpin a buffer
   *
   * MarkBufferDirty() -- mark a pinned buffer's contents as dirty.
***
*** 97,102 
--- 101,107 
  static void TerminateBufferIO(volatile BufferDesc *buf, bool clear_dirty,
    int set_flag_bits);
  static void buffer_write_error_callback(void *arg);
+ static Buffer ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only);
  static volatile BufferDesc *BufferAlloc(Relation reln, BlockNumber blockNum,
  			bool *foundPtr);
  static void FlushBuffer(volatile BufferDesc *buf, SMgrRelation reln);
***
*** 121,126 
--- 126,155 
  Buffer
  ReadBuffer(Relation reln, BlockNumber blockNum)
  {
+ 	return ReadBuffer_common(reln, blockNum, false);
+ }
+ 
+ /*
+  * ZapBuffer -- like ReadBuffer, but doesn't read the contents of the page
+  *		from disk. The caller is expected to completely rewrite the page,
+  *		regardless of the current contents. This should only be used in
+  *		recovery where there's no concurrent readers that might see the
+  *		contents of the page before the caller rewrites it.
+  */
+ Buffer
+ ZapBuffer(Relation reln, BlockNumber blockNum)
+ {
+ 	Assert(InRecovery);
+ 
+ 	return ReadBuffer_common(reln, blockNum, true);
+ }
+ 
+ /*
+  * ReadBuffer_common -- common logic of ReadBuffer and ZapBuffer.
+  */
+ static Buffer
+ ReadBuffer_common(Relation reln, BlockNumber blockNum, bool alloc_only)
+ {
  	volatile BufferDesc *bufHdr;
  	Block		bufBlock;
  	bool		found;
***
*** 253,269 
  	}
  	else
  	{
  		smgrread(reln-rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * During WAL recovery, the first access to any data page should
! 			 * overwrite the whole page from the WAL; so a clobbered page
! 			 * header is not reason to fail.  Hence, when InRecovery we may
! 			 * always act as though zero_damaged_pages is ON.
  			 */
! 			if (zero_damaged_pages || InRecovery)
  			{
  ereport(WARNING,
  		(errcode(ERRCODE_DATA_CORRUPTED),
--- 282,304 
  	}
  	else
  	{
+ 		/* 
+ 		 * Read in the page, unless the caller intends to overwrite it
+ 		 * and just wants us to allocate a buffer.
+ 		 */
+ 		if (!alloc_only)
+ 		{
  		smgrread(reln-rd_smgr, blockNum, (char *) bufBlock);
  		/* check for garbage data */
  		if (!PageHeaderIsValid((PageHeader) bufBlock))
  		{
  			/*
! 			 * We used to ignore corrupt pages in WAL recovery, but
! 			 * that was only ever safe if full_page_writes was enabled.
! 			 * Now the caller sets alloc_only to false if he intends to 
! 			 * overwrite the whole page, which is already checked above.
  			 */
! 			if (zero_damaged_pages)
  			{
  ereport(WARNING,
  		(errcode(ERRCODE_DATA_CORRUPTED),
***
*** 277,282 
--- 312,318 
   errmsg(invalid page header in block %u of relation \%s\,
  		blockNum, RelationGetRelationName(reln;
  		}
+ 		}
  	}
  
  	if (isLocalBuf)
Index: src/include/storage/bufmgr.h
===
RCS file: 

Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Simon Riggs
On Wed, 2007-04-25 at 13:48 +0100, Heikki Linnakangas wrote:

 I was surprised how big a difference it makes, but when you think about 
 it it's logical. Without the patch, it's doing roughly the same I/O as 
 the test itself, reading in pages, modifying them, and writing them 
 back. With the patch, all the reads are done sequentially from the WAL, 
 and then written back in a batch at the end of the WAL replay which is a 
 lot more efficient.

Interesting patch.

It would be good to see a longer term test. I'd expect things to fall
back to about 50% of the time on a longer recovery where the writes need
to be written out of cache.

I was thinking of getting the bgwriter to help out to improve matters
there.


Patch-wise, I love the name ZapBuffer() but would think that
OverwriteBuffer() would be more descriptive.

As regards the zero_damaged_pages question, I raised that some time ago
but we didn't arrive at an explicit answer. All I would say is we can't
allow invalid pages in the buffer manager at any time, whatever options
we have requested, otherwise other code will fail almost immediately.
I'm not sure there's another option?

-- 
  Simon Riggs 
  EnterpriseDB   http://www.enterprisedb.com



---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Tom Lane
Simon Riggs [EMAIL PROTECTED] writes:
 As regards the zero_damaged_pages question, I raised that some time ago
 but we didn't arrive at an explicit answer. All I would say is we can't
 allow invalid pages in the buffer manager at any time, whatever options
 we have requested, otherwise other code will fail almost immediately.

Yeah --- the proposed new bufmgr routine should probably explicitly zero
the content of the buffer.  It doesn't really matter in the context of
WAL recovery, since there can't be any concurrent access to the buffer,
but it'd make it safe to use in non-WAL contexts (I think there are
other places where we know we are going to init the page and so a
physical read is a waste of time).  Also, this would let the patch be

+   if (alloc_only)
+   MemSet...
+   else
smgrread...

and you don't need to hack out the PageHeaderIsValid test, since it will
allow zeroed pages.

Possibly ReadZeroedBuffer would be a better name?

regards, tom lane

---(end of broadcast)---
TIP 7: You can help support the PostgreSQL project by donating at

http://www.postgresql.org/about/donate


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Heikki Linnakangas

Tom Lane wrote:

Simon Riggs [EMAIL PROTECTED] writes:

As regards the zero_damaged_pages question, I raised that some time ago
but we didn't arrive at an explicit answer. All I would say is we can't
allow invalid pages in the buffer manager at any time, whatever options
we have requested, otherwise other code will fail almost immediately.


Yeah --- the proposed new bufmgr routine should probably explicitly zero
the content of the buffer.  It doesn't really matter in the context of
WAL recovery, since there can't be any concurrent access to the buffer,
but it'd make it safe to use in non-WAL contexts (I think there are
other places where we know we are going to init the page and so a
physical read is a waste of time).


Is there? I can't think of any. Extending a relation doesn't count.

Zeroing the buffer explicitly might be a good idea anyway. I agree it 
feels a bit dangerous to have a moment when there's a garbled page in 
buffer cache, even if only in recovery.



Also, this would let the patch be

+   if (alloc_only)
+   MemSet...
+   else
smgrread...

and you don't need to hack out the PageHeaderIsValid test, since it will
allow zeroed pages.


Well, I'd still put the PageHeaderIsValid test in the else-branch. Not 
like the few cycles spent on the test matter, but I don't see a reason 
not to.



Possibly ReadZeroedBuffer would be a better name?


Yeah, sounds good. Read is a bit misleading, since it doesn't actually 
read in the buffer, but it's good that the name is closely related to 
ReadBuffer.


--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
  subscribe-nomail command to [EMAIL PROTECTED] so that your
  message can get through to the mailing list cleanly


Re: [HACKERS] Avoiding unnecessary reads in recovery

2007-04-25 Thread Tom Lane
Heikki Linnakangas [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 but it'd make it safe to use in non-WAL contexts (I think there are
 other places where we know we are going to init the page and so a
 physical read is a waste of time).

 Is there? I can't think of any. Extending a relation doesn't count.

No, but re-using a free page in an index does.  I'm not sure which index
AMs know for sure the page is free, and which have to read it and check,
but I think there's at least some scope for that.

regards, tom lane

---(end of broadcast)---
TIP 2: Don't 'kill -9' the postmaster