Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-16 Thread Greg Stark
On Sat, Feb 15, 2014 at 11:45 AM, Andres Freund and...@2ndquadrant.com wrote:
 I guess the theoretically correct thing would be to make all WAL records
 about truncation and unlinking contain the current size of the relation,
 but especially with deletions and forks that will probably turn out to
 be annoying to do.

Here's another alternative.

In md.c when extending a file to RELSEG_SIZE always check if the next
segment is already there and truncate it if it is to avoid magically
slurping in that data. That maintains the invariant that the first
short segment will mark the end of the relation. If you have a short
or missing segment then you'll ignore all the later segments.

I think to make this work you would have to sync the newly truncated
segment first before extending the current segment though. And this
would cause every relation extension to do an extra filesystem lookup.
Perhaps only doing this in recovery or *with assertions enabled?)
would mitigate that cost.

This makes a mockery of the comment in xlogutils.c that we would
rather not lose data in the case of a lost inode. But I feel like the
data in the later segments was already lost before the earlier segment
was filled up, it hardly helps matters if it can sometimes be unlost
if the earlier data happens to get written to in a particular pattern.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-15 Thread Andres Freund
On 2014-02-14 22:30:45 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  On 2014-02-14 20:46:01 +, Greg Stark wrote:
  Going over this I think this is still a potential issue:
  On 31 Jan 2014 15:56, Andres Freund and...@2ndquadrant.com wrote:
  I am not sure that explains the issue, but I think the redo action for
  truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
  do a smgrtruncate() (and then mdtruncate) which will iterate over the
  segments starting at 0 till mdnblocks()/segment_size and *truncate* but
  not delete individual segment files that are not needed anymore, right?
  If we crash in the midst of that a new mdtruncate() will be issued, but
  it will get a shorter value back from mdnblocks().

 We could probably fix things so it deleted backwards; it'd be a tad
 tedious because the list structure isn't organized that way, but we
 could do it.

We could just make the list a doubly linked one, that'd make it simple.

 Not sure if that's good enough though.  If you don't
 want to assume the filesystem metadata is coherent after a crash,
 we might have nonzero-size segments after zero-size ones, even if
 the truncate calls had been issued in the right order.

I don't think that can actually happen on any realistic/interesting
FS. Metadata updates better be journaled, so while they might not
persist because the journal wasn't flushed, they should be applied in a
sane order after a crash.
But nonetheless I am not sure we want to rely on that.

 Another possibility is to keep opening and truncating files until
 we don't find the next segment in sequence, looking directly at the
 filesystem not at the mdfd chain.  I don't think this would be
 appropriate in normal operation, but we could do it if InRecovery
 (and maybe even only if we don't think the database is consistent?)

Yes, I was thinking of simply having a mdnblocks() variant that looks
for the last existing file, disregarding the size. But looking around,
it seems mdunlinkfork() has a similar issue, and I don't see how such a
trick could be applied there :(

I guess the theoretically correct thing would be to make all WAL records
about truncation and unlinking contain the current size of the relation,
but especially with deletions and forks that will probably turn out to
be annoying to do.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-14 Thread Greg Stark
Going over this I think this is still a potential issue:

On 31 Jan 2014 15:56, Andres Freund and...@2ndquadrant.com wrote:


 I am not sure that explains the issue, but I think the redo action for
 truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
 do a smgrtruncate() (and then mdtruncate) which will iterate over the
 segments starting at 0 till mdnblocks()/segment_size and *truncate* but
 not delete individual segment files that are not needed anymore, right?
 If we crash in the midst of that a new mdtruncate() will be issued, but
 it will get a shorter value back from mdnblocks().

 Am I missing something?


I'm not too familiar with md.c but my reading of the code is that we
truncate the files in reverse order? In which case I think the code is safe
*iff* the filesystem guarantees ordered meta data writes which I tihnk ext3
does (I think in all the journal modes). Most filesystems meta data writes
are synchronous so the truncates are safe for them too.

But we don't generally rely on meta data writes being ordered. I think the
correct thing to do is to record the nblocks prior to the truncate and
then have md.c expose a new function that takes that parameter and pokes
around looking for any segments it might need to clean up. But that would
involve lots of abstraction violations in md.c. I think using nblocks would
keep the violations within md.c but that still seems like a pain.


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-14 Thread Andres Freund
On 2014-02-14 20:46:01 +, Greg Stark wrote:
 Going over this I think this is still a potential issue:
 
 On 31 Jan 2014 15:56, Andres Freund and...@2ndquadrant.com wrote:
 
 
  I am not sure that explains the issue, but I think the redo action for
  truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
  do a smgrtruncate() (and then mdtruncate) which will iterate over the
  segments starting at 0 till mdnblocks()/segment_size and *truncate* but
  not delete individual segment files that are not needed anymore, right?
  If we crash in the midst of that a new mdtruncate() will be issued, but
  it will get a shorter value back from mdnblocks().
 
  Am I missing something?
 
 
 I'm not too familiar with md.c but my reading of the code is that we
 truncate the files in reverse order?

That's what I had assumed as well, but it doesn't look that way:

v = mdopen(reln, forknum, EXTENSION_FAIL);

priorblocks = 0; /* - initial value */
while (v != NULL)
{
MdfdVec*ov = v;

if (priorblocks  nblocks)
{
/* truncate entire segment */
}
else if (priorblocks + ((BlockNumber) RELSEG_SIZE)  nblocks)
{
/* truncate entire segment */
}
else
{
/* nothing to do, all needed */
}
priorblocks += RELSEG_SIZE;
}

So, according to that we start at segment 0, right?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-14 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2014-02-14 20:46:01 +, Greg Stark wrote:
 Going over this I think this is still a potential issue:
 On 31 Jan 2014 15:56, Andres Freund and...@2ndquadrant.com wrote:
 I am not sure that explains the issue, but I think the redo action for
 truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
 do a smgrtruncate() (and then mdtruncate) which will iterate over the
 segments starting at 0 till mdnblocks()/segment_size and *truncate* but
 not delete individual segment files that are not needed anymore, right?
 If we crash in the midst of that a new mdtruncate() will be issued, but
 it will get a shorter value back from mdnblocks().

 I'm not too familiar with md.c but my reading of the code is that we
 truncate the files in reverse order?

 That's what I had assumed as well, but it doesn't look that way:

No, it's deleting forward.

We could probably fix things so it deleted backwards; it'd be a tad
tedious because the list structure isn't organized that way, but we
could do it.  Not sure if that's good enough though.  If you don't
want to assume the filesystem metadata is coherent after a crash,
we might have nonzero-size segments after zero-size ones, even if
the truncate calls had been issued in the right order.

Another possibility is to keep opening and truncating files until
we don't find the next segment in sequence, looking directly at the
filesystem not at the mdfd chain.  I don't think this would be
appropriate in normal operation, but we could do it if InRecovery
(and maybe even only if we don't think the database is consistent?)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-13 Thread Andrea Suisani

Hi all,

On 02/12/2014 08:27 PM, Greg Stark wrote:

On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Greg Stark st...@mit.edu writes:

For what it's worth I've confirmed the bug in wal-e caused the initial
problem.


Huh?  Bug in wal-e?  What bug?


WAL-E actually didn't restore a whole 1GB file due to a transient S3
problem, in fact a bunch of them. It's remarkable that Postgres kept
going with that much data missing. But the arithmetic worked out on
the case I checked it on, which was the last one that I just sent the
xlog record for last night. In that case there was precisely one
segment missing and the relation was extended by the number of
segments you would expect if it filled in that missing segment and
then jumped to the end of the relation.


sorry for interrupting, but did we already notify wal-e's maintainer?

Andrea

ps cc:ed Daniel





--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-13 Thread Greg Stark
 I think what you're arguing is that we should see WAL records filling the
 rest of segment 1 before we see any references to segment 2, but if that's
 the case then how did we get into the situation you reported?  Or is it
 just that it was a broken base backup to start with?

The scenario I could come up with that didn't require a broken base backup
was that there was an earlier truncate or vacuum. So the sequence is high
offset reference, truncate, growth, crash. All possibly on a single
database.

It's possible we're better off not assuming we've thought of all possible
ways this can happen though.


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-13 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 I think what you're arguing is that we should see WAL records filling the
 rest of segment 1 before we see any references to segment 2, but if that's
 the case then how did we get into the situation you reported?  Or is it
 just that it was a broken base backup to start with?

 The scenario I could come up with that didn't require a broken base backup
 was that there was an earlier truncate or vacuum. So the sequence is high
 offset reference, truncate, growth, crash. All possibly on a single
 database.

That's not really an issue, because then it would be OK to discard the
high-offset update; we'd recognize that as safe when we replay the
truncation.

 It's possible we're better off not assuming we've thought of all possible
 ways this can happen though.

That's what's bothering me, too.  On the other hand, if we can't think of
a scenario where it'd be necessary to replay the high-offset update, then
I'm disinclined to mess with the code further.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-13 Thread Greg Stark
On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 The scenario I could come up with that didn't require a broken base backup
 was that there was an earlier truncate or vacuum. So the sequence is high
 offset reference, truncate, growth, crash. All possibly on a single
 database.

 That's not really an issue, because then it would be OK to discard the
 high-offset update; we'd recognize that as safe when we replay the
 truncation.

Yeah, that's my point.

 It's possible we're better off not assuming we've thought of all possible
 ways this can happen though.

 That's what's bothering me, too.  On the other hand, if we can't think of
 a scenario where it'd be necessary to replay the high-offset update, then
 I'm disinclined to mess with the code further.

And the whole point of the undefined page error checking is to detect
cases like this, so covering them up in the name of possible edge
cases we haven't thought of kind of defeats the purpose. In particular
I would have liked to get errors rather than soldier on when the
database found these missing segments.

In that vein, the other possibly open question was how we got past the
undefined pages errors that we did see. Andres said he thought that
was due to the bug where some piece of code was mistakenly using the
presence of a snapshot but I'm not clear how that can cause this
though.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-13 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Thu, Feb 13, 2014 at 7:52 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 That's what's bothering me, too.  On the other hand, if we can't think of
 a scenario where it'd be necessary to replay the high-offset update, then
 I'm disinclined to mess with the code further.

 And the whole point of the undefined page error checking is to detect
 cases like this, so covering them up in the name of possible edge
 cases we haven't thought of kind of defeats the purpose.

Yeah, good point.  Let's leave it alone and see if the existing fix
helps.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Greg Stark
So I think I've come up with a scenario that could cause this. I don't
think it's exactly what happened here but maybe something analogous
happened with our base backup restore.

On the primary you extend a table a bunch, including adding new
segments, but crash before committing (or checkpointing). Then some of
the blocks but not all may be written to disk. Assume they're all
written except for the last block of the first file. So what you have
is a .999G file followed by, day, 9 1G files. (Or maybe the hot backup
process could just catch the files in this state if a table is rapidly
growing and it doesn't take care to avoid picking up new files that
appear after it starts?)

smgrnblocks() stops at the first  1GB segment and ignores the rest.
This code in xlog uses it to calculate how many blocks to add but it
only calls it once and then doesn't recheck where it's at as it
extends the relation. As soon as it adds that one missing block the
remaining files become visible. P_NEW always recalculates the position
based on smgrnblocks each time (which sounds pretty  inefficient but
anyways) so it will add the requested blocks to the new end.

Now this isn't enough to explain things since surely the extensions
records would be in the xlog in physical order. But this could have
all happened after an earlier vacuum truncated the relation and we
could be replaying records that predate that.

So in short, if you have a 10G table and want to overwrite the last
block but the first segment is one block short then xlog will add 9G
to the end and write the block there. That sounds like what we've
seen.

I think the easy fix is to change the code in xlogutils to be more
defensive and stop as soon as it finds BufferGetBlockNumber(buffer) ==
blkno (which is what it has in the assert already).
-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Greg Stark
So here's my attempt to rewrite this logic. I ended up refactoring a
bit because I found it unnecessarily confusing having the mode
branches in several places. I think it's much clearer just having two
separate pieces of logic for RBM_NEW and the extension cases since all
they have in common is the ReadBuffer call.

I have to say, it scares the hell out of me that there are no
regression tests for this code. I'm certainly not comfortable
committing it without a few other people reading it if I haven't even
run the code once. At least I know it compiles...
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
***
*** 293,300  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  	   BlockNumber blkno, ReadBufferMode mode)
  {
- 	BlockNumber lastblock;
  	Buffer		buffer;
  	SMgrRelation smgr;
  
  	Assert(blkno != P_NEW);
--- 293,300 
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  	   BlockNumber blkno, ReadBufferMode mode)
  {
  	Buffer		buffer;
+ 	Page 		page;
  	SMgrRelation smgr;
  
  	Assert(blkno != P_NEW);
***
*** 312,353  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  	 */
  	smgrcreate(smgr, forknum, true);
  
! 	lastblock = smgrnblocks(smgr, forknum);
! 
! 	if (blkno  lastblock)
! 	{
! 		/* page exists in file */
! 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
! 		   mode, NULL);
! 	}
! 	else
  	{
  		/* hm, page doesn't exist in file */
! 		if (mode == RBM_NORMAL)
  		{
  			log_invalid_page(rnode, forknum, blkno, false);
  			return InvalidBuffer;
  		}
- 		/* OK to extend the file */
- 		/* we do this in recovery only - no rel-extension lock needed */
- 		Assert(InRecovery);
- 		buffer = InvalidBuffer;
- 		while (blkno = lastblock)
- 		{
- 			if (buffer != InvalidBuffer)
- ReleaseBuffer(buffer);
- 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
- 			   P_NEW, mode, NULL);
- 			lastblock++;
- 		}
- 		Assert(BufferGetBlockNumber(buffer) == blkno);
- 	}
  
! 	if (mode == RBM_NORMAL)
! 	{
! 		/* check that page has been initialized */
! 		Page		page = (Page) BufferGetPage(buffer);
  
  		/*
  		 * We assume that PageIsNew is safe without a lock. During recovery,
  		 * there should be no other backends that could modify the buffer at
--- 312,332 
  	 */
  	smgrcreate(smgr, forknum, true);
  
! 	if (mode == RBM_NORMAL)
  	{
  		/* hm, page doesn't exist in file */
! 		if (blkno = smgrnblocks(smgr, forknum))
  		{
  			log_invalid_page(rnode, forknum, blkno, false);
  			return InvalidBuffer;
  		}
  
! 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
! 		   mode, NULL);
  
+ 		/* check that page has been initialized */
+ 		page = (Page) BufferGetPage(buffer);
+ 		
  		/*
  		 * We assume that PageIsNew is safe without a lock. During recovery,
  		 * there should be no other backends that could modify the buffer at
***
*** 359,366  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
  			log_invalid_page(rnode, forknum, blkno, true);
  			return InvalidBuffer;
  		}
- 	}
  
  	return buffer;
  }
  
--- 338,364 
  			log_invalid_page(rnode, forknum, blkno, true);
  			return InvalidBuffer;
  		}
  
+ 	} else {
+ 
+ 		/* If page doesn't exist extend the file */
+ 		while (blkno = smgrnblocks(smgr, forknum)) 
+ 		{
+ 			/* we do this in recovery only - no rel-extension lock needed */
+ 			Assert(InRecovery);
+ 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
+ 			   P_NEW, mode, NULL);
+ 			/* Take care not to extend the relation past where it's needed */
+ 			if (BufferGetBlockNumber(buffer) == blkno) 
+ return buffer;
+ 			ReleaseBuffer(buffer);
+ 		}
+ 
+ 		/* page now exists in file */
+ 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
+ 		   mode, NULL);
+ 	}
+ 	
  	return buffer;
  }
  

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 So here's my attempt to rewrite this logic. I ended up refactoring a
 bit because I found it unnecessarily confusing having the mode
 branches in several places. I think it's much clearer just having two
 separate pieces of logic for RBM_NEW and the extension cases since all
 they have in common is the ReadBuffer call.

I don't like that at all.  It's a lot of unnecessary churn in what is
indeed hard-to-test code, and personally I don't find it clearer.
Nor, if you're going to complain about the cost of smgrnblocks, does
it seem like a great idea to be doing that *twice* per page rather
than once.

How about the attached instead?

regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4cd82df..f1918f6 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*** XLogReadBufferExtended(RelFileNode rnode
*** 338,352 
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
  		buffer = InvalidBuffer;
! 		while (blkno = lastblock)
  		{
  			if (buffer != InvalidBuffer)
  ReleaseBuffer(buffer);
  			buffer = ReadBufferWithoutRelcache(rnode, forknum,
  			   P_NEW, mode, NULL);
- 			lastblock++;
  		}
! 		Assert(BufferGetBlockNumber(buffer) == blkno);
  	}
  
  	if (mode == RBM_NORMAL)
--- 338,358 
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
  		buffer = InvalidBuffer;
! 		do
  		{
  			if (buffer != InvalidBuffer)
  ReleaseBuffer(buffer);
  			buffer = ReadBufferWithoutRelcache(rnode, forknum,
  			   P_NEW, mode, NULL);
  		}
! 		while (BufferGetBlockNumber(buffer)  blkno);
! 		/* Handle the corner case that P_NEW returns non-consecutive pages */
! 		if (BufferGetBlockNumber(buffer) != blkno)
! 		{
! 			ReleaseBuffer(buffer);
! 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
! 			   mode, NULL);
! 		}
  	}
  
  	if (mode == RBM_NORMAL)

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 So I think I've come up with a scenario that could cause this. I don't
 think it's exactly what happened here but maybe something analogous
 happened with our base backup restore.

I agree it seems like a good idea for XLogReadBufferExtended to defend
itself against successive P_NEW calls possibly not returning consecutive
pages.  However, unless you had an operating-system-level crash on the
master, this isn't sufficient to explain the problem.  We'd need also a
plausible theory about how a base backup could've left us with short
segments in a large relation.

 (Or maybe the hot backup
 process could just catch the files in this state if a table is rapidly
 growing and it doesn't take care to avoid picking up new files that
 appear after it starts?)

That's a possible explanation I guess, but it doesn't seem terribly
probable from a timing standpoint.  Also, you should be able to gauge
the probability of this theory from knowledge of the application ---
are the bloated relations all ones that would've been growing *very*
rapidly during the base backup?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Greg Stark
On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 How about the attached instead?

This does possibly allocate an extra block past the target block. I'm
not sure how surprising that would be for the rest of the code.

For what it's worth I've confirmed the bug in wal-e caused the initial
problem. But I think it's possible to occur without that bug so I
think it still needs to be addressed.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
I wrote:
 Greg Stark st...@mit.edu writes:
 (Or maybe the hot backup
 process could just catch the files in this state if a table is rapidly
 growing and it doesn't take care to avoid picking up new files that
 appear after it starts?)

 That's a possible explanation I guess, but it doesn't seem terribly
 probable from a timing standpoint.

I did a bit of arithmetic using the cases you posted previously.
In the first case, where block 3634978 got written to 7141472,
you can make the numbers come out right if you assume that a page
was missing at the end of segment 1 --- that leads to the conclusion
that EOF exclusive of that missing page had been around 28.75 GB,
which squares well with the relation's size on master.  However, it's
fairly hard to credit that the base backup would have collected
full-size or nearly full-size images of segments 2 through 28 while
not seeing segment 1 at full size.  You'd have to assume that the
rel grew by a factor of ~14 while the base backup was in progress
--- and then didn't grow very much more afterwards.  (What state
exactly did you measure the primary rel sizes in?  Was it long
after the backup/restore, or did you rewind things somehow?)

The other examples seem to fit the theory a bit better, but this
one is hard to explain this way.

The other big problem for this theory is that you said in
http://www.postgresql.org/message-id/cam-w4hpvjcbrvv3dxg8aj0wzku08dhux-xybfdyqhnrn5bn...@mail.gmail.com

 What's worse is we created a new standby from the same base backup and
 replayed the same records and it didn't reproduce the problem.

If this were the explanation, it oughta be reproducible that way.

I still agree that XLogReadBufferExtended shouldn't be assuming that P_NEW
will not skip pages.  But I think we have another bug in here somewhere.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 How about the attached instead?

 This does possibly allocate an extra block past the target block. I'm
 not sure how surprising that would be for the rest of the code.

Should be fine; we could end up with an extra block after a failed
extension operation in any case.

 For what it's worth I've confirmed the bug in wal-e caused the initial
 problem.

Huh?  Bug in wal-e?  What bug?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Greg Stark
On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 On Wed, Feb 12, 2014 at 5:29 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 How about the attached instead?

 This does possibly allocate an extra block past the target block. I'm
 not sure how surprising that would be for the rest of the code.

 Should be fine; we could end up with an extra block after a failed
 extension operation in any case.

I know it's fine on the active database, I'm not so clear whether it's
compatible with the xlog records from the primary. I suppose it'll
just see an Initialize Page record and happily see the nul block and
initialize it. It's still a bit scary.

I of course think my code is vastly clearer than the existing code but
yes, I see the unnecessary churn argument. I think that's the
fundamental problem with lacking tests, it makes the code get more and
more complex as we're reluctant to simplify it out of fear.

 For what it's worth I've confirmed the bug in wal-e caused the initial
 problem.

 Huh?  Bug in wal-e?  What bug?

WAL-E actually didn't restore a whole 1GB file due to a transient S3
problem, in fact a bunch of them. It's remarkable that Postgres kept
going with that much data missing. But the arithmetic worked out on
the case I checked it on, which was the last one that I just sent the
xlog record for last night. In that case there was precisely one
segment missing and the relation was extended by the number of
segments you would expect if it filled in that missing segment and
then jumped to the end of the relation.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Wed, Feb 12, 2014 at 6:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 This does possibly allocate an extra block past the target block. I'm
 not sure how surprising that would be for the rest of the code.

 Should be fine; we could end up with an extra block after a failed
 extension operation in any case.

 I know it's fine on the active database, I'm not so clear whether it's
 compatible with the xlog records from the primary. I suppose it'll
 just see an Initialize Page record and happily see the nul block and
 initialize it. It's still a bit scary.

Well, we can easily find uninitialized extra pages on the primary too,
so if WAL replay were unable to cope with that, it would be a bug
regardless.

 Huh?  Bug in wal-e?  What bug?

 WAL-E actually didn't restore a whole 1GB file due to a transient S3
 problem, in fact a bunch of them.

Hah.  Okay, I think we can write this issue off as closed then.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
I wrote:
 Greg Stark st...@mit.edu writes:
 WAL-E actually didn't restore a whole 1GB file due to a transient S3
 problem, in fact a bunch of them.

 Hah.  Okay, I think we can write this issue off as closed then.

Oh, wait a minute.  It's not just a matter of whether we find the right
block: we also have to consider whether XLogReadBufferExtended will
apply the right mode behavior.  Currently, it supposes that all pages
past the initially observed EOF should be assumed to be uninitialized;
but if we're working with an inconsistent database, that seems like
an unsafe assumption.  It might be that a page is there but we've not
(yet) fixed the length of some preceding segment.  If we want to not
get bogus WAL contains references to invalid pages failures in such
scenarios, it seems like we need a more invasive change than what
I just committed.  I think your patch didn't cover this consideration
either.

What I think we probably want to do is forcibly cause the target page
to exist, using a P_NEW loop like what I committed, and then decide
on the basis of whether it's all-zeroes whether to consider it invalid
or not.  This seems sane on the grounds that it's just the extension
to the page level of the existing policy of creating the file whether
it existed or not.  It could only result in a large amount of wasted
work if the passed-in target block is insane --- but since we got it
out of a CRC-checked WAL record, I think it's safe to not worry too
much about that.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
I wrote:
 What I think we probably want to do is forcibly cause the target page
 to exist, using a P_NEW loop like what I committed, and then decide
 on the basis of whether it's all-zeroes whether to consider it invalid
 or not.  This seems sane on the grounds that it's just the extension
 to the page level of the existing policy of creating the file whether
 it existed or not.  It could only result in a large amount of wasted
 work if the passed-in target block is insane --- but since we got it
 out of a CRC-checked WAL record, I think it's safe to not worry too
 much about that.

Like the attached.  A possible complaint is that if the WAL sequence
contains updates against large relations that are later dropped,
this will waste time and disk space replaying those updates as best
it can.  Doesn't seem like that's a case to optimize for, however.

regards, tom lane

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index f1918f6..7a820c0 100644
*** a/src/backend/access/transam/xlogutils.c
--- b/src/backend/access/transam/xlogutils.c
*** XLogReadBuffer(RelFileNode rnode, BlockN
*** 277,297 
   * XLogReadBufferExtended
   *		Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended. There's some
!  * differences in the behavior wrt. the mode argument:
   *
!  * In RBM_NORMAL mode, if the page doesn't exist, or contains all-zeroes, we
!  * return InvalidBuffer. In this case the caller should silently skip the
!  * update on this page. (In this situation, we expect that the page was later
!  * dropped or truncated. If we don't see evidence of that later in the WAL
!  * sequence, we'll complain at the end of WAL replay.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
   *
!  * In RBM_NORMAL_NO_LOG mode, we return InvalidBuffer if the page doesn't
!  * exist, and we don't check for all-zeroes.  Thus, no log entry is made
!  * to imply that the page should be dropped or truncated later.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
--- 277,307 
   * XLogReadBufferExtended
   *		Read a page during XLOG replay
   *
!  * This is functionally comparable to ReadBufferExtended, except that we
!  * are willing to create the target page (and indeed the whole relation)
!  * if it doesn't currently exist.  This allows safe replay of WAL sequences
!  * in which a relation was later dropped or truncated.
   *
!  * The mode argument provides some control over this behavior.  (See also
!  * ReadBufferExtended's specification of what the modes do.)
   *
   * In RBM_ZERO and RBM_ZERO_ON_ERROR modes, if the page doesn't exist, the
   * relation is extended with all-zeroes pages up to the given block number.
+  * These modes should be used if the caller is going to initialize the page
+  * contents from scratch, and doesn't need it to be valid already.
   *
!  * In RBM_NORMAL mode, we similarly create the page if needed, but if the
!  * page contains all zeroes (including the case where we just created it),
!  * we return InvalidBuffer.  Then the caller should silently skip the update
!  * on this page.  This mode should be used for incremental updates where the
!  * caller needs to see a valid page.  (In this case, we expect that the page
!  * later gets dropped or truncated. If we don't see evidence of that later in
!  * the WAL sequence, we'll complain at the end of WAL replay.)
!  *
!  * RBM_NORMAL_NO_LOG mode is like RBM_NORMAL except that we will return an
!  * all-zeroes page, and not log it as one that ought to get dropped later.
!  * This mode is for when the caller wants to read a page that might validly
!  * contain zeroes.
   */
  Buffer
  XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
*** XLogReadBufferExtended(RelFileNode rnode
*** 299,304 
--- 309,315 
  {
  	BlockNumber lastblock;
  	Buffer		buffer;
+ 	bool		present;
  	SMgrRelation smgr;
  
  	Assert(blkno != P_NEW);
*** XLogReadBufferExtended(RelFileNode rnode
*** 316,342 
  	 */
  	smgrcreate(smgr, forknum, true);
  
  	lastblock = smgrnblocks(smgr, forknum);
  
  	if (blkno  lastblock)
  	{
  		/* page exists in file */
  		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
  		   mode, NULL);
  	}
  	else
  	{
! 		/* hm, page doesn't exist in file */
! 		if (mode == RBM_NORMAL)
! 		{
! 			log_invalid_page(rnode, forknum, blkno, false);
! 			return InvalidBuffer;
! 		}
! 		if (mode == RBM_NORMAL_NO_LOG)
! 			return InvalidBuffer;
! 		/* OK to extend the file */
  		/* we do this in recovery only - no rel-extension lock needed */
  		Assert(InRecovery);
  		buffer = InvalidBuffer;
  		do
  		{
--- 327,357 
  	 */
  	smgrcreate(smgr, forknum, true);
  
+ 	/*
+ 	 * On the same principle, if the page doesn't already exist in the 

Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Greg Stark
On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Oh, wait a minute.  It's not just a matter of whether we find the right
 block: we also have to consider whether XLogReadBufferExtended will
 apply the right mode behavior.  Currently, it supposes that all pages
 past the initially observed EOF should be assumed to be uninitialized;
 but if we're working with an inconsistent database, that seems like
 an unsafe assumption.  It might be that a page is there but we've not
 (yet) fixed the length of some preceding segment.  If we want to not
 get bogus WAL contains references to invalid pages failures in such
 scenarios, it seems like we need a more invasive change than what
 I just committed.  I think your patch didn't cover this consideration
 either.

Hm. I *think* those cases would be handled anyways since the table
would later be truncated. Arguably any reference after the short
segment is a reference to an invalid page since it means it's a
record which predates the records which caused the extension.
Obviously it would still give the error in the case we had where files
were missing but then probably it should give errors in that case.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-12 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Wed, Feb 12, 2014 at 8:28 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Oh, wait a minute.  It's not just a matter of whether we find the right
 block: we also have to consider whether XLogReadBufferExtended will
 apply the right mode behavior.  Currently, it supposes that all pages
 past the initially observed EOF should be assumed to be uninitialized;
 but if we're working with an inconsistent database, that seems like
 an unsafe assumption.  It might be that a page is there but we've not
 (yet) fixed the length of some preceding segment.  If we want to not
 get bogus WAL contains references to invalid pages failures in such
 scenarios, it seems like we need a more invasive change than what
 I just committed.  I think your patch didn't cover this consideration
 either.

 Hm. I *think* those cases would be handled anyways since the table
 would later be truncated. Arguably any reference after the short
 segment is a reference to an invalid page since it means it's a
 record which predates the records which caused the extension.

Well, that would be the case if you assume perfectly sequential filesystem
behavior, but I'm not sure the assumption holds if the starting condition
is a base backup.  We could be looking at a version of segment 1 that
predates segment 2's existence, and yet see some data in segment 2 as
well, because it's not a perfectly coherent snapshot.

I think what you're arguing is that we should see WAL records filling the
rest of segment 1 before we see any references to segment 2, but if that's
the case then how did we get into the situation you reported?  Or is it
just that it was a broken base backup to start with?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-11 Thread Greg Stark
On Sun, Feb 9, 2014 at 2:54 PM, Greg Stark st...@mit.edu wrote:

 Bad block's page header -- this is in the 56'th relation segment:

 =# select 
 (page_header(E'\\x2005583b05aa050028001805002004201098e00f2090e00f088d24061885e00f')).*;
  lsn  | tli | flags | lower | upper | special | pagesize |
 version | prune_xid
 --+-+---+---+---+-+--+-+---
  520/AA053B58 |   5 | 0 |40 |  1304 |8192 | 8192 |
   4 | 0


 Looking at the block at offset 4773121 (which is in the 36th segment):
...
 d9de7pcqls4ib6=# select
 (page_header(E'\\x2005a00a0bad05002c00a002002004201098e00f2090e00f088d24061885e00fa082ec04')).*;
  lsn  | tli | flags | lower | upper | special | pagesize |
 version | prune_xid
 --+-+---+---+---+-+--+-+---
  520/AD0B0AA0 |   5 | 0 |44 |   672 |8192 | 8192 |
   4 | 0
 (1 row)

And I finally tracked down the xlog record for this stray write:

[cur:520/AA051FE0, xid:7635428, rmid:10(Heap), len/tot_len:21/7005,
info:8, prev:520/AA051F20] insert: s/d/r:1663/16385/16650
blk/off:4773121/4 header: none
[cur:520/AA051FE0, xid:7635428, rmid:10(Heap), len/tot_len:21/7005,
info:8, prev:520/AA051F20] bkpblock[1]: s/d/r:1663/16385/16650
blk:4773121 hole_off/len:40/1264



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-09 Thread Greg Stark
On Thu, Feb 6, 2014 at 11:41 PM, Greg Stark st...@mit.edu wrote:

 That doesn't explain the other instance or the other copies of this
 database. I think the most productive thing I can do is switch my
 attention to the other database to see if it really looks like the
 same problem.

So here's an instance in the other database, this one is on a
different continent from the first one so it's definitely a different
physical machine. I've had to copy the blocks over to another machine
because the database is down and still in standby mode anyways. I
don't have the xlog file yet.

Bad block's page header -- this is in the 56'th relation segment:

=# select 
(page_header(E'\\x2005583b05aa050028001805002004201098e00f2090e00f088d24061885e00f')).*;
 lsn  | tli | flags | lower | upper | special | pagesize |
version | prune_xid
--+-+---+---+---+-+--+-+---
 520/AA053B58 |   5 | 0 |40 |  1304 |8192 | 8192 |
  4 | 0
(1 row)

=# select 
(heap_page_items(E'\\x2005583b05aa050028001805002004201098e00f2090e00f088d24061885e00f000')).*;
 lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax | t_field3 |
t_ctid| t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
++--++-++--+-+-++++---
  1 |   6160 |1 |   2032 | 7635393 |  0 |0 |
(4773121,1) |   3 |   2306 | 24 ||
  2 |   4128 |1 |   2032 | 7635393 |  0 |0 |
(4773121,2) |   3 |   2306 | 24 ||
  3 |   3336 |1 |786 | 7635393 |  0 |0 |
(4773121,3) |   3 |   2306 | 24 ||
  4 |   1304 |1 |   2032 | 7635428 |  0 |0 |
(4773121,4) |   3 |   2306 | 24 ||
(4 rows)

Looking at the block at offset 4773121 (which is in the 36th segment):

=# select 
(heap_page_items(E'\\x2005a00a0bad05002c00a002002004201098e00f2090e00f088d24061885e00fa082ec04000.')).*;
 lp | lp_off | lp_flags | lp_len | t_xmin  | t_xmax | t_field3 |
t_ctid| t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
++--++-++--+-+-++++---
  1 |   6160 |1 |   2032 | 7635393 |  0 |0 |
(4773121,1) |   3 |   2306 | 24 ||
  2 |   4128 |1 |   2032 | 7635393 |  0 |0 |
(4773121,2) |   3 |   2306 | 24 ||
  3 |   3336 |1 |786 | 7635393 |  0 |0 |
(4773121,3) |   3 |   2306 | 24 ||
  4 |   1304 |1 |   2032 | 7635428 |  0 |0 |
(4773121,4) |   3 |   2306 | 24 ||
  5 |672 |1 |630 | 7635580 |  0 |0 |
(4773121,5) |   3 |   2306 | 24 ||
(5 rows)

d9de7pcqls4ib6=# select
(page_header(E'\\x2005a00a0bad05002c00a002002004201098e00f2090e00f088d24061885e00fa082ec04')).*;
 lsn  | tli | flags | lower | upper | special | pagesize |
version | prune_xid
--+-+---+---+---+-+--+-+---
 520/AD0B0AA0 |   5 | 0 |44 |   672 |8192 | 8192 |
  4 | 0
(1 row)



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-07 Thread Andres Freund
On 2014-02-06 20:06:03 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  That reminds me, not that I directly see how it could be responsible,
  there's still 20131029011623.gj20...@awork2.anarazel.de ff. around. I
  don't think we came to a agreement in that thread how to fix the
  problem.
 
 Hm, yeah.  I'm not sure I believe Heikki's argument that we need to avoid
 closing the smgr relation.  If that stuff is being used in any
 performance-critical paths then we've got trouble already.

Me neither. And I still am hesitant to start doing an unconditional
smgropen(rnode, InvalidBackendId), but maybe that's also misplaced. My
gut feeling would either to go with the very simple closing the smgr
(which was the case before the current form of the fake relcache
infrastructure!) or add support of unowning the smgr rel (as in
20131105193632.gd14...@awork2.anarazel.de).

After being slightly more awake it's even harder to see how it could be
the cause for this thread's problem. True, it's a rogue write into
palloc()ed memory that's used by somebody else, but afaics it will only
ever write a NULL. While not impossible it seems a bit of a stretch how
that could cause the symptoms.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Greg Stark
On Mon, Feb 3, 2014 at 12:02 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 What version were you running before 9.1.11 exactly?  I took a look
 through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
 changes that seemed even vaguely related to this.  There are some
 changes in known-transaction tracking, but it's hard to see a connection
 there.  Most of the other diffs are in code that wouldn't execute during
 WAL replay at all.


Both the primary and the standby were 9.1.11 from the get-go. The
database the primary was forked off of was 9.1.10 but as far as I can
tell the primary in the current pair has no problems.

What's worse is we created a new standby from the same base backup and
replayed the same records and it didn't reproduce the problem. This
means either it's a hardware problem -- but we've seen it on multiple
standbys on this database and at least one other database which is in
a different data centre -- or it's a race condition --but that's hard
to credit in the recovery code which is basically single-threaded.

And these records are from before the standby reaches a consistency so
it's hard to see how a connection from a hot standby client could
cause any kind of race condition. The only other thread that could
conceivably cause a heisenbug is the bgwriter. It's hard to imagine
how a race condition in there could be so easy to hit that it would
happen four times on one restore but otherwise go mostly unnoticed.
-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 Both the primary and the standby were 9.1.11 from the get-go. The
 database the primary was forked off of was 9.1.10 but as far as I can
 tell the primary in the current pair has no problems.

 What's worse is we created a new standby from the same base backup and
 replayed the same records and it didn't reproduce the problem. This
 means either it's a hardware problem -- but we've seen it on multiple
 standbys on this database and at least one other database which is in
 a different data centre -- or it's a race condition --but that's hard
 to credit in the recovery code which is basically single-threaded.

 And these records are from before the standby reaches a consistency so
 it's hard to see how a connection from a hot standby client could
 cause any kind of race condition. The only other thread that could
 conceivably cause a heisenbug is the bgwriter. It's hard to imagine
 how a race condition in there could be so easy to hit that it would
 happen four times on one restore but otherwise go mostly unnoticed.

I had noticed that the WAL records that were mis-replayed seemed to
be bunched pretty close together (two of them even adjacent).  Could
you confirm that?  If so, it seems like we're looking for some condition
that makes mis-replay fairly probable for a period of time, but in
itself might be quite improbable.  Not that that helps much at
nailing it down.

You might well be on to something with the bgwriter idea, considering
that none of the WAL replay code was originally written with any
concurrent execution in mind.  We might've missed some place where
additional locking is needed.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Greg Stark
On Thu, Feb 6, 2014 at 10:48 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 I had noticed that the WAL records that were mis-replayed seemed to
 be bunched pretty close together (two of them even adjacent).  Could
 you confirm that?  If so, it seems like we're looking for some condition
 that makes mis-replay fairly probable for a period of time, but in
 itself might be quite improbable.  Not that that helps much at
 nailing it down.

Well one thing that argues for is hardware problems. It could be that
the right memory mapping just happened to line up with that variable
on the stack for that short time and then it was mapped to something
else entirely. Or the machine was overheating for a short time and
then the temperature became more reasonable. Or the person with the
x-ray source walked by in that short time window.

That doesn't explain the other instance or the other copies of this
database. I think the most productive thing I can do is switch my
attention to the other database to see if it really looks like the
same problem.

 You might well be on to something with the bgwriter idea, considering
 that none of the WAL replay code was originally written with any
 concurrent execution in mind.  We might've missed some place where
 additional locking is needed.

Except that the bgwriter has been in there for a few years already.
Unless there's been some other change, possibly involving copying some
code that was safe in some context but not where it was copied to.

The problem with the bgwriter being at fault is that from what I can
see the bgwriter will never extend a file. That means the xlog
recovery code must have done it. That means even if the bgwriter came
along and looked at the buffer we just read in it would already be too
late to cause mischief. The xlog code extends the file *first* then
reads in the backup block into a buffer. I can't see how it could
corrupt the stack or the wal recovery buffer in the window between
reading in the wal buffer and deciding to extend the relation.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Andres Freund
On 2014-02-06 23:41:19 +0100, Greg Stark wrote:
 The problem with the bgwriter being at fault is that from what I can
 see the bgwriter will never extend a file. That means the xlog
 recovery code must have done it. That means even if the bgwriter came
 along and looked at the buffer we just read in it would already be too
 late to cause mischief. The xlog code extends the file *first* then
 reads in the backup block into a buffer. I can't see how it could
 corrupt the stack or the wal recovery buffer in the window between
 reading in the wal buffer and deciding to extend the relation.

That's not necessarily true. If e.g. the buffer mapping would change
racily, the result write from the bgwriter could very well end up
increasing the file size, leaving a hole inbetween its write and the
original size.
Are there any other relations that are as big as the corrupted relations
got extended to?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Greg Stark
On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund and...@2ndquadrant.com wrote:

 That's not necessarily true. If e.g. the buffer mapping would change
 racily, the result write from the bgwriter could very well end up
 increasing the file size, leaving a hole inbetween its write and the
 original size.

a) the segment isn't sparse and b) there were whole segments full of
nuls between the end of the tables and the final blocks.

So the file was definitely extended by Postgres, not the OS and the
bgwriter passes EXTENSION_FAIL which means it wouldn't create those
intervening segments.

 Are there any other relations that are as big as the corrupted relations
 got extended to?

I was wondering the same thing. But no, the extended file is 39GB
larger than the next largest relation.

Also, the final block in the first relation is clearly a version of
the same block from the correct location. Look at the ctids and the
xids on the rows for lp 3 in the attached file for example. That
second copy is from the location in the xlog record.


-- 
greg
=# select * from (select relfilenode,(pg_relation_size(oid)/8192)::integer-1 as 
blockno,((pg_relation_size(oid)/8192)::integer-1)/1024/128 as 
segnum,((pg_relation_size(oid)/8192)::integer-1)%(1024*1024*128) as offset, 
(page_header(get_raw_page(oid::regclass::text,'main',(pg_relation_size(oid)/8192)::integer-1))).*
 from pg_class where pg_relation_size(oid)  4*8192 and 
(page_header(get_raw_page(oid::regclass::text,'main',(pg_relation_size(oid)/8192)::integer-2))).upper=0
 order by lsn) as x ;
 relfilenode | blockno  | segnum |  offset  |lsn | tli | flags | lower 
| upper | special | pagesize | version | prune_xid  
-+--++--++-+---+---+---+-+--+-+
  473158 | 18090059 |138 | 18090059 | EA1/625E28 |   6 | 0 |   144 
|   896 |8192 | 8192 |   4 | 1401029863
 1366221 |  2208159 | 16 |  2208159 | EA1/62DDF0 |   6 | 0 |  1180 
|  3552 |8176 | 8192 |   4 |  0
 1261982 |  7141472 | 54 |  7141472 | EA1/638988 |   6 | 0 |  1240 
|  3312 |8176 | 8192 |   4 |  0
 1364767 |  3631161 | 27 |  3631161 | EA1/63A0A8 |   6 | 0 |  1180 
|  3552 |8176 | 8192 |   4 |  0
 1519286 |   311808 |  2 |   311808 | EA1/708B08 |   6 | 1 |   312 
|  3040 |8192 | 8192 |   4 |  0
(5 rows)

=# select 
(heap_page_items(get_raw_page(16531::regclass::text,'main',18090059))).*;
 lp | lp_off | lp_flags | lp_len |   t_xmin   |   t_xmax   | t_field3 |
t_ctid| t_infomask2 | t_infomask | t_hoff | 
 t_bits  | t_oid 
++--++++--+--+-+++--+---
  1 |   7024 |1 |328 | 1358232426 | 1401029863 |1 | 
(9386399,3)  |   49163 |   9475 | 32 | 
1010 |  
  2 |  1 |2 |  0 |||  | 
 | |||  
|  
  3 |   6560 |1 |464 | 1401029863 |  0 |1 | 
(9386399,3)  |   32779 |  10499 | 32 | 
1010 |  
  4 |  5 |2 |  0 |||  | 
 | |||  
|  
  5 |   8008 |1 |184 |  2 |  0 |0 | 
(9386399,5)  |   32779 |  10499 | 32 | 
1010 |  
  6 |   6416 |1 |144 | 1418089052 | 1418089053 |0 | 
(9386399,7)  |   16395 |259 | 32 | 
1010 |  
  7 |   6272 |1 |144 | 1418089053 |  0 |0 | 
(9386399,7)  |   32779 |  10243 | 32 | 
1010 |  
  8 |   6128 |1 |144 | 1418089056 | 1418089060 |0 | 
(9386399,11) |   16395 |   1283 | 32 | 
1010 |  
  9 | 10 |2 |  0 |||  | 
 | |||  
|  
 10 |   7864 |1 |144 |  2 |  0 |0 | 
(9386399,10) |   32779 |  10499 | 32 | 

Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund and...@2ndquadrant.com wrote:
 That's not necessarily true. If e.g. the buffer mapping would change
 racily, the result write from the bgwriter could very well end up
 increasing the file size, leaving a hole inbetween its write and the
 original size.

 a) the segment isn't sparse and b) there were whole segments full of
 nuls between the end of the tables and the final blocks.

 So the file was definitely extended by Postgres, not the OS and the
 bgwriter passes EXTENSION_FAIL which means it wouldn't create those
 intervening segments.

But ... when InRecovery, md.c will create such segments too.  We had
dismissed that on the grounds that the files would be sparse because
of the way md.c creates them.  However, it is real damn hard to see
how the loop in XLogReadBufferExtended could've accessed a bogus block,
other than hardware misfeasance which I don't believe any more than
you do.  The blkno that's passed to that function came directly out
of a WAL record that's in the private memory of the startup process
and recently passed a CRC check.  You'd have to believe some sort
of asynchronous memory clobber inside the startup process.

On the other hand, if _mdfd_getseg did the deed, there's a whole lot
more space for something funny to have happened, because now we're
talking about a buffer being written in preparation for eviction
from shared buffers, long after WAL replay filled it.

So I'm wondering if there's something wrong with our deduction from
file non-sparseness.  In this connection, google quickly found me
a report of XFS losing the sparse state of a file across multiple
writes:
http://oss.sgi.com/archives/xfs/2011-06/msg00225.html
I wonder whether that bug or a similar one exists in your production
kernel.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Andres Freund
On 2014-02-06 18:42:05 -0500, Tom Lane wrote:
 Greg Stark st...@mit.edu writes:
  On Thu, Feb 6, 2014 at 11:48 PM, Andres Freund and...@2ndquadrant.com 
  wrote:
  That's not necessarily true. If e.g. the buffer mapping would change
  racily, the result write from the bgwriter could very well end up
  increasing the file size, leaving a hole inbetween its write and the
  original size.
 
  a) the segment isn't sparse and b) there were whole segments full of
  nuls between the end of the tables and the final blocks.
 
  So the file was definitely extended by Postgres, not the OS and the
  bgwriter passes EXTENSION_FAIL which means it wouldn't create those
  intervening segments.
 
 But ... when InRecovery, md.c will create such segments too.  We had
 dismissed that on the grounds that the files would be sparse because
 of the way md.c creates them.  However, it is real damn hard to see
 how the loop in XLogReadBufferExtended could've accessed a bogus block,
 other than hardware misfeasance which I don't believe any more than
 you do.  The blkno that's passed to that function came directly out
 of a WAL record that's in the private memory of the startup process
 and recently passed a CRC check.  You'd have to believe some sort
 of asynchronous memory clobber inside the startup process.

That reminds me, not that I directly see how it could be responsible,
there's still 20131029011623.gj20...@awork2.anarazel.de ff. around. I
don't think we came to a agreement in that thread how to fix the
problem.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-06 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 That reminds me, not that I directly see how it could be responsible,
 there's still 20131029011623.gj20...@awork2.anarazel.de ff. around. I
 don't think we came to a agreement in that thread how to fix the
 problem.

Hm, yeah.  I'm not sure I believe Heikki's argument that we need to avoid
closing the smgr relation.  If that stuff is being used in any
performance-critical paths then we've got trouble already.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Greg Stark
I've poked at this a bit more. There are at least 10 relations where
the last block doesn't match the block mentioned in the xlog record
that its LSN indicates. At least it looks like from the info xlogdump
prints.

Including two blocks where the correct block has the same LSN which
maybe means the record was correctlly replayed the second time. Or
perhaps it just means I'm underestimating the complexity of btree
splits.

[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] split_r:
s/d/r:1663/16385/1521566 leftsib 1697585
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[1]:
s/d/r:1663/16385/1521566 blk:1697585 hole_off/len:576/4288
[cur:EAD/511424E0, xid:1423308342, rmid:11(Btree),
len/tot_len:3750/12786, info:77, prev:EAD/51142490] bkpblock[2]:
s/d/r:1663/16385/1521566 blk:786380 hole_off/len:740/3140

 relfilenode | blockno  | lsn  | tli | flags | lower | upper |
special | pagesize | version | prune_xid
-+--+--+-+---+---+---+-+--+-+
 1521566 |  1697585 | EAD/511456E8 |   6 | 0 |   576 |  4864 |
   8176 | 8192 |   4 |  0
 1521566 |  1704143 | EAD/511456E8 |   6 | 0 |   644 |  4456 |
   8176 | 8192 |   4 |  0

[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree),
len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] split_r:
s/d/r:1663/16385/4995658 leftsib 139569
[cur:EAD/520F0EE0, xid:1423309260, rmid:11(Btree),
len/tot_len:4230/4450, info:69, prev:EAD/520F0E98] bkpblock[2]:
s/d/r:1663/16385/4995658 blk:18152 hole_off/len:28/8028
 relfilenode | blockno  | lsn  | tli | flags | lower | upper |
special | pagesize | version | prune_xid
-+--+--+-+---+---+---+-+--+-+
 4995658 |   139569 | EAD/520F2058 |   6 | 0 |   152 |  4336 |
   8176 | 8192 |   4 |  0
 4995658 |   139584 | EAD/520F2058 |   6 | 0 |   164 |  3976 |
   8176 | 8192 |   4 |  0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Greg Stark
Hm, I'm not entirely convinced those are erroneous replays to wrong
blocks. They don't look right but there are no blocks of NULs
preceding them. So if they're wrong then they only extended the
relations by a single block.

The relfilenodes that have nul blocks before the last block are:

 relfilenode | blockno  | segnum |  offset  | lsn | tli | flags |
lower | upper | special | pagesize | version | prune_xid
-+--++--+-+-+---+---+---+-+--+-+---
 1261982 |  7141472 | 54 |  7141472 | 0/0 |   0 | 0 |
0 | 0 |   0 |0 |   0 | 0
  473158 | 18090059 |138 | 18090059 | 0/0 |   0 | 0 |
0 | 0 |   0 |0 |   0 | 0
 1366221 |  2208159 | 16 |  2208159 | 0/0 |   0 | 0 |
0 | 0 |   0 |0 |   0 | 0
 1364767 |  3631161 | 27 |  3631161 | 0/0 |   0 | 0 |
0 | 0 |   0 |0 |   0 | 0
 1519286 |   311808 |  2 |   311808 | 0/0 |   0 | 0 |
0 | 0 |   0 |0 |   0 | 0


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 The relfilenodes that have nul blocks before the last block are:

Can we see the associated WAL records (ie, the ones matching the LSNs
in the last blocks of these files)?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Greg Stark
On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Greg Stark st...@mit.edu writes:
 The relfilenodes that have nul blocks before the last block are:

 Can we see the associated WAL records (ie, the ones matching the LSNs
 in the last blocks of these files)?

Sorry, I've lost track of what information I already shared or didn't,
i've been staring at these records all day. It would be so much easier
if xlogdump was a fdw or set returning function so I could do joins
with pageinspect data:

This is the information from the final block headers:

 relfilenode | blockno  | segnum |  offset  |lsn | tli | flags
| lower | upper | special | pagesize | version | prune_xid
-+--++--++-+---+---+---+-+--+-+
  473158 | 18090059 |138 | 18090059 | EA1/625E28 |   6 | 0
|   144 |   896 |8192 | 8192 |   4 | 1401029863
 1366221 |  2208159 | 16 |  2208159 | EA1/62DDF0 |   6 | 0
|  1180 |  3552 |8176 | 8192 |   4 |  0
 1261982 |  7141472 | 54 |  7141472 | EA1/638988 |   6 | 0
|  1240 |  3312 |8176 | 8192 |   4 |  0
 1364767 |  3631161 | 27 |  3631161 | EA1/63A0A8 |   6 | 0
|  1180 |  3552 |8176 | 8192 |   4 |  0
 1519286 |   311808 |  2 |   311808 | EA1/708B08 |   6 | 1
|   312 |  3040 |8192 | 8192 |   4 |  0

These are all the records from the tx that corresponds to the lsn in
the first relfilenode:

[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524,
info:72, prev:EA1/624080] hot_update: s/d/r:1663/16385/473158 block
9386399 off 29 to block 9386399 off 30
[cur:EA1/6240C0, xid:1418089146, rmid:10(Heap), len/tot_len:28/7524,
info:72, prev:EA1/624080] bkpblock[1]: s/d/r:1663/16385/473158
blk:9386399 hole_off/len:144/752
[cur:EA1/625E28, xid:1418089146, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/6240C0] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC

The middle three blocks have LSNs that are all part of the same tx,
here are all the records for this tx:

[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441,
info:8, prev:EA1/625E68] insert: s/d/r:1663/16385/473158
blk/off:9386398/33 header: none
[cur:EA1/625F28, xid:1418089147, rmid:10(Heap), len/tot_len:21/6441,
info:8, prev:EA1/625E68] bkpblock[1]: s/d/r:1663/16385/473158
blk:9386398 hole_off/len:156/1828
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214,
info:8, prev:EA1/625F28] insert_leaf: s/d/r:1663/16385/473176 tid
1319804/405
[cur:EA1/627868, xid:1418089147, rmid:11(Btree), len/tot_len:18/8214,
info:8, prev:EA1/625F28] bkpblock[1]: s/d/r:1663/16385/473176
blk:1319804 hole_off/len:1644/52
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494,
info:8, prev:EA1/627868] insert_leaf: s/d/r:1663/16385/473182 tid
1186167/147
[cur:EA1/629898, xid:1418089147, rmid:11(Btree), len/tot_len:18/6494,
info:8, prev:EA1/627868] bkpblock[1]: s/d/r:1663/16385/473182
blk:1186167 hole_off/len:1300/1772
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314,
info:8, prev:EA1/629898] insert_leaf: s/d/r:1663/16385/1270734 tid
1294137/2
[cur:EA1/62B210, xid:1418089147, rmid:11(Btree), len/tot_len:18/5314,
info:8, prev:EA1/629898] bkpblock[1]: s/d/r:1663/16385/1270734
blk:1294137 hole_off/len:1064/2952
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/62B210] insert_leaf: s/d/r:1663/16385/1366221 tid
1364573/66
[cur:EA1/62C6E8, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/62B210] bkpblock[1]: s/d/r:1663/16385/1366221
blk:1364573 hole_off/len:1180/2372
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814,
info:8, prev:EA1/62C6E8] insert_leaf: s/d/r:1663/16385/1404440 tid
1195953/51
[cur:EA1/62DDF0, xid:1418089147, rmid:11(Btree), len/tot_len:18/4814,
info:8, prev:EA1/62C6E8] bkpblock[1]: s/d/r:1663/16385/1404440
blk:1195953 hole_off/len:964/3452
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862,
info:8, prev:EA1/62DDF0] insert_leaf: s/d/r:1663/16385/1410405 tid
1894183/39
[cur:EA1/62F0D8, xid:1418089147, rmid:11(Btree), len/tot_len:18/6862,
info:8, prev:EA1/62DDF0] bkpblock[1]: s/d/r:1663/16385/1410405
blk:1894183 hole_off/len:988/1404
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254,
info:8, prev:EA1/62F0D8] insert_leaf: s/d/r:1663/16385/1521566 tid
1691110/132
[cur:EA1/630BC0, xid:1418089147, rmid:11(Btree), len/tot_len:18/7254,
info:8, prev:EA1/62F0D8] bkpblock[1]: s/d/r:1663/16385/1521566
blk:1691110 hole_off/len:1044/1012
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174,
info:8, prev:EA1/630BC0] insert_leaf: s/d/r:1663/16385/5285587 tid
386419/102
[cur:EA1/632830, xid:1418089147, rmid:11(Btree), len/tot_len:18/5174,
info:8, prev:EA1/630BC0] bkpblock[1]: s/d/r:1663/16385/5285587
blk:386419 

Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-02 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Sun, Feb 2, 2014 at 6:03 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Can we see the associated WAL records (ie, the ones matching the LSNs
 in the last blocks of these files)?

 Sorry, I've lost track of what information I already shared or didn't,

Hm.  So one of these is a heap update, not an index update, which lets
out the theory that it's something specific to indexes.  But they are
all full-page-image updates, so the WAL replay code path for full-page
images still seems to be the suspect.

What version were you running before 9.1.11 exactly?  I took a look
through all the diffs from 9.1.9 up to 9.1.11, and couldn't find any
changes that seemed even vaguely related to this.  There are some
changes in known-transaction tracking, but it's hard to see a connection
there.  Most of the other diffs are in code that wouldn't execute during
WAL replay at all.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-01 Thread Greg Stark
On Fri, Jan 31, 2014 at 8:21 PM, Tom Lane t...@sss.pgh.pa.us wrote:
  So on a filesystem that supports holes
 in files, I'd expect that the added segments would be fully allocated
 if XLogReadBufferExtended did the deed, but they'd be quite small if
 _mdfd_getseg did so.  The du results you started with suggest that the
 former is the case, but could you verify that the filesystem this is
 on supports holes and that du will report only the actually allocated
 space when there's a hole?

Yup, it's xfs:

# dd if=/dev/zero seek=1M count=1 bs=1kB of=hole
1+0 records in
1+0 records out
1000 bytes (1.0 kB) copied, 5.7286e-05 s, 17.5 MB/s

# ls -ls hole
4 -rw-r--r-- 1 root root 1048577000 Feb  1 09:35 hole

# ls -ls 1261982.5*
1048576 -rw--- 1 ua8157t9mbut8r postgres 1073741824 Jan 14 12:51 1261982.5
1048576 -rw--- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:05 1261982.50
1048576 -rw--- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:07 1261982.51
1048576 -rw--- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:09 1261982.52
1048576 -rw--- 1 ua8157t9mbut8r postgres 1073741824 Jan 23 02:10 1261982.53
 508680 -rw--- 1 ua8157t9mbut8r postgres  520888320 Jan 23 02:12 1261982.54



-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-02-01 Thread Greg Stark
The plot thickens...

Looking at the next relation I see a similar symptom of a single valid
block at the end of several segments of nuls. This relation is also a
btree on the same table and has a header in the near vicinity of the
xlog:

d9de7pcqls4ib6=# select
(page_header(get_raw_page('event_data_event_id_person_id', 'main',
3631161))).* ;
lsn | tli | flags | lower | upper | special | pagesize |
version | prune_xid
+-+---+---+---+-+--+-+---
 EA1/63A0A8 |   6 | 0 |  1180 |  3552 |8176 | 8192 |
4 | 0
(1 row)

And indeed it's the very next xlog record:

[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767
blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund and...@2ndquadrant.com wrote:

 We're also seeing log entries about wal contains reference to invalid
 pages but these errors seem only vaguely correlated. Sometimes we get
 the errors but the tables don't grow noticeably and sometimes we don't
 get the errors and the tables are much larger.

 Uhm. I am a bit confused. You see those in the standby's log? At !debug
 log levels? That'd imply that the standby is dead and needed to be
 recloned, no? How do you continue after that?


So in chatting with Heikki last night we came up with a scenario where
this check is insufficient.

If you have multiple checkpoints during the base backup then there
will be restartpoints during recovery. If the reference to the invalid
page is before the restartpont then after crashing recovery and coming
back up the recovery will go forward fine.

Fixing this check doesn't look trivial. I'm inclined to say to
suppress any restartpoints while there are references to invalid pages
in the hash. The problem with this is that it will prevent trimming
the xlog during recovery. It seems frightening that most days recovery
will take little extra space but if you happen to have a drop table or
truncate during the base backup then your recovery might require a lot
of extra space.

The alternative of spilling the hash table to disk at every
restartpoint seems kind of hokey. Then we need to worry about fsyncing
this file, cleaning it up, dealing with the file after crashes, etc.

-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 11:09:14 +, Greg Stark wrote:
 On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund and...@2ndquadrant.com wrote:
 
  We're also seeing log entries about wal contains reference to invalid
  pages but these errors seem only vaguely correlated. Sometimes we get
  the errors but the tables don't grow noticeably and sometimes we don't
  get the errors and the tables are much larger.
 
  Uhm. I am a bit confused. You see those in the standby's log? At !debug
  log levels? That'd imply that the standby is dead and needed to be
  recloned, no? How do you continue after that?

 So in chatting with Heikki last night we came up with a scenario where
 this check is insufficient.

But that seems unrelated to the issue at hand, right?

 If you have multiple checkpoints during the base backup then there
 will be restartpoints during recovery. If the reference to the invalid
 page is before the restartpont then after crashing recovery and coming
 back up the recovery will go forward fine.

We don't perform restartpoints if there are invalid pages
registered. Check the XLogHaveInvalidPages() call in xlog.c.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 11:09:14 +, Greg Stark wrote:
 On Sun, Jan 26, 2014 at 5:45 PM, Andres Freund and...@2ndquadrant.com wrote:
 
  We're also seeing log entries about wal contains reference to invalid
  pages but these errors seem only vaguely correlated. Sometimes we get
  the errors but the tables don't grow noticeably and sometimes we don't
  get the errors and the tables are much larger.
 
  Uhm. I am a bit confused. You see those in the standby's log? At !debug
  log levels? That'd imply that the standby is dead and needed to be
  recloned, no? How do you continue after that?

 So in chatting with Heikki last night we came up with a scenario where
 this check is insufficient.

The slightly more likely explanation for transient errors is that you
hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
only taken effect if HS has already assembled a snapshot, which can make
such an error vanish after a restart...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund and...@2ndquadrant.com wrote:
 The slightly more likely explanation for transient errors is that you
 hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
 only taken effect if HS has already assembled a snapshot, which can make
 such an error vanish after a restart...

Which one, there seem to be several

So this seems like it's more likely to be a symptom of whatever is
causing the table to grow than a cause? That is, there's some bug
causing the standby to extend the btree dramatically resulting in lots
of uninitialized pages and touching those pages triggers this bug. But
this doesn't explain why the btree is being extended I don't think.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 11:46:09 +, Greg Stark wrote:
 On Fri, Jan 31, 2014 at 11:26 AM, Andres Freund and...@2ndquadrant.com 
 wrote:
  The slightly more likely explanation for transient errors is that you
  hit the vacuum bug (061b079f89800929a863a692b952207cadf15886). That had
  only taken effect if HS has already assembled a snapshot, which can make
  such an error vanish after a restart...
 
 Which one, there seem to be several
 
 So this seems like it's more likely to be a symptom of whatever is
 causing the table to grow than a cause? That is, there's some bug
 causing the standby to extend the btree dramatically resulting in lots
 of uninitialized pages and touching those pages triggers this bug. But
 this doesn't explain why the btree is being extended I don't think.

I don't think anything we've talked about so far is likely to explain
the issue. I don't have time atm to look closer, but what I'd do is try
to look if there are any pages with valid LSNs on the standby in the
bloated area... That then might give you a hint where to look.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
1261982.53 is entirely nuls. I think that's true for most if not all
of the intervening files, still investigating.

The 54th segment is nul up to offset 1f0c after which it has valid
looking blocks:

# hexdump 1261982.54 | head -100
000        
*
1f0c 0ea1  8988 0063 0006  04d8 0cf0

However when I grep xlogdump for any records mentioning this block I
get nothing.

In fact the largest block I find in the xlog is 3646630:

# grep  'tid 3646631/' 1261982 | wc -l
0
# grep  'tid 3646630/' 1261982 | wc -l
177

Looking at the block above it looks like the LSN is EA100638988
which I find in the logs but it's a btree insert on a different btree:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
2746914/219
[cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
info:8, prev:EA1/637140] bkpblock[1]: s/d/r:1663/16385/1364767
blk:2746914 hole_off/len:1180/2372
[cur:EA1/63A0A8, xid:1418089147, rmid:1(Transaction),
len/tot_len:32/64, info:0, prev:EA1/638988] d/s:16385/1663 commit at
2014-01-21 05:41:11 UTC


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 14:39:47 +, Greg Stark wrote:
 1261982.53 is entirely nuls. I think that's true for most if not all
 of the intervening files, still investigating.
 
 The 54th segment is nul up to offset 1f0c after which it has valid
 looking blocks:

It'd be interesting to dump the page header for that using pageinspect.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark st...@mit.edu wrote:
 [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
 info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
 blk:3634978 hole_off/len:1240/2072
 [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
 info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
 2746914/219

Actually wait, the immediate previous record is indeed on the right
filenode. Is the LSN printed in xlogdump the LSN that would be in the
pd_lsn or is the pd_lsn going to be from the following record?


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 14:59:21 +, Greg Stark wrote:
 On Fri, Jan 31, 2014 at 2:39 PM, Greg Stark st...@mit.edu wrote:
  [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
  info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
  blk:3634978 hole_off/len:1240/2072
  [cur:EA1/638988, xid:1418089147, rmid:11(Btree), len/tot_len:18/5894,
  info:8, prev:EA1/637140] insert_leaf: s/d/r:1663/16385/1364767 tid
  2746914/219
 
 Actually wait, the immediate previous record is indeed on the right
 filenode. Is the LSN printed in xlogdump the LSN that would be in the
 pd_lsn or is the pd_lsn going to be from the following record?

It points to the end of the record (i.e. the beginning of the next). It
needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
enough.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund and...@2ndquadrant.com wrote:

 It points to the end of the record (i.e. the beginning of the next). It
 needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
 enough.

Ah, in which case the relevant record is:
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072


It looks like get_raw_page() refuses to read past the end of relpages.
I could make a clone of this database to allow experimenting with
tweaking relpages but it may or may not reproduce the problem...

=# select pg_relation_size('data_pkey') / 1024 / 1024 / 1024;
 ?column?
--
   23
(1 row)

=# select get_raw_page('data_pkey', 'main', 11073632) ;
ERROR:  block number 11073632 is out of range for relation data_pkey

d9de7pcqls4ib6=# select relpages from pg_class where relname = 'data_pkey';
 relpages
--
  2889286


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 15:15:24 +, Greg Stark wrote:
 On Fri, Jan 31, 2014 at 3:08 PM, Andres Freund and...@2ndquadrant.com wrote:
 
  It points to the end of the record (i.e. the beginning of the next). It
  needs to, because otherwise XLogFlush()es on the pd_lsn wouldn't flush
  enough.
 
 Ah, in which case the relevant record is:
 [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
 info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
 3634978/282
 [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
 info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
 blk:3634978 hole_off/len:1240/2072
 
 
 It looks like get_raw_page() refuses to read past the end of relpages.
 I could make a clone of this database to allow experimenting with
 tweaking relpages but it may or may not reproduce the problem...

No, it uses smgrnblocks() to get the size.

 =# select get_raw_page('data_pkey', 'main', 11073632) ;
 ERROR:  block number 11073632 is out of range for relation data_pkey

Isn't the page 3634978?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund and...@2ndquadrant.com wrote:
 =# select get_raw_page('data_pkey', 'main', 11073632) ;
 ERROR:  block number 11073632 is out of range for relation data_pkey

 Isn't the page 3634978?

The page in the record is.

But the page on disk is in the 54th segment at offset 1F0C

So unless my arithmetic is wrong:

bc -l
ibase=16
400 * 400 * 400 / 2000 * 54 + 1F0C / 2000
11073632


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 15:21:35 +, Greg Stark wrote:
 On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund and...@2ndquadrant.com wrote:
  =# select get_raw_page('data_pkey', 'main', 11073632) ;
  ERROR:  block number 11073632 is out of range for relation data_pkey
 
  Isn't the page 3634978?
 
 The page in the record is.

It'd be interesting to look at the referenced page using bt_page_items().

 But the page on disk is in the 54th segment at offset 1F0C

It's interesting that the smgr gets this wrong then (as also evidenced
by the fact that relation_size does as well). Could you please do a ls
-l path/to/relfilenode*?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 It's interesting that the smgr gets this wrong then (as also evidenced
 by the fact that relation_size does as well). Could you please do a ls
 -l path/to/relfilenode*?

IIRC, smgrnblocks will stop as soon as it finds a segment that is not
1GB in size.  Could you check the lengths of all segments for that
relation?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 10:33:16 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  It's interesting that the smgr gets this wrong then (as also evidenced
  by the fact that relation_size does as well). Could you please do a ls
  -l path/to/relfilenode*?
 
 IIRC, smgrnblocks will stop as soon as it finds a segment that is not
 1GB in size.  Could you check the lengths of all segments for that
 relation?

Yea, that's what I am wondering about. I wanted the full list because
there could be an entire file missing and it's interesting to see at
which time they were last touched relative to each other...

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 On Fri, Jan 31, 2014 at 3:19 PM, Andres Freund and...@2ndquadrant.com wrote:
 Isn't the page 3634978?

 The page in the record is.

 But the page on disk is in the 54th segment at offset 1F0C

 So unless my arithmetic is wrong:

 bc -l
 ibase=16
 400 * 400 * 400 / 2000 * 54 + 1F0C / 2000
 11073632

At least two of us are confused.  I get

# select ((2^30) * 54.0 + 'x1F0C'::bit(32)::int) / 8192;
 ?column? 
--
  7141472
(1 row)

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
Sorry guys. I transposed two numbers when looking up the relation.
data_pk wasn't the right index.

=# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;
 lsn  | tli | flags | lower | upper | special | pagesize |
version | prune_xid
--+-+---+---+---+-+--+-+---
 CF0/2DD67BB8 |   5 | 0 |   792 |  5104 |8176 | 8192 |
  4 | 0
(1 row)


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 Sorry guys. I transposed two numbers when looking up the relation.
 data_pk wasn't the right index.

 =# select (page_header(get_raw_page('index_data_id', 'main', 3020854))).* ;
  lsn  | tli | flags | lower | upper | special | pagesize |
 version | prune_xid
 --+-+---+---+---+-+--+-+---
  CF0/2DD67BB8 |   5 | 0 |   792 |  5104 |8176 | 8192 |
   4 | 0
 (1 row)

Clearly not the right page --- the LSN isn't what we saw in the hexdump.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Fri, Jan 31, 2014 at 3:41 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 400 * 400 * 400 / 2000 * 54 + 1F0C / 2000
 11073632

Ooops, it's reading 54 in hex there.
 # select ((2^30) * 54.0 + 'x1F0C'::bit(32)::int) / 8192;
  ?column?
 --
   7141472

ibase=16
400 * 400 * 400 / 2000 * 36 + 1F0C / 2000
7141472.

So now that I've got the arithmetic right:

=# select (page_header(get_raw_page('index_data_id', 'main', 7141472))).* ;
lsn | tli | flags | lower | upper | special | pagesize |
version | prune_xid
+-+---+---+---+-+--+-+---
 EA1/638988 |   6 | 0 |  1240 |  3312 |8176 | 8192 |
4 | 0


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Andres Freund
On 2014-01-31 10:33:16 -0500, Tom Lane wrote:
 Andres Freund and...@2ndquadrant.com writes:
  It's interesting that the smgr gets this wrong then (as also evidenced
  by the fact that relation_size does as well). Could you please do a ls
  -l path/to/relfilenode*?
 
 IIRC, smgrnblocks will stop as soon as it finds a segment that is not
 1GB in size.  Could you check the lengths of all segments for that
 relation?

I am not sure that explains the issue, but I think the redo action for
truncation is not safe across crashes.  A XLOG_SMGR_TRUNCATE will just
do a smgrtruncate() (and then mdtruncate) which will iterate over the
segments starting at 0 till mdnblocks()/segment_size and *truncate* but
not delete individual segment files that are not needed anymore, right?
If we crash in the midst of that a new mdtruncate() will be issued, but
it will get a shorter value back from mdnblocks().

Am I missing something?

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
So just to summarize, this xlog record:

[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
3634978/282
[cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
blk:3634978 hole_off/len:1240/2072

Appears to have been written to this location:

d9de7pcqls4ib6=# select
(page_header(get_raw_page('index_event_data_on_event_occurrence_id',
'main', 7141472))).* ;
lsn | tli | flags | lower | upper | special | pagesize |
version | prune_xid
+-+---+---+---+-+--+-+---
 EA1/638988 |   6 | 0 |  1240 |  3312 |8176 | 8192 |
4 | 0
(1 row)

The correct location appears to have been written to by later records
(a split where it's the leftsib to be specific) so any forensic
evidence is lost:

d9de7pcqls4ib6=# select
(page_header(get_raw_page('index_event_data_on_event_occurrence_id',
'main', 3634978))).* ;
lsn | tli | flags | lower | upper | special | pagesize |
version | prune_xid
+-+---+---+---+-+--+-+---
 EA1/A90CF8 |   6 | 0 |   792 |  5104 |8176 | 8192 |
4 | 0
(1 row)

But the record prior to that lsn is a split where the leftsib matches
that record:

[cur:EA1/A8EDE0, xid:1418089314, rmid:11(Btree),
len/tot_len:3502/7938, info:68, prev:EA1/A8D7A0] split_r:
s/d/r:1663/16385/1261982 leftsib 3634978
[cur:EA1/A8EDE0, xid:1418089314, rmid:11(Btree),
len/tot_len:3502/7938, info:68, prev:EA1/A8D7A0] bkpblock[2]:
s/d/r:1663/16385/1261982 blk:1881360 hole_off/len:892/3812

Interestingly there are a bunch of other records that also write to
that block but there are no other full page writes. That seems to
imply that the index is currently missing the leaf inserted by the
EA1/637140 record.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 So just to summarize, this xlog record:
 [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
 info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
 3634978/282
 [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
 info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
 blk:3634978 hole_off/len:1240/2072

 Appears to have been written to [ block 7141472 ]

I've been staring at the code for a bit trying to guess how that could
have happened.  Since the WAL record has a backup block, btree_xlog_insert
would have passed control to RestoreBackupBlock, which would call
XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
about writing past the end of the relation.  Now, you can imagine some
very low-level error causing a write to go to the wrong page due to a seek
problem or some such, but it's hard to credit that that would've resulted
in creation of all the intervening segment files.  Some level of our code
had to have thought it was being told to extend the relation.

However, on closer inspection I was a bit surprised to realize that there
are two possible candidates for doing that!  XLogReadBufferExtended will
extend the relation, a block at a time, if told to write a page past
the current nominal EOF.  And in md.c, _mdfd_getseg will *also* extend
the relation if we're InRecovery, even though it normally would not do
so when called from mdwrite().

Given the behavior in XLogReadBufferExtended, I rather think that the
InRecovery special case in _mdfd_getseg is dead code and should be
removed.  But for the purpose at hand, it's more interesting to try to
confirm which of these code levels did the extension.  I notice that
_mdfd_getseg only bothers to write the last physical page of each segment,
whereas XLogReadBufferExtended knows nothing of segments and will
ploddingly write every page.  So on a filesystem that supports holes
in files, I'd expect that the added segments would be fully allocated
if XLogReadBufferExtended did the deed, but they'd be quite small if
_mdfd_getseg did so.  The du results you started with suggest that the
former is the case, but could you verify that the filesystem this is
on supports holes and that du will report only the actually allocated
space when there's a hole?

Assuming that the extension was done in XLogReadBufferExtended, we are
forced to the conclusion that XLogReadBufferExtended was passed a bad
block number (viz 7141472); and it's pretty hard to see how that could
happen.  RestoreBackupBlock is just passing the value it got out of the
WAL record.  I thought about the idea that it was wrong about exactly
where the BkpBlock struct was in the record, but that would presumably
lead to garbage relnode and fork numbers not just a bad block number.

So I'm still baffled ...

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
One thing I keep coming back to is a bad ran chip setting a bit in the
block number. But I just can't seem to get it to add up. The difference is
not a power of two, it had happened on two different machines, and we don't
see other weirdness on the machine. It seems like a strange coincidence it
would happen to the same variable twice and not to other variables.

Unless there's some unrelated code writing through a wild pointer, possibly
to a stack allocated object that just happens to often be that variable?

-- 
greg
On 31 Jan 2014 20:21, Tom Lane t...@sss.pgh.pa.us wrote:

 Greg Stark st...@mit.edu writes:
  So just to summarize, this xlog record:
  [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
  info:8, prev:EA1/635290] insert_leaf: s/d/r:1663/16385/1261982 tid
  3634978/282
  [cur:EA1/637140, xid:1418089147, rmid:11(Btree), len/tot_len:18/6194,
  info:8, prev:EA1/635290] bkpblock[1]: s/d/r:1663/16385/1261982
  blk:3634978 hole_off/len:1240/2072

  Appears to have been written to [ block 7141472 ]

 I've been staring at the code for a bit trying to guess how that could
 have happened.  Since the WAL record has a backup block, btree_xlog_insert
 would have passed control to RestoreBackupBlock, which would call
 XLogReadBufferExtended with mode RBM_ZERO, so there would be no complaint
 about writing past the end of the relation.  Now, you can imagine some
 very low-level error causing a write to go to the wrong page due to a seek
 problem or some such, but it's hard to credit that that would've resulted
 in creation of all the intervening segment files.  Some level of our code
 had to have thought it was being told to extend the relation.

 However, on closer inspection I was a bit surprised to realize that there
 are two possible candidates for doing that!  XLogReadBufferExtended will
 extend the relation, a block at a time, if told to write a page past
 the current nominal EOF.  And in md.c, _mdfd_getseg will *also* extend
 the relation if we're InRecovery, even though it normally would not do
 so when called from mdwrite().

 Given the behavior in XLogReadBufferExtended, I rather think that the
 InRecovery special case in _mdfd_getseg is dead code and should be
 removed.  But for the purpose at hand, it's more interesting to try to
 confirm which of these code levels did the extension.  I notice that
 _mdfd_getseg only bothers to write the last physical page of each segment,
 whereas XLogReadBufferExtended knows nothing of segments and will
 ploddingly write every page.  So on a filesystem that supports holes
 in files, I'd expect that the added segments would be fully allocated
 if XLogReadBufferExtended did the deed, but they'd be quite small if
 _mdfd_getseg did so.  The du results you started with suggest that the
 former is the case, but could you verify that the filesystem this is
 on supports holes and that du will report only the actually allocated
 space when there's a hole?

 Assuming that the extension was done in XLogReadBufferExtended, we are
 forced to the conclusion that XLogReadBufferExtended was passed a bad
 block number (viz 7141472); and it's pretty hard to see how that could
 happen.  RestoreBackupBlock is just passing the value it got out of the
 WAL record.  I thought about the idea that it was wrong about exactly
 where the BkpBlock struct was in the record, but that would presumably
 lead to garbage relnode and fork numbers not just a bad block number.

 So I'm still baffled ...

 regards, tom lane



Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Tom Lane
Greg Stark st...@mit.edu writes:
 One thing I keep coming back to is a bad ran chip setting a bit in the
 block number. But I just can't seem to get it to add up. The difference is
 not a power of two, it had happened on two different machines, and we don't
 see other weirdness on the machine. It seems like a strange coincidence it
 would happen to the same variable twice and not to other variables.

I also looked at the bit patterns for the two block numbers, and couldn't
detect any relationship.

 Unless there's some unrelated code writing through a wild pointer, possibly
 to a stack allocated object that just happens to often be that variable?

Yeah, I'd been wondering if the WAL record somehow got corrupted while
in memory (presumably after being CRC-checked).  It's a bit hard to see
how though.

Are all the bloated-on-the-slave relations indexes?  I think the most
fruitful thing to do at this point is to try to isolate the bloating
events for the other affected rels as you've done for this one.
Maybe we'll see a pattern.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Josh Berkus
On 01/31/2014 01:11 PM, Tom Lane wrote:
 Greg Stark st...@mit.edu writes:
 One thing I keep coming back to is a bad ran chip setting a bit in the
 block number. But I just can't seem to get it to add up. The difference is
 not a power of two, it had happened on two different machines, and we don't
 see other weirdness on the machine. It seems like a strange coincidence it
 would happen to the same variable twice and not to other variables.
 
 I also looked at the bit patterns for the two block numbers, and couldn't
 detect any relationship.
 
 Unless there's some unrelated code writing through a wild pointer, possibly
 to a stack allocated object that just happens to often be that variable?
 
 Yeah, I'd been wondering if the WAL record somehow got corrupted while
 in memory (presumably after being CRC-checked).  It's a bit hard to see
 how though.
 
 Are all the bloated-on-the-slave relations indexes?  I think the most
 fruitful thing to do at this point is to try to isolate the bloating
 events for the other affected rels as you've done for this one.
 Maybe we'll see a pattern.

FWIW, we've periodically seen reports from our clients of replica
databases being slightly larger than the master.  Nothing reproducable
or as severe as Greg's issue, or we'd have reported it.  But this could
be a more widespread issue, just that it affects most users in the +5%
ballpark, so they don't notice.

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Greg Stark
On Fri, Jan 31, 2014 at 10:11 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Yeah, I'd been wondering if the WAL record somehow got corrupted while
 in memory (presumably after being CRC-checked).  It's a bit hard to see
 how though.

One thing I mentioned early on but bears repeating is that this
instance is 9.1.11.

Also something that occurred to me at 3am -- the reference to invalid
pages recovery errors that replayed correctly after the panic might
also explain why the slave seems to operate correctly. It's possible
after the panic it replayed those same records correctly.

 Are all the bloated-on-the-slave relations indexes?  I think the most
 fruitful thing to do at this point is to try to isolate the bloating
 events for the other affected rels as you've done for this one.
 Maybe we'll see a pattern.

I'll poke at those others tomorrow/today. I can also try to bring up a
new standby from the same base backup but it'll take time. It's a
large database. Also the fear I have above is that if I set a recovery
target I might make it miss the bug.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-31 Thread Tom Lane
Josh Berkus j...@agliodbs.com writes:
 FWIW, we've periodically seen reports from our clients of replica
 databases being slightly larger than the master.  Nothing reproducable
 or as severe as Greg's issue, or we'd have reported it.  But this could
 be a more widespread issue, just that it affects most users in the +5%
 ballpark, so they don't notice.

Well, the bloat aspect of it is really the least of our problems.
AFAICT what is happening is flat-out index corruption, that is, an
intended update isn't applied where it should be but to some
seemingly-random other page.  If this is common then it seems like we
ought to be hearing a lot of reports of query misbehavior on slaves.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-26 Thread Andres Freund
Hi,

On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
 Since the point release we've run into a number of databases that when
 we restore from a base backup end up being larger than the primary
 database was. Sometimes by a large factor. The data below is from
 9.1.11 (both primary and standby) but we've seen the same thing on
 9.2.6.

What's the procedure for creating those standbys? Were they of similar
size after being cloned?

 primary$ for i in  1261982 1364767 1366221 473158 ; do echo -n $i  ;
 du -shc $i* | tail -1 ; done
 1261982 29G total
 1364767 23G total
 1366221 12G total
 473158 76G total
 
 standby$ for i in  1261982 1364767 1366221 473158 ; do echo -n $i  ;
 du -shc $i* | tail -1 ; done
 1261982 55G total
 1364767 28G total
 1366221 17G total
 473158 139G total
 ...
 The first three are btrees and the fourth is a haeap btw.

Are those all of the same underlying heap relation?

 We're also seeing log entries about wal contains reference to invalid
 pages but these errors seem only vaguely correlated. Sometimes we get
 the errors but the tables don't grow noticeably and sometimes we don't
 get the errors and the tables are much larger.

Uhm. I am a bit confused. You see those in the standby's log? At !debug
log levels? That'd imply that the standby is dead and needed to be
recloned, no? How do you continue after that?

 Much of the added space is uninitialized pages as you might expect but
 I don't understand is how the database can start up without running
 into the reference to invalid pages panic consistently. We check
 both that there are no references after consistency is reached *and*
 that any references before consistency are resolved by a truncate or
 unlink before consistency.

Well, it's pretty easy to get into a situation with lot's of new
pages. Lots of concurrent inserts that all fail before logging WAL. The
next insert will extend the relation and only initialise that last
value.

It'd be interesting to look for TRUNCATE records using xlogdump. Could
you show those for starters?

 I'm assuming this is somehow related to the mulixact or transaction
 wraparound problems but I don't really understand how they could be
 hitting when both the primary and standby are post-upgrade to the most
 recent point release which have the fixes

That doesn't sound likely. For one the symptoms don't fit, for another,
those problems are mostly 9.3+.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Recovery inconsistencies, standby much larger than primary

2014-01-26 Thread Greg Stark
On Sun, Jan 26, 2014 at 9:45 AM, Andres Freund and...@2ndquadrant.com wrote:
 Hi,

 On 2014-01-24 19:23:28 -0500, Greg Stark wrote:
 Since the point release we've run into a number of databases that when
 we restore from a base backup end up being larger than the primary
 database was. Sometimes by a large factor. The data below is from
 9.1.11 (both primary and standby) but we've seen the same thing on
 9.2.6.

 What's the procedure for creating those standbys? Were they of similar
 size after being cloned?

These are restored from base backup using WAL-E and then started in
standby mode. The logs are retrieved using archive_command (which is
WAL-E) after it retrieves lots of archived wal the database switches
to streaming.

We confirmed from size monitoring that the standby database grew
substantially before the time it reported reaching consistent state,
so I only downloaded the WAL from that range for analysis.

 primary$ for i in  1261982 1364767 1366221 473158 ; do echo -n $i  ;
 du -shc $i* | tail -1 ; done
 1261982 29G total
 1364767 23G total
 1366221 12G total
 473158 76G total

 standby$ for i in  1261982 1364767 1366221 473158 ; do echo -n $i  ;
 du -shc $i* | tail -1 ; done
 1261982 55G total
 1364767 28G total
 1366221 17G total
 473158 139G total
 ...
 The first three are btrees and the fourth is a haeap btw.

 Are those all of the same underlying heap relation?

Are you asking whether the relfilenode was reused for a different
relation? I doubt it.

Or are you asking if the first three indexes are for the same heap
(presumably the fourth one)? I don't think so but I can check.

 We're also seeing log entries about wal contains reference to invalid
 pages but these errors seem only vaguely correlated. Sometimes we get
 the errors but the tables don't grow noticeably and sometimes we don't
 get the errors and the tables are much larger.

 Uhm. I am a bit confused. You see those in the standby's log? At !debug
 log levels? That'd imply that the standby is dead and needed to be
 recloned, no? How do you continue after that?

It's possible I'm confusing symptoms from an unrelated problem. But
the symptom we saw was that it got this error, recovery crashed, then
recovery started again and it replayed fine. I agree that doesn't jive
with the code I see in 9.3, I didn't check how long the code was this
tense though.

 Much of the added space is uninitialized pages as you might expect but
 I don't understand is how the database can start up without running
 into the reference to invalid pages panic consistently. We check
 both that there are no references after consistency is reached *and*
 that any references before consistency are resolved by a truncate or
 unlink before consistency.

 Well, it's pretty easy to get into a situation with lot's of new
 pages. Lots of concurrent inserts that all fail before logging WAL. The
 next insert will extend the relation and only initialise that last
 value.

 It'd be interesting to look for TRUNCATE records using xlogdump. Could
 you show those for starters?

There are no records matching grep -i truncate in any of those
extracts for those relfilenodes. I'm grepping the whole xlogdump now
but it'll take a while. So far no truncates anywhere.

 I'm assuming this is somehow related to the mulixact or transaction
 wraparound problems but I don't really understand how they could be
 hitting when both the primary and standby are post-upgrade to the most
 recent point release which have the fixes

 That doesn't sound likely. For one the symptoms don't fit, for another,
 those problems are mostly 9.3+.

These problems all started to appear after the latest point release
btw. That could just be a coincidence of course.


-- 
greg


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers