subject:"\[HACKERS\] Re\: \[GENERAL\] 9.4.1 \-> 9.4.2 problem\: could not access status of transaction 1"

On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas robertmh...@gmail.com wrote:
 - Forces aggressive autovacuuming when the control file's
 oldestMultiXid doesn't point to a valid MultiXact and enables member
 wraparound at the next checkpoint following the correction of that
 problem.

Err, enables member wraparound *protection* at the next checkpoint,
not the wraparound itself.

It's worth noting that every startup will now include one of the
following two messages:

LOG:  MultiXact member wraparound protections are now enabled

Or:

LOG:  MultiXact member wraparound protections are disabled because
oldest checkpointed MultiXact %u does not exist on disk
...where %u is probably 1

If you get the second one, you'll get the first one later after vacuum
has done its thing and a checkpoint has happened.

This is, obviously, some log chatter for people who don't have a
problem and never have, but I think it's worth emitting the first
message at startup even when there's no problem, so that people don't
have to make inferences from the absence of a message.  We can tell
people very simply that (1) if they see the first message, everything
is fine; (2) if they see the second message, autovacuum is going to
clean things up and they will eventually see the first message; and
(3) if they see neither message, they haven't upgraded to a fixed
version yet.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas robertmh...@gmail.com wrote:
 Thanks for the review.

 Here's a new version.  I've fixed the things Alvaro and Noah noted,
 and some compiler warnings about set but unused variables.

 I also tested it, and it doesn't quite work as hoped.  If started on a
 cluster where oldestMultiXid is incorrectly set to 1, it starts up and
 indicates that the member wraparound guards are disabled.  But even
 after everything is fixed, they don't get enabled until after the next
 full restart.  I think that's because TruncateMultiXact() bails out
 too early, without calling DetermineSafeOldestOffset.

 My attempt at a quick fix for that problem didn't work out, so I'm
 posting this version for now to facilitate further review and testing.

Here's a new version with some more fixes and improvements:

- SetOffsetVacuumLimit was failing to set MultiXactState-oldestOffset
when the oldest offset became known if the now-known value happened to
be zero.  Fixed.

- SetOffsetVacuumLimit now logs useful information at the DEBUG1
level, so that you can see that it's doing what it's supposed to.

- TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
offsetStopLimit even if it can't truncate anything.  This seems
useless, but it's not, because it may be that the last checkpoint
advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
value, and now we can actually set offsetStopLimit properly.

- TruncateMultiXact no longer calls find_multixact_start when there
are no remaining multixacts.  This is actually a completely separate
bug that goes all the way back to 9.3.0 and can potentially cause
TruncateMultiXact to remove every file in pg_multixact/offsets.
Restarting the cluster becomes impossible because TrimMultiXact barfs.

- TruncateMultiXact now logs a message if the oldest multixact does
not precede the earliest one on disk and is not equal to the next
multixact and yet does not exist.  The value of the log message is
that it discovered the bug mentioned in the previous line, so I think
it's earning its keep.

With this version, I'm able to see that when you start up a
9.3.latest+this patch with a cluster that has a bogus value of 1 in
relminmxid, datminmxid, and the control file, autovacuum vacuums
everything in sight, all the values get set back to the right thing,
and the next checkpoint enables the member-wraparound guards.  This
works with both autovacuum=on and autovacuum=off; the emergency
mechanism kicks in as intended.  We'll want to warn people with big
databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
want to pre-vacuum those tables before upgrading to avoid a vacuum
storm.  But generally I'm pretty happy with this: forcing those values
to get fixed so that we can guard against member-space wraparound
seems like the right thing to do.

So, to summarize, this patch does the following:

- Fixes the failure-to-start problems introduced in 9.4.2 in
complicated pg_upgrade scenarios.
- Prevents the new calls to find_multixact_start we added in 9.4.2
from happening during recovery, where they can only create failure
scenarios.  The call in TruncateMultiXact that has been there all
along is not eliminated, but now handles failure more gracefully.
- Fixes possible incorrect removal of every single
pg_multixact/offsets file when no multixacts exist; one file should be
kept.
- Forces aggressive autovacuuming when the control file's
oldestMultiXid doesn't point to a valid MultiXact and enables member
wraparound at the next checkpoint following the correction of that
problem.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit 87aa15fe5257060e0c971e135dd9f460fdc00bd0
Author: Robert Haas rhaas@postgresql.org
Date:   Thu Jun 4 11:58:49 2015 -0400

bar

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..7c457a6 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -198,13 +198,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro

On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas robertmh...@gmail.com wrote:
Here's a new version with some more fixes and improvements:

- SetOffsetVacuumLimit was failing to set MultiXactState-oldestOffset
when the oldest offset became known if the now-known value happened to
be zero. Fixed.

- SetOffsetVacuumLimit now logs useful information at the DEBUG1
level, so that you can see that it's doing what it's supposed to.

So, to summarize, this patch does the following:

LOG: could not truncate directory pg_multixact/offsets: apparent wraparound

That message comes from SimpleLruTruncate.

--
Thomas Munro
http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Andres Freund

Hi,

On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
 + /*
 +  * Do we need an emergency autovacuum?  If we're not sure, assume yes.
 +  */
 + return !oldestOffsetKnown ||
 + (nextOffset - oldestOffset  MULTIXACT_MEMBER_SAFE_THRESHOLD);

I think without teaching autovac about those rules, this might just lead
to lots of autovac processes starting without knowing they should do
something? They know about autovacuum_multixact_freeze_age, but they
know neither about !oldestOffsetKnown nor, afaics, about pending offset
wraparounds?

 -static MultiXactOffset
 -find_multixact_start(MultiXactId multi)
 +static bool
 +find_multixact_start(MultiXactId multi, MultiXactOffset *result)
  {
   MultiXactOffset offset;
   int pageno;
 @@ -2630,6 +2741,9 @@ find_multixact_start(MultiXactId multi)
   pageno = MultiXactIdToOffsetPage(multi);
   entryno = MultiXactIdToOffsetEntry(multi);
  
 + if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
 + return false;
 +
   /* lock is acquired by SimpleLruReadPage_ReadOnly */
   slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
   offptr = (MultiXactOffset *) 
 MultiXactOffsetCtl-shared-page_buffer[slotno];
 @@ -2642,25 +2756,31 @@ find_multixact_start(MultiXactId multi)


I think it'd be a good idea to also return false in case of a
InvalidMultiXactId - that'll be returned if the page has been zeroed.


Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
 + /*
 +  * Do we need an emergency autovacuum?  If we're not sure, assume yes.
 +  */
 + return !oldestOffsetKnown ||
 + (nextOffset - oldestOffset  MULTIXACT_MEMBER_SAFE_THRESHOLD);

 I think without teaching autovac about those rules, this might just lead
 to lots of autovac processes starting without knowing they should do
 something? They know about autovacuum_multixact_freeze_age, but they
 know neither about !oldestOffsetKnown nor, afaics, about pending offset
 wraparounds?

You're right, but that's why the latest patch has changes in
MultiXactMemberFreezeThreshold.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Alvaro Herrera

Alvaro Herrera wrote:
 Robert Haas wrote:
 
  So here's a patch taking a different approach.
 
 I tried to apply this to 9.3 but it's messy because of pgindent.  Anyone
 would have a problem with me backpatching a pgindent run of multixact.c?

Done.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas robertmh...@gmail.com wrote:
 Thanks for the review.

Here's a new version.  I've fixed the things Alvaro and Noah noted,
and some compiler warnings about set but unused variables.

I also tested it, and it doesn't quite work as hoped.  If started on a
cluster where oldestMultiXid is incorrectly set to 1, it starts up and
indicates that the member wraparound guards are disabled.  But even
after everything is fixed, they don't get enabled until after the next
full restart.  I think that's because TruncateMultiXact() bails out
too early, without calling DetermineSafeOldestOffset.

My attempt at a quick fix for that problem didn't work out, so I'm
posting this version for now to facilitate further review and testing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit eb39cf10e4ff853ed4b9d3fab599cf42911e6f70
Author: Robert Haas rhaas@postgresql.org
Date:   Thu Jun 4 11:58:49 2015 -0400

bar

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 699497c..209d3e6 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -196,13 +196,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;
 
 	/*
 	 * This is what the previous checkpoint stored as the truncate position.
@@ -219,6 +230,7 @@ typedef struct MultiXactStateData
 
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;
+	bool offsetStopLimitKnown;
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -348,10 +360,11 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 		MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
+static void DetermineSafeOldestOffset(MultiXactOffset oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 		 MultiXactOffset start, uint32 distance);
-static MultiXactOffset find_multixact_start(MultiXactId multi);
+static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno, uint8 info);
 
 
@@ -960,7 +973,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 * against catastrophic data loss due to multixact wraparound.  The basic
 	 * rules are:
 	 *
-	 * If we're past multiVacLimit or the safe threshold for member storage space,
+	 * If we're past multiVacLimit or the safe threshold for member storage
+	 * space, or we don't know what the safe threshold for member storage is,
 	 * start trying to force autovacuum cycles.
 	 * If we're past multiWarnLimit, start issuing warnings.
 	 * If we're past multiStopLimit, refuse to create new MultiXactIds.
@@ -969,6 +983,7 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *--
 	 */
 	if (!MultiXactIdPrecedes(result, MultiXactState-multiVacLimit) ||
+		!MultiXactState-oldestOffsetKnown ||
 		(MultiXactState-nextOffset - MultiXactState-oldestOffset
 			 MULTIXACT_MEMBER_SAFE_THRESHOLD))
 	{
@@ -1083,7 +1098,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 	 *--
 	 */
 #define OFFSET_WARN_SEGMENTS	20
-	if (MultiXactOffsetWouldWrap(MultiXactState-offsetStopLimit, nextOffset,
+	if (MultiXactState-offsetStopLimitKnown 
+		MultiXactOffsetWouldWrap(MultiXactState-offsetStopLimit, nextOffset,
  nmembers))
 		ereport(ERROR,
 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
@@ -1095,7 +,8 @@ GetNewMultiXactId(int nmembers, MultiXactOffset *offset)
 		   MultiXactState-offsetStopLimit - nextOffset - 1),
  errhint(Execute a database-wide VACUUM in database with OID %u with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.,
 		 MultiXactState-oldestMultiXactDB)));
-	else if (MultiXactOffsetWouldWrap(MultiXactState-offsetStopLimit,
+	else if (MultiXactState-offsetStopLimitKnown 
+			 MultiXactOffsetWouldWrap(MultiXactState-offsetStopLimit,

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Noah Misch

On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote:
 So here's a patch taking a different approach.  In this approach, if
 the multixact whose members we want to look up doesn't exist, we don't
 use a later one (that might or might not be valid).  Instead, we
 attempt to cope with the unknown.  That means:
 
 1. In TruncateMultiXact(), we don't truncate.

I like that change a lot.  It's much easier to seek forgiveness for wasting =
28 GiB of disk than for deleting visibility information wrongly.

 2. If setting the offset stop limit (the point where we refuse to
 create new multixact space), we don't arm the stop point.  This means
 that if you're in this situation, you run without member wraparound
 protection until it's corrected.  A message gets logged once per
 checkpoint telling you that you have this problem, and another message
 gets logged when things get straightened out and the guards are
 enabled.
 
 3. If setting the vacuum force point, we assume that it's appropriate
 to immediately force vacuum.

Those seem reasonable, too.

 I've only tested this very lightly - this is just to see what you and
 Noah and others think of the approach.  As compared with the previous
 approach, it has the advantage of making minimal assumptions about the
 sanity of what's on disk.  It has the disadvantage that, for some
 people, the member-wraparound guard won't be enabled at startup -- but
 note that those people can't start 9.3.7/9.4.2 *at all*, so currently
 they are either running without member wraparound protection anyway
 (if they haven't upgraded to those releases) or they're down entirely.

That disadvantage is negligible, considering.

 Another disadvantage is that we'll be triggering what may be quite a
 bit of autovacuum activity for some people, which could be painful.
 On the plus side, they'll hopefully end up with sane relminmxid and
 datminmxid guards afterwards.

That sounds good so long as each table requires just one successful emergency
autovacuum.  I'm not seeing code to ensure that the launched autovacuum will
indeed perform a full-table scan and update relminmxid; is it there?

For sites that can't tolerate an autovacuum storm, what alternative can we
provide?  Is SET vacuum_multixact_freeze_table_age = 0; VACUUM table of
every table, done before applying the minor update, sufficient?

  static void
 -DetermineSafeOldestOffset(MultiXactId oldestMXact)
 +DetermineSafeOldestOffset(MultiXactOffset oldestMXact)

Leftover change from an earlier iteration?  The values passed continue to be
MultiXactId values.

   /* move back to start of the corresponding segment */
 - oldestOffset -= oldestOffset %
 - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
 + offsetStopLimit = oldestOffset - (oldestOffset %
 + (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
 + /* always leave one segment before the wraparound point */
 + offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * 
 SLRU_PAGES_PER_SEGMENT);
 +
 + /* if nothing has changed, we're done */
 + if (prevOffsetStopLimitKnown  offsetStopLimit == prevOffsetStopLimit)
 + return;
  
   LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 - /* always leave one segment before the wraparound point */
 - MultiXactState-offsetStopLimit = oldestOffset -
 - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
 + MultiXactState-offsetStopLimit = oldestOffset;

That last line needs s/oldestOffset/offsetStopLimit/, I presume.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch n...@leadboat.com wrote:
 I like that change a lot.  It's much easier to seek forgiveness for wasting =
 28 GiB of disk than for deleting visibility information wrongly.

I'm glad you like it.  I concur.

 2. If setting the offset stop limit (the point where we refuse to
 create new multixact space), we don't arm the stop point.  This means
 that if you're in this situation, you run without member wraparound
 protection until it's corrected.  A message gets logged once per
 checkpoint telling you that you have this problem, and another message
 gets logged when things get straightened out and the guards are
 enabled.

 3. If setting the vacuum force point, we assume that it's appropriate
 to immediately force vacuum.

 Those seem reasonable, too.

Cool.

 I've only tested this very lightly - this is just to see what you and
 Noah and others think of the approach.  As compared with the previous
 approach, it has the advantage of making minimal assumptions about the
 sanity of what's on disk.  It has the disadvantage that, for some
 people, the member-wraparound guard won't be enabled at startup -- but
 note that those people can't start 9.3.7/9.4.2 *at all*, so currently
 they are either running without member wraparound protection anyway
 (if they haven't upgraded to those releases) or they're down entirely.

 That disadvantage is negligible, considering.

All right.

 Another disadvantage is that we'll be triggering what may be quite a
 bit of autovacuum activity for some people, which could be painful.
 On the plus side, they'll hopefully end up with sane relminmxid and
 datminmxid guards afterwards.

 That sounds good so long as each table requires just one successful emergency
 autovacuum.  I'm not seeing code to ensure that the launched autovacuum will
 indeed perform a full-table scan and update relminmxid; is it there?

No.  Oops.

 For sites that can't tolerate an autovacuum storm, what alternative can we
 provide?  Is SET vacuum_multixact_freeze_table_age = 0; VACUUM table of
 every table, done before applying the minor update, sufficient?

I don't know.  In practical terms, they probably need to ensure that
if pg_multixact/offsets/ does not exist, no relations have
relminmxid = 1 and no remaining databases have datminmxid = 1.
Exactly what it will take to get there is possibly dependent on which
minor release you are running; on current minor releases, I am hopeful
that what you propose is sufficient.

  static void
 -DetermineSafeOldestOffset(MultiXactId oldestMXact)
 +DetermineSafeOldestOffset(MultiXactOffset oldestMXact)

 Leftover change from an earlier iteration?  The values passed continue to be
 MultiXactId values.

Oopsie.

   /* move back to start of the corresponding segment */
 - oldestOffset -= oldestOffset %
 - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
 + offsetStopLimit = oldestOffset - (oldestOffset %
 + (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT));
 + /* always leave one segment before the wraparound point */
 + offsetStopLimit -= (MULTIXACT_MEMBERS_PER_PAGE * 
 SLRU_PAGES_PER_SEGMENT);
 +
 + /* if nothing has changed, we're done */
 + if (prevOffsetStopLimitKnown  offsetStopLimit == prevOffsetStopLimit)
 + return;

   LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 - /* always leave one segment before the wraparound point */
 - MultiXactState-offsetStopLimit = oldestOffset -
 - (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
 + MultiXactState-offsetStopLimit = oldestOffset;

 That last line needs s/oldestOffset/offsetStopLimit/, I presume.

Another oops.

Thanks for the review.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

Andres Freund wrote:
 On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:

  One idea I had was: what if the oldestMulti pointed to another multi
  earlier in the same 0046 file, so that it is read-as-zeroes (and the
  file is created), and then a subsequent multixact truncate tries to read
  a later page in the file.  In SlruPhysicalReadPage() this would give a
  change for open() to not fail, and then read() can fail as above.
  However, we would have an earlier LOG message about reading as zeroes.
  
  Really, the whole question of how this code goes past the open() failure
  in SlruPhysicalReadPage baffles me.  I don't see any possible way for
  the file to be created ...
 
 Wouldn't a previous WAL record zeroing another part of that segment
 explain this? A zero sized segment pretty much would lead to this error,
 right? Or were you able to check how things look after the failure?

But why would there be a previous WAL record zeroing another part of
that segment?  Note that this segment is very old -- hasn't been written
in quite a while, it's certainly not in slru buffers anymore.

  2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
  2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
  000173DD00AD from archive
  2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access 
  status of transaction 4624559
  2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
  file pg_multixact/offsets/0046 at offset 147456: Success.
  2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 
  4782) exited with exit code 1
  2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due 
  to startup process failure
 
 From this isn't not entirely clear where this error was triggered from.

Well, reading code, it seems reasonable that to assume that replay of
the checkpoint record I mentioned leads to that error message when the
file exists but is not long enough to contain the given offset.  There
are not MultiXact wal records in the segment.  Also note that there's no
other restored log file message after the entering standby mode
message.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas

On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas robertmh...@gmail.com wrote:
 On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund and...@anarazel.de wrote:
  Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
  the disk it'll always get one at a segment boundary, right? I'm not sure
  that's actually ok; because the value at the beginning of the segment
  can very well end up being a 0, as MaybeExtendOffsetSlru() will have
  filled the page with zeros.
 
  I think that should be harmless, the worst that can happen is that
  oldestOffset errorneously is 0, which should be correct, even though
  possibly overly conservative, in these cases.

 Uh oh.  That seems like a real bad problem for this approach.  What
 keeps that from being the opposite of too conservative?  There's no
 safe value in a circular numbering space.

 I think it *might* (I'm really jetlagged) be fine because that'll only
 happen after a upgrade from  9.3. And in that case we initialize
 nextOffset to 0. That ought to safe us?

 That's pretty much betting the farm on the bugs we know about today
 being the only ones there are.  That seems imprudent.

So here's a patch taking a different approach.  In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid).  Instead, we
attempt to cope with the unknown.  That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point.  This means
that if you're in this situation, you run without member wraparound
protection until it's corrected.  A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach.  As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk.  It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4400fc5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -198,13 +198,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;
 
 	/*
 	 * This is what the previous checkpoint stored as the truncate position.
@@ -221,6 +232,7 @@ typedef struct MultiXactStateData
 
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;
+	bool offsetStopLimitKnown;
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -350,10 +362,11 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 		MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
+static void DetermineSafeOldestOffset(MultiXactOffset oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 		 MultiXactOffset start, uint32 distance);
-static MultiXactOffset find_multixact_start(MultiXactId multi);
+static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZeroPageXlogRec(int pageno,

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro

On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch n...@leadboat.com wrote:
 While testing this (with inconsistent-multixact-fix-master.patch applied,
 FWIW), I noticed a nearby bug with a similar symptom.  TruncateMultiXact()
 omits the nextMXact==oldestMXact special case found in each other
 find_multixact_start() caller, so it reads the offset of a not-yet-created
 MultiXactId.  The usual outcome is to get rangeStart==0, so we truncate less
 than we could.  This can't make us truncate excessively, because
 nextMXact==oldestMXact implies no table contains any mxid.  If nextMXact
 happens to be the first of a segment, an error is possible.  Procedure:

 1. Make a fresh cluster.
 2. UPDATE pg_database SET datallowconn = true
 3. Consume precisely 131071 mxids.  Number of offsets per mxid is unimportant.
 4. vacuumdb --freeze --all

 Expected state after those steps:
 $ pg_controldata | grep NextMultiXactId
 Latest checkpoint's NextMultiXactId:  131072

 Checkpoint will fail like this:
 26699 2015-05-31 17:22:33.134 GMT LOG:  statement: checkpoint
 26661 2015-05-31 17:22:33.134 GMT DEBUG:  performing replication slot 
 checkpoint
 26661 2015-05-31 17:22:33.136 GMT ERROR:  could not access status of 
 transaction 131072
 26661 2015-05-31 17:22:33.136 GMT DETAIL:  Could not open file 
 pg_multixact/offsets/0002: No such file or directory.
 26699 2015-05-31 17:22:33.234 GMT ERROR:  checkpoint request failed
 26699 2015-05-31 17:22:33.234 GMT HINT:  Consult recent messages in the 
 server log for details.
 26699 2015-05-31 17:22:33.234 GMT STATEMENT:  checkpoint

 This does not block startup, and creating one mxid hides the problem again.
 Thus, it is not a top-priority bug like some other parts of this thread.  I
 mention it today mostly so it doesn't surprise hackers testing other fixes.

Thanks.   As mentioned elsewhere in the thread, I discovered that the
same problem exists for page boundaries, with a different error
message.  I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1
and master with the same results:

FATAL:  could not access status of transaction 2048
DETAIL:  Could not read from file pg_multixact/offsets/ at
offset 8192: Undefined error: 0.

FATAL:  could not access status of transaction 131072
DETAIL:  Could not open file pg_multixact/offsets/0002: No such file
or directory.

But, yeah, this isn't the bug we're looking for.

-- 
Thomas Munro
http://www.enterprisedb.com


checkpoint-page-boundary.sh
Description: Bourne shell script


checkpoint-segment-boundary.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

Robert Haas wrote:

 So here's a patch taking a different approach.

I tried to apply this to 9.3 but it's messy because of pgindent.  Anyone
would have a problem with me backpatching a pgindent run of multixact.c?

Also, you have a new function SlruPageExists, but we already have
SimpleLruDoesPhysicalPageExist.  No need for two copies of pretty much
the same code, I think.  Your code uses fstat() instead of
lseek(.., SEEK_END) but the exact technique used is probably not
relevant.

I think I like this approach better than your other patch, FWIW --
mainly because it seems simpler.

Will review.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

Alvaro Herrera wrote:

 Really, the whole question of how this code goes past the open() failure
 in SlruPhysicalReadPage baffles me.  I don't see any possible way for
 the file to be created ...

Hmm, the checkpointer can call TruncateMultiXact when in recovery, on
restartpoints. I wonder if in recovery this could trigger file creation.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund

On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
 Andres Freund wrote:
  That's not necessarily the case though, given how the code currently
  works. In a bunch of places the SLRUs are accessed *before* having been
  made consistent by WAL replay. Especially if several checkpoints/vacuums
  happened during the base backup the assumed state (i.e. the mxacts
  checkpoints refer to) of the data directory soon after the initial
  start, and the state of pg_multixact/ won't necessarily match at all.
 
 Well, the log extract is quite simple; it says that as soon as the
 standby starts replay WAL, it fetches exactly one WAL segment, which
 contains exactly one online checkpoint record; this record contains
 oldestMulti=4624559, which belongs in offset file 0046.  But the
 basebackup only contains files starting from 004B onwards.  The base
 backup takes a long time, so my guess of what happened is that the
 backup label points to the start of the WAL stream (obviously), then the
 data files are copied to the standby; during this long process, a
 multixact truncation happens.  So by the time the base backup reaches
 the pg_multixact directory, the 0046 file has been removed.

Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access a SLRU while not yet
consistent...

 One idea I had was: what if the oldestMulti pointed to another multi
 earlier in the same 0046 file, so that it is read-as-zeroes (and the
 file is created), and then a subsequent multixact truncate tries to read
 a later page in the file.  In SlruPhysicalReadPage() this would give a
 change for open() to not fail, and then read() can fail as above.
 However, we would have an earlier LOG message about reading as zeroes.
 
 Really, the whole question of how this code goes past the open() failure
 in SlruPhysicalReadPage baffles me.  I don't see any possible way for
 the file to be created ...

Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero sized segment pretty much would lead to this error,
right? Or were you able to check how things look after the failure?

 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
 000173DD00AD from archive
 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access 
 status of transaction 4624559
 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
 file pg_multixact/offsets/0046 at offset 147456: Success.
 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 
 4782) exited with exit code 1
 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to 
 startup process failure

From this isn't not entirely clear where this error was triggered from.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund

On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
 Thomas Munro wrote:
  On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
  wrote:
   My guess is that the file existed, and perhaps had one or more pages,
   but the wanted page doesn't exist, so we tried to read but got 0 bytes
   back.  read() returns 0 in this case but doesn't set errno.
  
   I didn't find a way to set things so that the file exists but is of
   shorter contents than oldestMulti by the time the checkpoint record is
   replayed.
  
  I'm just starting to learn about the recovery machinery, so forgive me
  if I'm missing something basic here, but I just don't get this.  As I
  understand it, offsets/0046 should either have been copied with that
  page present in it if it existed before the backup started (apparently
  not in this case), or extended to contain it by WAL records that come
  after the backup label but before the checkpoint record that
  references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup the assumed state (i.e. the mxacts
checkpoints refer to) of the data directory soon after the initial
start, and the state of pg_multixact/ won't necessarily match at all.

 Exactly --- that's the spot at which I am, also.  I have had this
 spinning in my head for three days now, and tried every single variation
 that I could think of, but like you I was unable to reproduce the issue.
 However, our customer took a second base backup and it failed in exactly
 the same way, module some changes to the counters (the file that
 didn't exist was 004B rather than 0046).  I'm still at a loss at what
 the failure mode is.  We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about hte case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

Thomas Munro wrote:

 I have finally reproduced that error!  See attached repro shell script.
 
 The conditions are:
 
 1.  next multixact == oldest multixact (no active multixacts, pointing
 past the end)
 2.  next multixact would be the first item on a new page (multixact % 2048 == 
 0)
 3.  the page must not be the first in a segment (or we'd get the
 read-zeroes case)
 
 That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
 vacuum followed by no multixact creations right before your backup
 checkpoint).  That seems like reasonably low odds... if it happened
 twice in a row, maybe I'm missing something here and there is some
 other way to get this...

Thanks, this is pretty cool (as far as these things go), but it's not
the scenario I see, in which the complained-about segment is very old by
the time it's truncated away by a checkpoint after freeze.  Segment
requested by checkpoint.oldestMulti is 0046 (offset 140k something --
just to clarify it's not at segment boundary), and the base backup
contains segments from 004B to 00D5.  My problem scenario has
oldestMulti close to 5 million and nextMulti close to 14 million.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

Andres Freund wrote:
 On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
  Thomas Munro wrote:
   On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
   wrote:
My guess is that the file existed, and perhaps had one or more pages,
but the wanted page doesn't exist, so we tried to read but got 0 bytes
back.  read() returns 0 in this case but doesn't set errno.
   
I didn't find a way to set things so that the file exists but is of
shorter contents than oldestMulti by the time the checkpoint record is
replayed.
   
   I'm just starting to learn about the recovery machinery, so forgive me
   if I'm missing something basic here, but I just don't get this.  As I
   understand it, offsets/0046 should either have been copied with that
   page present in it if it existed before the backup started (apparently
   not in this case), or extended to contain it by WAL records that come
   after the backup label but before the checkpoint record that
   references it (also apparently not in this case).
 
 That's not necessarily the case though, given how the code currently
 works. In a bunch of places the SLRUs are accessed *before* having been
 made consistent by WAL replay. Especially if several checkpoints/vacuums
 happened during the base backup the assumed state (i.e. the mxacts
 checkpoints refer to) of the data directory soon after the initial
 start, and the state of pg_multixact/ won't necessarily match at all.

Well, the log extract is quite simple; it says that as soon as the
standby starts replay WAL, it fetches exactly one WAL segment, which
contains exactly one online checkpoint record; this record contains
oldestMulti=4624559, which belongs in offset file 0046.  But the
basebackup only contains files starting from 004B onwards.  The base
backup takes a long time, so my guess of what happened is that the
backup label points to the start of the WAL stream (obviously), then the
data files are copied to the standby; during this long process, a
multixact truncation happens.  So by the time the base backup reaches
the pg_multixact directory, the 0046 file has been removed.

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
000173DD00AD from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status 
of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
file pg_multixact/offsets/0046 at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) 
exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to 
startup process failure

One idea I had was: what if the oldestMulti pointed to another multi
earlier in the same 0046 file, so that it is read-as-zeroes (and the
file is created), and then a subsequent multixact truncate tries to read
a later page in the file.  In SlruPhysicalReadPage() this would give a
change for open() to not fail, and then read() can fail as above.
However, we would have an earlier LOG message about reading as zeroes.

Really, the whole question of how this code goes past the open() failure
in SlruPhysicalReadPage baffles me.  I don't see any possible way for
the file to be created ...

  Exactly --- that's the spot at which I am, also.  I have had this
  spinning in my head for three days now, and tried every single variation
  that I could think of, but like you I was unable to reproduce the issue.
  However, our customer took a second base backup and it failed in exactly
  the same way, module some changes to the counters (the file that
  didn't exist was 004B rather than 0046).  I'm still at a loss at what
  the failure mode is.  We must be missing some crucial detail ...
 
 I might have missed it in this already long thread. Could you share a
 bunch of details about hte case? It'd be very interesting to see the
 contents of the backup label (to see where start/end are), the contents
 of the initial checkpoint (to see which mxacts we assume to exist at
 start) and what the initial contents of pg_multixact are (to match up).

pg_xlogdump says about the checkpoint record:

rmgr: XLOGlen (rec/tot): 72/   104, tx:  0, lsn: 
73DD/AD834470, prev 73DD/AD8343B8, bkp: , desc: checkpoint: redo 
73DD/4003F648; tli 1; prev tli 1; fpw true; xid 1/3530575042; oid 81303440; 
multi 14003656; offset 66311341; oldest xid 2530472730 in DB 17081243; oldest 
multi 4624559 in DB 17081243; oldest running xid 3530551992; online

I don't have the backup label.  But a listing of the files in
pg_multixact/offsets in the master has this:

./offsets:
total 35816
drwx-- 2 postgres postgres 4096 May 27 16:25 .
drwx-- 4 postgres postgres 4096 Jan 3 03:32 ..
-rw--- 1 postgres postgres 262144 Apr 20 13:07 004C
-rw--- 1 postgres postgres 262144 Apr 20 14:36 004D
-rw---

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro

On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
 My guess is that the file existed, and perhaps had one or more pages,
 but the wanted page doesn't exist, so we tried to read but got 0 bytes
 back.  read() returns 0 in this case but doesn't set errno.

 I didn't find a way to set things so that the file exists but is of
 shorter contents than oldestMulti by the time the checkpoint record is
 replayed.

I'm just starting to learn about the recovery machinery, so forgive me
if I'm missing something basic here, but I just don't get this.  As I
understand it, offsets/0046 should either have been copied with that
page present in it if it existed before the backup started (apparently
not in this case), or extended to contain it by WAL records that come
after the backup label but before the checkpoint record that
references it (also apparently not in this case).  If neither of these
things happened then that is completely different from the
segment-does-not-exist case where we read zeroes if in recovery on the
assumption that later WAL records must be about to delete the file.
There is no way that future WAL records will make an existing segment
file shorter! So at this point don't we know that there is something
wrong with the backup itself?

Put another way, if you bring this up under 9.4.1, won't it also be
unable to access multixact 4624559 at this point?  Of course it won't
try to do so during recovery like 9.4.2 does, but I'm just trying to
understand how this is supposed to work for 9.4.1 if it needs to
access that multixact for other reasons once normal running is reached
(say you recover up to that checkpoint, and then run
pg_get_multixact_members, or a row has that xmax and its members to be
looked up by a vacuum or any normal transaction).  In other words,
isn't this a base backup that is somehow broken, not at all like the
pg_upgrade corruption case which involved the specific case of
multixact 1 and an entirely missing segment file, and 9.4.2 just tells
you about it sooner than 9.4.1?

For what it's worth, I've also spent a lot of time trying to reproduce
basebackup problems with multixact creation, vacuums and checkpoints
injected at various points between copying backup label, pg_multixact,
and pg_control.  I have so far failed to produce anything more
interesting than the 'reading zeroes' case (see attached
copy-after-trunction.sh) and a case where the control file points at
a segment that doesn't exist, but it doesn't matter because the backup
label points at a checkpoint from a time when it did and
oldestMultiXactId is updated from there, and then procedes exactly as
it should (see copy-before-truncation.sh).  I updated my scripts to
look a bit more like your nicely automated example (though mine use a
different trick to create small quantities of multixacts so they run
against unpatched master).  I have also been considering a scenario
where multixact ID wraparound occurs during basebackup with some
ordering that causes trouble, but I don't yet see why it would break
if you replay the WAL from the backup label checkpoint (and I think
the repro would take days/weeks to run...)

-- 
Thomas Munro
http://www.enterprisedb.com


copy-after-truncation.sh
Description: Bourne shell script


copy-before-truncation.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Alvaro Herrera

Thomas Munro wrote:
 On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera alvhe...@2ndquadrant.com 
 wrote:
  My guess is that the file existed, and perhaps had one or more pages,
  but the wanted page doesn't exist, so we tried to read but got 0 bytes
  back.  read() returns 0 in this case but doesn't set errno.
 
  I didn't find a way to set things so that the file exists but is of
  shorter contents than oldestMulti by the time the checkpoint record is
  replayed.
 
 I'm just starting to learn about the recovery machinery, so forgive me
 if I'm missing something basic here, but I just don't get this.  As I
 understand it, offsets/0046 should either have been copied with that
 page present in it if it existed before the backup started (apparently
 not in this case), or extended to contain it by WAL records that come
 after the backup label but before the checkpoint record that
 references it (also apparently not in this case).

Exactly --- that's the spot at which I am, also.  I have had this
spinning in my head for three days now, and tried every single variation
that I could think of, but like you I was unable to reproduce the issue.
However, our customer took a second base backup and it failed in exactly
the same way, module some changes to the counters (the file that
didn't exist was 004B rather than 0046).  I'm still at a loss at what
the failure mode is.  We must be missing some crucial detail ...


-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
 On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund and...@anarazel.de wrote:
  The lack of WAL logging actually has caused problems in the 9.3.3 (?)
  era, where we didn't do any truncation during recovery...
 
 Right, but now we're piggybacking on the checkpoint records, and I
 don't have any evidence that this approach can't be made robust.  It's
 possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made work without problems. But I
think robust is something different. Having to look at slrus, during
recovery, to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

  By the time we've reached the minimum recovery point, they will have
  been recreated by the same WAL records that created them in the first
  place.
 
  I'm not sure that's true. I think we could end up errorneously removing
  files that were included in the base backup. Anyway, let's focus on your
  patch for now.
 
 OK, but I am interested in discussing the other thing too.  I just
 can't piece together the scenario myself - there may well be one.  The
 base backup will begin replay from the checkpoint caused by
 pg_start_backup() and remove anything that wasn't there at the start
 of the backup.  But all of that stuff should get recreated by the time
 we reach the minimum recovery point (end of backup).

I'm not sure if it's reprouceably borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   inbetween
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup
where the first call to TruncateMultiXact() will have
MultiXactState-lastCheckpointedOldest wildly inconsistent with
GetOldestMultiXactOnDisk() return value. Possibly leading to truncation
being skipped errorneously.  Whether that's a problem I'm not yet
entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent. There very well could not be a valid members segment at
that point?  Am I missing something?

  I'm more worried about the cases where we didn't ever actually badly
  wrap around (i.e. overwrite needed data); but where that's not clear on
  the standby because the base backup isn't in a consistent state.
 
 I agree. The current patch tries to make it so that we never call
 find_multixact_start() while in recovery, but it doesn't quite
 succeed: the call in TruncateMultiXact still happens during recovery,
 but only once we're sure that the mxact we plan to call it on actually
 exists on disk.  That won't be called until we replay the first
 checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll
called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other. Or
in themselves for that matter, the copy could very well have been in the
middle of a write the slru page.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch n...@leadboat.com wrote:
 On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
 On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch n...@leadboat.com wrote:
  On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
  SetMultiXactIdLimit() bracketed certain parts of its
  logic with if (!InRecovery), but those guards were ineffective because
  it gets called before InRecovery is set in the first place.
 
  SetTransactionIdLimit() checks InRecovery for the same things, and it is
  called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
  sense of whether it is subject to similar problems as a result?

 Well, I think it's pretty weird that those things will get done before
 beginning recovery, even on an inconsistent cluster, but not during
 recovery.  That is pretty strange.  I don't see that it can actually
 do any worse than emit a few log messages at the start of recovery
 that won't show up again until the end of recovery, though.

 Granted.  Would it be better to update both functions at the same time, and
 perhaps to make that a master-only change?  Does the status quo cause more
 practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

It does in the case of the call to find_multixact_start().  If that
fails, we take the server down for no good reason, as demonstrated by
the original report. I'll revert the changes to the other two things
in this function that I changed to be protected by did_trim.  There's
no special reason to think that's a necessary change.

  1. Moves the call to DetermineSafeOldestOffset() that appears in
  StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
  until we're consistent.  Also, instead of passing
  MultiXactState-oldestMultiXactId, pass the newer of that value and
  the earliest offset that exists on disk.  That way, it won't try to
  read data that's not there.
 
  Perhaps emit a LOG message when we do that, since it's our last 
  opportunity to
  point to the potential data loss?

 If the problem is just that somebody minmxid got set to 1 instead of
 the real value, I think that there is no data loss, because none of
 those older values are actually present there.  But we could add a LOG
 message anyway.  How do you suggest that we phrase that?

 There's no data loss if 1 = true_minmxid = 2^31 at the time minmxid got set
 to 1.  Otherwise, data loss is possible.

Yes, but in that scenario, the log message you propose wouldn't be
triggered.  If true_minmxid  2^31, then the stored minmxid will not
precede the files on disk; it will follow it (assuming the older stuff
hasn't been truncated, as is likely).  So the message would be
essentially:

LOG: you didn't lose data.  but if exactly the opposite of what this
message is telling you about had happened, then you would have.
DETAIL: Have a nice day.

 I don't hope for an actionable
 message, but we might want a reporter to grep logs for it when we diagnose
 future reports.  Perhaps this:

   missing pg_multixact/members files; begins at MultiXactId %u, expected %u

This seems misleading.  In the known failure case, it's not that the
pg_multixact files are unexpectedly missing; it's that we incorrectly
think that they should still be there.  Maybe:

oldest MultiXactId on disk %u follows expected oldest MultiXact %u

 For good measure, perhaps emit this when lastCheckpointedOldest  earliest by
 more than one segment:

   excess pg_multixact/members files; begins at MultiXactId %u, expected %u

So, this scenario will happen whenever the system was interrupted in
the middle of a truncation, or when the system is started from a base
backup and a truncation happened after files were copied.  I'm wary of
giving users the idea that this is an atypical event.  Perhaps a
message at DEBUG1?

 I'm not sure what you mean about it becoming too old.  At least with
 that fix, it should get updated to exactly the first file that we
 didn't remove.  Isn't that right?

 Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
 only in that it never reads or writes the cache.  I might expect
 oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
 does hold most of the time.  It does not always hold between the start of the
 quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
 That's because raw_GOMXOD() changes at the instant we unlink the oldest
 segment, but oldestMultiXactOnDisk changes later.  Any backend may call
 GetOldestMultiXactOnDisk() via SetMultiXactIdLimit().  If a backend does that
 concurrent with the checkpointer running TruncateMultiXact() and sees a stale
 oldestMultiXactOnDisk, is that harmless?

As far as I can tell, it's pretty much harmless.  I mean, we've
already discussed the risk that the head and tail could get too far
apart, because really it should be TruncateMultiXact(), not
SetMultiXactIdLimit(), that establishes the new stop point.  But

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-06-02 11:29:24 -0400, Robert Haas wrote:
 On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund and...@anarazel.de wrote:
  But what *definitely* looks wrong to me is that a TruncateMultiXact() in
  this scenario now (since a couple weeks ago) does a
  SimpleLruReadPage_ReadOnly() in the members slru via
  find_multixact_start(). That just won't work acceptably when we're not
  yet consistent. There very well could not be a valid members segment at
  that point?  Am I missing something?
 
 Yes: that code isn't new.

Good point.

 TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
 9.3.0 and every subsequent release until 9.3.7/9.4.2.

But back then TruncateMultiXact() wasn't called during recovery. But
you're right in that it didn't seem to have reproduced attributable
bugreprorts since 9.3.2 where vacuuming during recovery was
introduced. So it indeed doesn't seem as urgent as fixing the new
callsites.

 That would be a departure from the behavior of every existing release
 that includes this code based on, to my knowledge, zero trouble
 reports.

On the other hand we're now at about bug #5 attributeable to the odd way
truncation works for multixacts. It's obviously complex and hard to get
right. It makes it harder to cope with the wrong values left in
datminxid etc. So I'm still wondering whether fixing this for good isn't
the better approach.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
 I'm having trouble figuring out what to do about this.  I mean, the
 essential principle of this patch is that if we can't count on
 relminmxid, datminmxid, or the control file to be accurate, we can at
 least look at what is present on the disk.  If we also cannot count on
 that to be accurate, we are left without any reliable source of
 information.  Consider a hypothetical cluster where all our stored
 minmxids of whatever form are corrupted (say, all change to 1) and in
 addition there are stray files in pg_multixact.  I don't think there's
 really any way to get ourselves out of trouble in that scenario.

 If we were to truncate after vacuum, and only on the primary (via WAL
 logging), we could, afaics, just rely on all the values to be
 recomputed. I mean relminmxid will be recomputed after a vacuum, and
 thus, after some time, will datminmxid and the control file value.  We
 could just force a value of 1 to always trigger anti-wraparound vacuums
 (or wait for that to happen implicitly, to delay the impact?). That'll
 then should then fix the problem in a relatively short amount of time?

The exact circumstances under which we're willing to replace a
relminmxid with a newly-computed one that differs are not altogether
clear to me, but there's an if statement protecting that logic, so
there are some circumstances in which we'll leave the existing value
intact.  If we force non-stop vacuuming in that scenario, autovacuum
will just run like crazy without accomplishing anything, which
wouldn't be good.  It would similarly do so when the oldest MXID
reference in the relation is in fact 1, but that value can't be
vacuumed away yet.

Also, the database might be really big.  Even if it were true that a
full scan of every table would get us out of this state, describing
the time that it would take to do that as relatively short seems to
me to be considerably understating the impact of a full-cluster
VACUUM.

With regard to the more general question of WAL-logging this, are you
going to work on that?  Are you hoping Alvaro or I will work on that?
Should we draw straws?  It seems like somebody needs to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
 I'm having trouble figuring out what to do about this.  I mean, the
 essential principle of this patch is that if we can't count on
 relminmxid, datminmxid, or the control file to be accurate, we can at
 least look at what is present on the disk.  If we also cannot count on
 that to be accurate, we are left without any reliable source of
 information.  Consider a hypothetical cluster where all our stored
 minmxids of whatever form are corrupted (say, all change to 1) and in
 addition there are stray files in pg_multixact.  I don't think there's
 really any way to get ourselves out of trouble in that scenario.

If we were to truncate after vacuum, and only on the primary (via WAL
logging), we could, afaics, just rely on all the values to be
recomputed. I mean relminmxid will be recomputed after a vacuum, and
thus, after some time, will datminmxid and the control file value.  We
could just force a value of 1 to always trigger anti-wraparound vacuums
(or wait for that to happen implicitly, to delay the impact?). That'll
then should then fix the problem in a relatively short amount of time?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund and...@anarazel.de wrote:
 But what *definitely* looks wrong to me is that a TruncateMultiXact() in
 this scenario now (since a couple weeks ago) does a
 SimpleLruReadPage_ReadOnly() in the members slru via
 find_multixact_start(). That just won't work acceptably when we're not
 yet consistent. There very well could not be a valid members segment at
 that point?  Am I missing something?

Yes: that code isn't new.

TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
9.3.0 and every subsequent release until 9.3.7/9.4.2.  The only thing
that's changed is that we've moved that logic into a function called
find_multixact_start() instead of having it directly inside that
function.  We did that because we needed to use the same logic in some
other places.  The reason why 9.3.7/9.4.2 are causing problems for
people that they didn't have previously is because those new,
additional call sites were poorly chosen and didn't include adequate
protection against calling that function with an invalid input value.
What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact().  It's kooky, and
there may well be bugs buried there.  But I don't think fixing that
should be the priority right now, because we have zero reports of
problems attributable to that logic.  I think the priority should be
on undoing the damage that we did in 9.3.7/9.4.2, when we made other
places to do the same thing.  We started getting trouble reports
attributable to those changes *almost immediately*, which means that
whether or not TruncateMultiXact() is broken, these new call sites
definitely are.  I think we really need to fix those new places ASAP.

 I think at the very least we'll have to skip this step while not yet
 consistent. That really sucks, because we'll possibly end up with
 multixacts that are completely filled by the time we've reached
 consistency.

That would be a departure from the behavior of every existing release
that includes this code based on, to my knowledge, zero trouble
reports.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund and...@anarazel.de wrote:
 That would be a departure from the behavior of every existing release
 that includes this code based on, to my knowledge, zero trouble
 reports.

 On the other hand we're now at about bug #5 attributeable to the odd way
 truncation works for multixacts. It's obviously complex and hard to get
 right. It makes it harder to cope with the wrong values left in
 datminxid etc. So I'm still wondering whether fixing this for good isn't
 the better approach.

It may well be.  But I think we should do something more surgical
first.  Perhaps we can justify the pain and risk of making changes to
the WAL format in the back-branches, but let's not do it in a rush.
If we can get this patch to a state where it undoes the damage
inflicted in 9.3.7/9.4.2, then we will be in a state where we have as
much reliability as we had in 9.3.6 plus the protections against
member-space wraparound added in 9.3.7 - which, like the patch I'm
proposing now, were directly motivated by multiple, independent bug
reports.  That seems like a good place to get to.  If nothing else, it
will buy us some time to figure out what else we want to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Noah Misch

On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote:
 On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch n...@leadboat.com wrote:
  On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:

  Granted.  Would it be better to update both functions at the same time, and
  perhaps to make that a master-only change?  Does the status quo cause more
  practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?
 
 It does in the case of the call to find_multixact_start().  If that
 fails, we take the server down for no good reason, as demonstrated by
 the original report. I'll revert the changes to the other two things
 in this function that I changed to be protected by did_trim.

Sounds good.

  There's no data loss if 1 = true_minmxid = 2^31 at the time minmxid got 
  set
  to 1.  Otherwise, data loss is possible.
 
 Yes, but in that scenario, the log message you propose wouldn't be
 triggered.  If true_minmxid  2^31, then the stored minmxid will not
 precede the files on disk; it will follow it (assuming the older stuff
 hasn't been truncated, as is likely).

Ah, quite right.

missing pg_multixact/members files; begins at MultiXactId %u, expected 
  %u
 
 This seems misleading.  In the known failure case, it's not that the
 pg_multixact files are unexpectedly missing; it's that we incorrectly
 think that they should still be there.  Maybe:
 
 oldest MultiXactId on disk %u follows expected oldest MultiXact %u

Your wording is better.

  For good measure, perhaps emit this when lastCheckpointedOldest  earliest 
  by
  more than one segment:
 
excess pg_multixact/members files; begins at MultiXactId %u, expected %u
 
 So, this scenario will happen whenever the system was interrupted in
 the middle of a truncation, or when the system is started from a base
 backup and a truncation happened after files were copied.  I'm wary of
 giving users the idea that this is an atypical event.  Perhaps a
 message at DEBUG1?

DEBUG1 works for me, or feel free to leave it out.

  I can see that there might be an issue there, but I can't quite put my
  finger on it well enough to say that it definitely is an issue.  This
  code is examining the offsets space rather than the members space, and
  the protections against offsets wraparound have been there since the
  original commit of this feature
  (0ac5ad5134f2769ccbaefec73844f8504c4d6182).  To my knowledge we have
  no concrete evidence that there's ever been a problem in this area.
  It seems like it might be safer to rejigger that code so that it
  considers distance-behind-current rather than using the wrapped
  comparison logic, but I'm reluctant to start rejiggering more things
  without knowing what specifically I'm fixing.
 
  Anything that could cause the pg_multixact/offsets tail to rotate from being
  in the past to being in the future poses this risk.  (This is the tail from
  the perspective of files on disk; pg_control, datminmxid, and MultiXactState
  notions of the tail play no part here.)  I had in mind that the pg_upgrade
  datminmxid=1 bug could be a tool for achieving that, but I've been
  unsuccessful so far at building a credible thought experiment around it.  
  Near
  the beginning of your reply, you surmised that this could happen between a
  VACUUM's SetMultiXactIdLimit() and the next checkpoint's 
  TruncateMultiXact().
  Another vector is unlink() failure on a segment file.  SlruDeleteSegment()
  currently ignores the unlink() return value; the only harm has been some 
  disk
  usage.  With GetOldestMultiXactOnDisk() as-proposed, successful unlink() is
  mandatory to forestall the appearance of a wrapped state.
 
 I'm having trouble figuring out what to do about this.  I mean, the
 essential principle of this patch is that if we can't count on
 relminmxid, datminmxid, or the control file to be accurate, we can at
 least look at what is present on the disk.  If we also cannot count on
 that to be accurate, we are left without any reliable source of
 information.  Consider a hypothetical cluster where all our stored
 minmxids of whatever form are corrupted (say, all change to 1) and in
 addition there are stray files in pg_multixact.  I don't think there's
 really any way to get ourselves out of trouble in that scenario.

We could notice the problem and halt.  You mentioned above the possibility to
have SlruScanDirCbFindEarliest() check distance-behind-current.  Suppose it
used the current oldestMulti from pg_control as a reference point and
discovered the multixact on disk (a) most far behind that reference point and
(b) most far ahead of that reference point.  If MultiXactIdPrecedes(a, b),
we're good; adopt (a) as the new datminmxid.  Account for the possibility that
the space isn't really wrapped, but out reference point was totally wrong.
Once we determine that the space is wrapped, bail.

As you discuss downthread, the mxids actually present in t_max define the ones
we need.  We could scan them all and set authoritative

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
 The exact circumstances under which we're willing to replace a
 relminmxid with a newly-computed one that differs are not altogether
 clear to me, but there's an if statement protecting that logic, so
 there are some circumstances in which we'll leave the existing value
 intact.

I guess we'd have to change that then.

 It would similarly do so when the oldest MXID reference in the
 relation is in fact 1, but that value can't be vacuumed away yet.

I'd thought of just forcing consumption of one multixact in that
case. Not pretty, but imo acceptable.

 Also, the database might be really big.  Even if it were true that a
 full scan of every table would get us out of this state, describing
 the time that it would take to do that as relatively short seems to
 me to be considerably understating the impact of a full-cluster
 VACUUM.

Well. We're dealing with a corrupted cluster. Having a way out that's
done automatically, even if it takes a long while, is pretty good
already. In many cases the corruption (i.e. pg_upgrade) happened long
ago, so the table's relminmxid will already have been recomputed.  I
think that's acceptable.

 With regard to the more general question of WAL-logging this, are you
 going to work on that?  Are you hoping Alvaro or I will work on that?
 Should we draw straws?  It seems like somebody needs to do it.

I'm willing to invest the time to develop an initial version, but I'll
need help evaluating it. I don't have many testing resources available
atm, and I'm not going to trust stuff I developed while travelling by
just looking at the code.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
 The exact circumstances under which we're willing to replace a
 relminmxid with a newly-computed one that differs are not altogether
 clear to me, but there's an if statement protecting that logic, so
 there are some circumstances in which we'll leave the existing value
 intact.

 I guess we'd have to change that then.

Yeah, but first we'd need to assess why it's like that.  Tom was the
one who installed the current logic, but I haven't been able to fully
understand it.

 It would similarly do so when the oldest MXID reference in the
 relation is in fact 1, but that value can't be vacuumed away yet.

 I'd thought of just forcing consumption of one multixact in that
 case. Not pretty, but imo acceptable.

What if multixact 1 still has living members?

 Also, the database might be really big.  Even if it were true that a
 full scan of every table would get us out of this state, describing
 the time that it would take to do that as relatively short seems to
 me to be considerably understating the impact of a full-cluster
 VACUUM.

 Well. We're dealing with a corrupted cluster. Having a way out that's
 done automatically, even if it takes a long while, is pretty good
 already. In many cases the corruption (i.e. pg_upgrade) happened long
 ago, so the table's relminmxid will already have been recomputed.  I
 think that's acceptable.

I'm a long way from being convinced the automated recovery is
possible.  There are so many different scenarios here that it's very
difficult to reason generally about what the right thing to do is.
I agree it would be nice if we had it, though.

 With regard to the more general question of WAL-logging this, are you
 going to work on that?  Are you hoping Alvaro or I will work on that?
 Should we draw straws?  It seems like somebody needs to do it.

 I'm willing to invest the time to develop an initial version, but I'll
 need help evaluating it. I don't have many testing resources available
 atm, and I'm not going to trust stuff I developed while travelling by
 just looking at the code.

I'm willing to help with that.  Hopefully I'm not the only one, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Andres Freund

On 2015-05-31 07:51:59 -0400, Robert Haas wrote:
  1) We continue determining the oldest 
  SlruScanDirectory(SlruScanDirCbFindEarliest)
 on the master to find the oldest offsets segment to
 truncate. Alternatively, if we determine it to be safe, we could use
 oldestMulti to find that.
  2) SlruScanDirCbRemoveMembers is changed to return the range of members
 to remove, instead of doing itself
  3) We wal log [oldest offset segment guaranteed to not be alive,
 nextmulti) for offsets, and [oldest members segment guaranteed to not be 
  alive,
 nextmultioff), and redo truncations for the entire range during
 recovery.
 
  I'm pretty tired right now, but this sounds doable.
 
 I'm probably biased here, but I think we should finish reviewing,
 testing, and committing my patch before we embark on designing this.

Probably, yes. I am wondering whether doing this immediately won't end
up making some things simpler and more robust though.

 So far we have no reports of trouble attributable to the lack of the
 WAL-logging support discussed here, as opposed to several reports of
 trouble from the status quo within days of release.

The lack of WAL logging actually has caused problems in the 9.3.3 (?)
era, where we didn't do any truncation during recovery...

 By the time we've reached the minimum recovery point, they will have
 been recreated by the same WAL records that created them in the first
 place.

I'm not sure that's true. I think we could end up errorneously removing
files that were included in the base backup. Anyway, let's focus on your
patch for now.

 If, in the previous
 replay, we had wrapped all the way around, some of the stuff we keep
 may actually already have been overwritten by future WAL records, but
 they'll be overwritten again.  Now, that could mess up our
 determination of which members to remove, I guess, but I'm not clear
 that actually matters either: if the offsets space has wrapped around,
 the members space will certainly have wrapped around as well, so we
 can remove anything we like at this stage and we're still OK.  I agree
 this is ugly the way it is, but where is the actual bug?

I'm more worried about the cases where we didn't ever actually badly
wrap around (i.e. overwrite needed data); but where that's not clear on
the standby because the base backup isn't in a consistent state.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas

On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch n...@leadboat.com wrote:
 Incomplete review, done in a relative rush:

Thanks.

 On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
 OK, here's a patch.  Actually two patches, differing only in
 whitespace, for 9.3 and for master (ha!).  I now think that the root
 of the problem here is that DetermineSafeOldestOffset() and
 SetMultiXactIdLimit() were largely ignorant of the possibility that
 they might be called at points in time when the cluster was
 inconsistent.

 A cause perhaps closer to the root is commit f741300 moving truncation from
 VACUUM to checkpoints.  CLOG has given us deep experience with VACUUM-time
 truncation.  Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
 truncation up to the same level.

 Speaking of commit f6a6c46d, it seems logical that updating allocation stop
 limits should happen proximate to truncation.  That's currently the case for
 CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
 TruncateMultiXact() call does both).  However, pg_multixact/offsets is
 truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
 I did not distill an errant test case, but this is fishy.

Good point.  Because we confine ourselves to using half the offset
space, it seems much harder for us to get into trouble here than it is
with members. The first scenario that occurred to me is that the SLRU
might actually wrap.  That seems tough, though: between one checkpoint
and the next, vacuum would need to advance oldest_datminmxid by 2^31
MXIDs while generating 2^31 new ones, or something like that.  That
doesn't seem real plausible. But then it occurred to me that it's
probably sufficient to advance the head of the SLRU far enough that
TruncateMultiXact things that the tail is in the future instead of in
the past.  I see no reason why that couldn't happen.  Then we'd end up
leaving some files behind that we should have removed.  I'm not sure
exactly what problem that would cause; would they just get overwritten
on the next pass through the space, or would they cause errors?  I
have not had time to check.

 SetMultiXactIdLimit() bracketed certain parts of its
 logic with if (!InRecovery), but those guards were ineffective because
 it gets called before InRecovery is set in the first place.

 SetTransactionIdLimit() checks InRecovery for the same things, and it is
 called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
 sense of whether it is subject to similar problems as a result?

Well, I think it's pretty weird that those things will get done before
beginning recovery, even on an inconsistent cluster, but not during
recovery.  That is pretty strange.  I don't see that it can actually
do any worse than emit a few log messages at the start of recovery
that won't show up again until the end of recovery, though.

 1. Moves the call to DetermineSafeOldestOffset() that appears in
 StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
 until we're consistent.  Also, instead of passing
 MultiXactState-oldestMultiXactId, pass the newer of that value and
 the earliest offset that exists on disk.  That way, it won't try to
 read data that's not there.

 Perhaps emit a LOG message when we do that, since it's our last opportunity to
 point to the potential data loss?

If the problem is just that somebody minmxid got set to 1 instead of
the real value, I think that there is no data loss, because none of
those older values are actually present there.  But we could add a LOG
message anyway.  How do you suggest that we phrase that?

 +  * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
 +  * had bugs that could allow users who reached those release through

 s/release/releases/

Fixed.

 @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
   SimpleLruTruncate(MultiXactOffsetCtl,
 
 MultiXactIdToOffsetPage(oldestMXact));

 + /* Update oldest-on-disk value in shared memory. */
 + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
 + if (earliest  FirstMultiXactId)
 + earliest = FirstMultiXactId;
 + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 + Assert(MultiXactState-oldestMultiXactOnDiskValid);
 + MultiXactState-oldestMultiXactOnDiskValid = earliest;

 That last line needs s/Valid//, I presume.  Is it okay that
 oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
 Valid indicator remaining true?

Ay yai yay.  Yes, s/Valid//.

I'm not sure what you mean about it becoming too old.  At least with
that fix, it should get updated to exactly the first file that we
didn't remove.  Isn't that right?

 +static MultiXactOffset
 +GetOldestMultiXactOnDisk(void)
 +{

 + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, 
 trunc);
 + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
 + if (earliest

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas

On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund and...@anarazel.de wrote:
 I'm probably biased here, but I think we should finish reviewing,
 testing, and committing my patch before we embark on designing this.

 Probably, yes. I am wondering whether doing this immediately won't end
 up making some things simpler and more robust though.

I'm open to being convinced of that, but as of this moment I'm not
seeing any clear-cut evidence that we need to go so far.

 So far we have no reports of trouble attributable to the lack of the
 WAL-logging support discussed here, as opposed to several reports of
 trouble from the status quo within days of release.

 The lack of WAL logging actually has caused problems in the 9.3.3 (?)
 era, where we didn't do any truncation during recovery...

Right, but now we're piggybacking on the checkpoint records, and I
don't have any evidence that this approach can't be made robust.  It's
possible that it can't be made robust, but that's not currently clear.

 By the time we've reached the minimum recovery point, they will have
 been recreated by the same WAL records that created them in the first
 place.

 I'm not sure that's true. I think we could end up errorneously removing
 files that were included in the base backup. Anyway, let's focus on your
 patch for now.

OK, but I am interested in discussing the other thing too.  I just
can't piece together the scenario myself - there may well be one.  The
base backup will begin replay from the checkpoint caused by
pg_start_backup() and remove anything that wasn't there at the start
of the backup.  But all of that stuff should get recreated by the time
we reach the minimum recovery point (end of backup).

 If, in the previous
 replay, we had wrapped all the way around, some of the stuff we keep
 may actually already have been overwritten by future WAL records, but
 they'll be overwritten again.  Now, that could mess up our
 determination of which members to remove, I guess, but I'm not clear
 that actually matters either: if the offsets space has wrapped around,
 the members space will certainly have wrapped around as well, so we
 can remove anything we like at this stage and we're still OK.  I agree
 this is ugly the way it is, but where is the actual bug?

 I'm more worried about the cases where we didn't ever actually badly
 wrap around (i.e. overwrite needed data); but where that's not clear on
 the standby because the base backup isn't in a consistent state.

I agree. The current patch tries to make it so that we never call
find_multixact_start() while in recovery, but it doesn't quite
succeed: the call in TruncateMultiXact still happens during recovery,
but only once we're sure that the mxact we plan to call it on actually
exists on disk.  That won't be called until we replay the first
checkpoint, but that might still be prior to consistency.

Since I forgot to attach the revised patch with fixes for the points
Noah mentioned to that email, here it is attached to this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
Author: Robert Haas rhaas@postgresql.org
Date:   Fri May 29 14:35:53 2015 -0400

foo

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..aca829d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -199,8 +199,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that may still be referenced from a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
@@ -213,6 +214,18 @@ typedef struct MultiXactStateData
 	 */
 	MultiXactId lastCheckpointedOldest;
 
+	/*
+	 * This is the oldest file that actually exist on the disk.  This value
+	 * is initialized by scanning pg_multixact/offsets, and subsequently
+	 * updated each time we complete a truncation.  We need a flag to
+	 * indicate whether this has been initialized yet.
+	 */
+	MultiXactId oldestMultiXactOnDisk;
+	bool		oldestMultiXactOnDiskValid;
+
+	/* Has TrimMultiXact been called yet? */
+	bool		didTrimMultiXact;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -344,6 +357,8 @@ static char *mxstatus_to_string(MultiXactStatus status);
 /* management of SLRU infrastructure */
 static int	ZeroMultiXactOffsetPage(int pageno, bool writeXlog);
 static int	ZeroMultiXactMemberPage(int pageno, bool writeXlog);
+static MultiXactOffset GetOldestMultiXactOnDisk(void);
+static MultiXactOffset GetOldestReferencedOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetPagePrecedes(int page1, int page2);
 static bool

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera

Thomas Munro wrote:

  - There's a third possible problem related to boundary cases in
  SlruScanDirCbRemoveMembers, but I don't understand that one well
  enough to explain it.  Maybe Thomas can jump in here and explain the
  concern.
 
 I noticed something in passing which is probably not harmful, and not
 relevant to this bug report, it was just a bit confusing while
 testing: 

Another thing I noticed in passing is that
SimpleLruDoesPhysicalFileExist uses SLRU_OPEN_FAILED in slru_errcase for
the case where lseek() fails, rather than SLRU_SEEK_FAILED.  This is
probably completely harmless, because that's lseek(..., SEEK_END), but
seems rather odd anyway.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera

Alvaro Herrera wrote:
 Robert Haas wrote:

  In the process of investigating this, we found a few other things that
  seem like they may also be bugs:
  
  - As noted upthread, replaying an older checkpoint after a newer
  checkpoint has already happened may lead to similar problems.  This
  may be possible when starting from an online base backup; or when
  restarting a standby that did not perform a restartpoint when
  replaying the last checkpoint before the shutdown.
 
 I'm going through this one now, as it's closely related and caused
 issues for us.

FWIW I've spent a rather long while trying to reproduce the issue, but
haven't been able to figure out.  Thomas already commented on it: the
server notices that a file is missing and, instead of failing, it reads
the file as zeroes.  This is because of this hack in slru.c, which is
there to cover for a pg_clog replay consideration:

/*
 * In a crash-and-restart situation, it's possible for us to receive
 * commands to set the commit status of transactions whose bits are in
 * already-truncated segments of the commit log (see notes in
 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
 * where the file doesn't exist, and return zeroes instead.
 */
fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
if (fd  0)
{
if (errno != ENOENT || !InRecovery)
{
slru_errcause = SLRU_OPEN_FAILED;
slru_errno = errno;
return false;
}

ereport(LOG,
(errmsg(file \%s\ doesn't exist, reading as 
zeroes,
path)));
MemSet(shared-page_buffer[slotno], 0, BLCKSZ);
return true;
}

I was able to cause an actual problem by #ifdefing out the if
errno/inRcovery line, of course, but how can I be sure that fixing that
would also fix my customer's problem?  That's no good.

Anyway here's a quick script to almost-reproduce the problem.  I
constructed many variations of this, trying to find the failing one, to
no avail.  Note I'm using the pg_burn_multixact() function to create
multixacts quickly (this is cheating in itself, but it seems to me that
if I'm not able to reproduce the problem with this, I'm not able to do
so without it either.)  This script takes an exclusive backup
(cp -pr) excluding pg_multixact, then does some multixact stuff
(including truncation), then copies pg_multixact.  Evidently I'm missing
some critical element but I can't see what it is.

Another notable variant was to copy pg_multixact first, then do stuff
(create lots of multixacts, then mxact-freeze/checkpoint),
then copy the rest of the data dir.  No joy either: when replay starts,
the missing multixacts are created on the standby and so by the time we
reach the checkpoint record the pg_multixact/offset file has already
grown to the necessary size.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera

Alvaro Herrera wrote:

 Anyway here's a quick script to almost-reproduce the problem.

Meh.  Really attached now.

I also wanted to post the error messages we got:

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file 
000173DD00AD from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status 
of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from 
file pg_multixact/offsets/0046 at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) 
exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to 
startup process failure

We pg_xlogdumped the offending segment and see that there is a
checkpoint record with oldestMulti=4624559.  Curiously, there are no
other records with rmgr=MultiXact in that segment.

Note that the file exists (otherwise the error would say could not
open rather than could not read); this is also why errno says
Success rather than some actual error; this is the code:

errno = 0;
if (read(fd, shared-page_buffer[slotno], BLCKSZ) != BLCKSZ)
{
slru_errcause = SLRU_READ_FAILED;
slru_errno = errno;
CloseTransientFile(fd);
return false;
}

My guess is that the file existed, and perhaps had one or more pages,
but the wanted page doesn't exist, so we tried to read but got 0 bytes
back.  read() returns 0 in this case but doesn't set errno.

I didn't find a way to set things so that the file exists but is of
shorter contents than oldestMulti by the time the checkpoint record is
replayed.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


repro-chkpt-replay-failure.sh
Description: Bourne shell script

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Noah Misch

On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote:
 On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch n...@leadboat.com wrote:
  On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
  SetMultiXactIdLimit() bracketed certain parts of its
  logic with if (!InRecovery), but those guards were ineffective because
  it gets called before InRecovery is set in the first place.
 
  SetTransactionIdLimit() checks InRecovery for the same things, and it is
  called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
  sense of whether it is subject to similar problems as a result?
 
 Well, I think it's pretty weird that those things will get done before
 beginning recovery, even on an inconsistent cluster, but not during
 recovery.  That is pretty strange.  I don't see that it can actually
 do any worse than emit a few log messages at the start of recovery
 that won't show up again until the end of recovery, though.

Granted.  Would it be better to update both functions at the same time, and
perhaps to make that a master-only change?  Does the status quo cause more
practical harm via SetMultiXactIdLimit() than via SetTransactionIdLimit()?

  1. Moves the call to DetermineSafeOldestOffset() that appears in
  StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
  until we're consistent.  Also, instead of passing
  MultiXactState-oldestMultiXactId, pass the newer of that value and
  the earliest offset that exists on disk.  That way, it won't try to
  read data that's not there.
 
  Perhaps emit a LOG message when we do that, since it's our last opportunity 
  to
  point to the potential data loss?
 
 If the problem is just that somebody minmxid got set to 1 instead of
 the real value, I think that there is no data loss, because none of
 those older values are actually present there.  But we could add a LOG
 message anyway.  How do you suggest that we phrase that?

There's no data loss if 1 = true_minmxid = 2^31 at the time minmxid got set
to 1.  Otherwise, data loss is possible.  I don't hope for an actionable
message, but we might want a reporter to grep logs for it when we diagnose
future reports.  Perhaps this:

  missing pg_multixact/members files; begins at MultiXactId %u, expected %u

For good measure, perhaps emit this when lastCheckpointedOldest  earliest by
more than one segment:

  excess pg_multixact/members files; begins at MultiXactId %u, expected %u

  @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
SimpleLruTruncate(MultiXactOffsetCtl,
  
  MultiXactIdToOffsetPage(oldestMXact));
 
  + /* Update oldest-on-disk value in shared memory. */
  + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
  + if (earliest  FirstMultiXactId)
  + earliest = FirstMultiXactId;
  + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
  + Assert(MultiXactState-oldestMultiXactOnDiskValid);
  + MultiXactState-oldestMultiXactOnDiskValid = earliest;
 
  That last line needs s/Valid//, I presume.  Is it okay that
  oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite 
  its
  Valid indicator remaining true?
 
 Ay yai yay.  Yes, s/Valid//.
 
 I'm not sure what you mean about it becoming too old.  At least with
 that fix, it should get updated to exactly the first file that we
 didn't remove.  Isn't that right?

Consider a function raw_GOMXOD() that differs from GetOldestMultiXactOnDisk()
only in that it never reads or writes the cache.  I might expect
oldestMultiXactOnDisk==raw_GOMXOD() if oldestMultiXactOnDiskValid, and that
does hold most of the time.  It does not always hold between the start of the
quoted code's SimpleLruTruncate() and its oldestMultiXactOnDisk assignment.
That's because raw_GOMXOD() changes at the instant we unlink the oldest
segment, but oldestMultiXactOnDisk changes later.  Any backend may call
GetOldestMultiXactOnDisk() via SetMultiXactIdLimit().  If a backend does that
concurrent with the checkpointer running TruncateMultiXact() and sees a stale
oldestMultiXactOnDisk, is that harmless?

  +static MultiXactOffset
  +GetOldestMultiXactOnDisk(void)
  +{
 
  + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, 
  trunc);
  + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
  + if (earliest  FirstMultiXactId)
  + earliest = FirstMultiXactId;
 
  SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
  represent a wrapped state.  When the files do represent a wrapped state,
  MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
  result is sensitive to readdir() order.  This code exists specifically to 
  help
  us make do in the face of wrong catalog and pg_control entries.  We may have
  wrapped as a result of those of same catalog/pg_control entries, so I think
  this function ought to account for wrap.  I haven't given enough thought to
  what exactly it should

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch

On Fri, May 29, 2015 at 10:37:57AM +1200, Thomas Munro wrote:
 On Fri, May 29, 2015 at 7:56 AM, Robert Haas robertmh...@gmail.com wrote:
  - There's a third possible problem related to boundary cases in
  SlruScanDirCbRemoveMembers, but I don't understand that one well
  enough to explain it.  Maybe Thomas can jump in here and explain the
  concern.
 
 I noticed something in passing which is probably not harmful, and not
 relevant to this bug report, it was just a bit confusing while
 testing:  SlruScanDirCbRemoveMembers never deletes any files if
 rangeStart == rangeEnd.  In practice, if you have an idle cluster with
 a lot of multixact data and you VACUUM FREEZE all databases and then
 CHECKPOINT, you might be surprised to see no member files going away
 quite yet, but they'll eventually be truncated by a future checkpoint,
 once rangeEnd has had a chance to advance to the next page due to more
 multixacts being created.
 
 If we want to fix this one day, maybe the right thing to do is to
 treat the rangeStart == rangeEnd case the same way we treat rangeStart
  rangeEnd, that is, to assume that the range of pages isn't
 wrapped/inverted in this case.

I agree.  Because we round rangeStart down to a segment boundary, oldest and
next member offsets falling on the same page typically implies
rangeStartrangeEnd.  Only when the page they share happens to be the first
page of a segment does one observe rangeStart==RangeEnd.

While testing this (with inconsistent-multixact-fix-master.patch applied,
FWIW), I noticed a nearby bug with a similar symptom.  TruncateMultiXact()
omits the nextMXact==oldestMXact special case found in each other
find_multixact_start() caller, so it reads the offset of a not-yet-created
MultiXactId.  The usual outcome is to get rangeStart==0, so we truncate less
than we could.  This can't make us truncate excessively, because
nextMXact==oldestMXact implies no table contains any mxid.  If nextMXact
happens to be the first of a segment, an error is possible.  Procedure:

1. Make a fresh cluster.
2. UPDATE pg_database SET datallowconn = true
3. Consume precisely 131071 mxids.  Number of offsets per mxid is unimportant.
4. vacuumdb --freeze --all

Expected state after those steps:
$ pg_controldata | grep NextMultiXactId
Latest checkpoint's NextMultiXactId:  131072

Checkpoint will fail like this:
26699 2015-05-31 17:22:33.134 GMT LOG:  statement: checkpoint
26661 2015-05-31 17:22:33.134 GMT DEBUG:  performing replication slot checkpoint
26661 2015-05-31 17:22:33.136 GMT ERROR:  could not access status of 
transaction 131072
26661 2015-05-31 17:22:33.136 GMT DETAIL:  Could not open file 
pg_multixact/offsets/0002: No such file or directory.
26699 2015-05-31 17:22:33.234 GMT ERROR:  checkpoint request failed
26699 2015-05-31 17:22:33.234 GMT HINT:  Consult recent messages in the server 
log for details.
26699 2015-05-31 17:22:33.234 GMT STATEMENT:  checkpoint

This does not block startup, and creating one mxid hides the problem again.
Thus, it is not a top-priority bug like some other parts of this thread.  I
mention it today mostly so it doesn't surprise hackers testing other fixes.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch

Incomplete review, done in a relative rush:

On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote:
 OK, here's a patch.  Actually two patches, differing only in
 whitespace, for 9.3 and for master (ha!).  I now think that the root
 of the problem here is that DetermineSafeOldestOffset() and
 SetMultiXactIdLimit() were largely ignorant of the possibility that
 they might be called at points in time when the cluster was
 inconsistent.

A cause perhaps closer to the root is commit f741300 moving truncation from
VACUUM to checkpoints.  CLOG has given us deep experience with VACUUM-time
truncation.  Commit f6a6c46d and this patch are about bringing CHECKPOINT-time
truncation up to the same level.

Speaking of commit f6a6c46d, it seems logical that updating allocation stop
limits should happen proximate to truncation.  That's currently the case for
CLOG (vac_truncate_clog() does both) and pg_multixact/members (checkpoint's
TruncateMultiXact() call does both).  However, pg_multixact/offsets is
truncated from TruncateMultiXact(), but vac_truncate_clog() updates its limit.
I did not distill an errant test case, but this is fishy.

 SetMultiXactIdLimit() bracketed certain parts of its
 logic with if (!InRecovery), but those guards were ineffective because
 it gets called before InRecovery is set in the first place.

SetTransactionIdLimit() checks InRecovery for the same things, and it is
called at nearly the same moments as SetMultiXactIdLimit().  Do you have a
sense of whether it is subject to similar problems as a result?

 1. Moves the call to DetermineSafeOldestOffset() that appears in
 StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
 until we're consistent.  Also, instead of passing
 MultiXactState-oldestMultiXactId, pass the newer of that value and
 the earliest offset that exists on disk.  That way, it won't try to
 read data that's not there.

Perhaps emit a LOG message when we do that, since it's our last opportunity to
point to the potential data loss?

 +  * PostgreSQL 9.3.0 through 9.3.6 and PostgreSQL 9.4.0 through 9.4.1
 +  * had bugs that could allow users who reached those release through

s/release/releases/

 @@ -2859,6 +2947,14 @@ TruncateMultiXact(void)
   SimpleLruTruncate(MultiXactOffsetCtl,
 MultiXactIdToOffsetPage(oldestMXact));
  
 + /* Update oldest-on-disk value in shared memory. */
 + earliest = range.rangeStart * MULTIXACT_OFFSETS_PER_PAGE;
 + if (earliest  FirstMultiXactId)
 + earliest = FirstMultiXactId;
 + LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
 + Assert(MultiXactState-oldestMultiXactOnDiskValid);
 + MultiXactState-oldestMultiXactOnDiskValid = earliest;

That last line needs s/Valid//, I presume.  Is it okay that
oldestMultiXactOnDisk becomes too-old during TruncateMultiXact(), despite its
Valid indicator remaining true?

 +static MultiXactOffset
 +GetOldestMultiXactOnDisk(void)
 +{

 + SlruScanDirectory(MultiXactOffsetCtl, SlruScanDirCbFindEarliest, 
 trunc);
 + earliest = trunc.earliestExistingPage * MULTIXACT_OFFSETS_PER_PAGE;
 + if (earliest  FirstMultiXactId)
 + earliest = FirstMultiXactId;

SlruScanDirCbFindEarliest() is only meaningful if the files on disk do not
represent a wrapped state.  When the files do represent a wrapped state,
MultiXactIdPrecedes() is not transitive, and the SlruScanDirCbFindEarliest()
result is sensitive to readdir() order.  This code exists specifically to help
us make do in the face of wrong catalog and pg_control entries.  We may have
wrapped as a result of those of same catalog/pg_control entries, so I think
this function ought to account for wrap.  I haven't given enough thought to
what exactly it should do.

Thanks,
nm


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Robert Haas

On Sat, May 30, 2015 at 8:55 PM, Andres Freund and...@anarazel.de wrote:
 Is oldestMulti, nextMulti - 1 really suitable for this? Are both
 actually guaranteed to exist in the offsets slru and be valid?  Hm. I
 guess you intend to simply truncate everything else, but just in
 offsets?

oldestMulti in theory is the right thing, I think, but in actuality we
know that some people have 1 here instead of the correct value.

 One argument against this idea is that we may not want to keep a full
 set of member files on standbys (due to disk space usage), but that's
 what will happen unless we truncate during replay.

 I think that argument is pretty much the death-knell.=

Yes.  Truncating on the standby is really not optional.

  I think at least for 9.5+ we should a) invent proper truncation records
  for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
  The current hack of scanning the directories to get knowledge we should
  have is a pretty bad hack, and we should not continue using it forever.
  I think we might end up needing to do a) even in the backbranches.

 Definitely agree with WAL-logging truncations; also +1 on backpatching
 that to 9.3.  We already have experience with adding extra WAL records
 on minor releases, and it didn't seem to have bitten too hard.

 I'm inclined to agree. My only problem is that I'm not sure whether we
 can find a way of doing all this without adding a pg_control field. Let
 me try to sketch this out:

 1) We continue determining the oldest 
 SlruScanDirectory(SlruScanDirCbFindEarliest)
on the master to find the oldest offsets segment to
truncate. Alternatively, if we determine it to be safe, we could use
oldestMulti to find that.
 2) SlruScanDirCbRemoveMembers is changed to return the range of members
to remove, instead of doing itself
 3) We wal log [oldest offset segment guaranteed to not be alive,
nextmulti) for offsets, and [oldest members segment guaranteed to not be 
 alive,
nextmultioff), and redo truncations for the entire range during
recovery.

 I'm pretty tired right now, but this sounds doable.

I'm probably biased here, but I think we should finish reviewing,
testing, and committing my patch before we embark on designing this.
So far we have no reports of trouble attributable to the lack of the
WAL-logging support discussed here, as opposed to several reports of
trouble from the status quo within days of release.

I'm having trouble reconstructing the series of events where what
you're worried about here really becomes a problem, and I think we
ought to have a very clear idea about that before back-patching
changes of this type.  It's true that if the state of the SLRU
directory is in the future, because recovery was restarted from an
earlier checkpoint, we might replay a checkpoint and remove some of
those files from the future.  But so what?  By the time we've reached
the minimum recovery point, they will have been recreated by the same
WAL records that created them in the first place.  If, in the previous
replay, we had wrapped all the way around, some of the stuff we keep
may actually already have been overwritten by future WAL records, but
they'll be overwritten again.  Now, that could mess up our
determination of which members to remove, I guess, but I'm not clear
that actually matters either: if the offsets space has wrapped around,
the members space will certainly have wrapped around as well, so we
can remove anything we like at this stage and we're still OK.  I agree
this is ugly the way it is, but where is the actual bug?

As far as your actual outline goes, I think if we do this, we need to
be very careful about step #2.  Right now, we decide what we need to
keep and then remove everything else, but that's kind of wonky because
new stuff may be getting created at the same time, so we keep
adjusting our idea of exactly what needs to be removed.  It would be
far better to invert that logic: decide what needs to be removed -
presumably, everything from the oldest member that now exists up until
some later point - and then remove precisely that stuff and nothing
else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Christoph Berg

Re: Robert Haas 2015-05-29 
ca+tgmozzdjn38tfqydgagj-ap+zkrqsrgbq4eu_zrefryk+...@mail.gmail.com
  FTR: Robert, you have been a Samurai on this issue. Our many thanks.
 
 Thanks!  I really appreciate the kind words.

I'm still watching with admiration. This list of steps-to-reproduce is
the longest and at the same time best I've ever seen.

If anyone ever asks me why debugging in open source projects is better
than closed source, I'll point them to this thread.

Christoph
-- 
c...@df7cb.de | http://www.df7cb.de/


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 11:24 AM, Robert Haas robertmh...@gmail.com wrote:
 A. Most obviously, we should fix pg_upgrade so that it installs
 chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
 that we stop creating new instances of this problem.  That won't get
 us out of the hole we've dug for ourselves, but we can at least try to
 stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
 wrong thing - anyone want to double-check me on that one?)

Yes, it seems like this could lead to truncation of multixacts still
referenced by tuples, leading to errors when updating, locking,
vacuuming.  Why don't we have reports of that?

 B. We need to change find_multixact_start() to fail softly.  This is
 important because it's legitimate for it to fail in recovery, as
 discussed upthread, and also because we probably want to eliminate the
 fail-to-start hazard introduced in 9.4.2 and 9.3.7.
 find_multixact_start() is used in three places, and they each require
 separate handling:

Here is an experimental WIP patch that changes StartupMultiXact and
SetMultiXactIdLimit to find the oldest multixact that exists on disk
(by scanning the directory), and uses that if it is more recent than
the oldestMultiXactId from shmem, when calling
DetermineSafeOldestOffset.  I'm not all that happy with it, see below,
but let me know what you think.

Using unpatched master, I reproduced the startup error with a bit of a
short cut:

1.  initdb, generate enough multixacts to get more than one offsets file
2.  ALTER DATABASE template0 ALLOW_CONNECTION = true;, vacuumdb
--freeze --all, CHECKPOINT
3.  verify that pg_control now holds a large oldestMultiXactId, and
note NextMultiXactId
4.  shutdown, pg_resetxlog -m (NextMultiXactId from 3),1 pg_data
5.  start up: fails

Apply this patch, and it starts up successfully.

What are the repro steps for the replay problem?  Is a basebackup of a
large database undergoing truncation and some good timing needed?

 - In SetMultiXactIdLimit, find_multixact_start() is used to set
 MultiXactState-oldestOffset, which is used to determine how
 aggressively to vacuum.  If find_multixact_start() fails, we don't
 know how aggressively we need to vacuum to prevent members wraparound;
 it's probably best to decide to vacuum as aggressively as possible.
 Of course, if we're in recovery, we won't vacuum either way; the fact
 that it fails softly is good enough.

Isn't it enough to use the start offset for the most recent of the
oldest multixact ID and the oldest multixact found by scanning
pg_multixact/offsets?  In this patch, it does that, but I'm not happy
with the time the work is done, it just doesn't seem right for
SetMultiXactIdLimit to be scanning that directory.  The result of that
operation should only change when files have been truncated anyway,
and the truncate code was already doing a filesystem scan.  Maybe the
truncate code should store the earliest multixact ID found on disk in
shared memory, so that SetMultiXactIdLimit can use it for free.  I
tried to get that working but couldn't figure out where it should be
initialised -- StartupMultiXact is too late (StartupXLOG calls
SetMultiXactIdLimit before that), but BootstrapMultiXact and
MultiXactShmemInit didn't seem like the right places either.

 - In DetermineSafeOldestOffset, find_multixact_start() is used to set
 MultiXactState-offsetStopLimit.  If it fails here, we don't know when
 to refuse multixact creation to prevent wraparound.  Again, in
 recovery, that's fine.  If it happens in normal running, it's not
 clear what to do.  Refusing multixact creation is an awfully blunt
 instrument.  Maybe we can scan pg_multixact/offsets to determine a
 workable stop limit: the first file greater than the current file that
 exists, minus two segments, is a good stop point.  Perhaps we ought to
 use this mechanism here categorically, not just when
 find_multixact_start() fails.  It might be more robust than what we
 have now.

Done in this patch -- the truncate code calls
DetermineSafeOldestOffset with the earliest SLRU found by scanning if
that's more recent than the shmem value, and then
DetermineSafeOldestOffset applies the step-back-one-whole-segment
logic to that as before.

 - In TruncateMultiXact, find_multixact_start() is used to set the
 truncation point for the members SLRU.  If it fails here, I'm guessing
 the right solution is not to truncate anything - instead, rely on
 intense vacuuming to eventually advance oldestMXact to a value whose
 member data still exists; truncate then.

TruncateMultiXact already contained logic to do nothing at all if
oldestMXact is older than the earliest it can find on disk.  I moved
that code into find_earliest_multixact_on_disk() to be able to use it
elsewhere too, in this patch.

 C. I think we should also change TruncateMultiXact() to truncate
 offsets first, and then members.  As things stand, if we truncate
 members first, we increase the risk of seeing an offset that will fail

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Tom Lane

Thomas Munro thomas.mu...@enterprisedb.com writes:
 On Fri, May 29, 2015 at 11:24 AM, Robert Haas robertmh...@gmail.com wrote:
 B. We need to change find_multixact_start() to fail softly.

 Here is an experimental WIP patch that changes StartupMultiXact and
 SetMultiXactIdLimit to find the oldest multixact that exists on disk
 (by scanning the directory), and uses that if it is more recent than
 the oldestMultiXactId from shmem,

Not sure about the details of this patch, but I was planning to propose
what I think is the same thing: the way to make find_multixact_start()
fail softly is to have it find the oldest actually existing file if the
one that should be there isn't.

This would preserve the idea that we aren't trusting the multixact
tracking data to be completely right, which was the point of 78db307bb
and which evidently is still essential.  It would also obviate the need
for other places to contain similar logic.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 10:17 AM, Tom Lane t...@sss.pgh.pa.us wrote:
 Thomas Munro thomas.mu...@enterprisedb.com writes:
 On Fri, May 29, 2015 at 11:24 AM, Robert Haas robertmh...@gmail.com wrote:
 B. We need to change find_multixact_start() to fail softly.

 Here is an experimental WIP patch that changes StartupMultiXact and
 SetMultiXactIdLimit to find the oldest multixact that exists on disk
 (by scanning the directory), and uses that if it is more recent than
 the oldestMultiXactId from shmem,

 Not sure about the details of this patch, but I was planning to propose
 what I think is the same thing: the way to make find_multixact_start()
 fail softly is to have it find the oldest actually existing file if the
 one that should be there isn't.

Working on that now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Sat, May 30, 2015 at 10:48 AM, Andres Freund and...@anarazel.de wrote:
 On 2015-05-30 10:41:01 +1200, Thomas Munro wrote:
 On Sat, May 30, 2015 at 10:29 AM, Robert Haas robertmh...@gmail.com wrote:
  On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote:
  Just saw what looks like a report of this issue on 9.2.
 
  https://github.com/wal-e/wal-e/issues/177
 
  Urk.  That looks awfully similar, but I don't think any of the code
  that is affected here exists in 9.2, or that any of the fixes involved
  were back-patched to 9.2.  So that might be something else altogether.

 Not only that, the pg_multixact/members *directory* is reported
 missing, which is a different problem entirely.

 I don't read that from the log, what does make you think that that's the
 case. Afaics it's just the 0146 member file that is missing?

That's the error message, but then further down:

I have confirmed that directory pg_multixact/members does not
existing in the restored data directory.

I can see this directory and the file if i restore a few days old
backup. I have used WAL-E backups a number of time from this server
but this is the first time I am running into this issue.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote:
 Just saw what looks like a report of this issue on 9.2.

 https://github.com/wal-e/wal-e/issues/177

Urk.  That looks awfully similar, but I don't think any of the code
that is affected here exists in 9.2, or that any of the fixes involved
were back-patched to 9.2.  So that might be something else altogether.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-05-30 10:55:30 +1200, Thomas Munro wrote:
 That's the error message, but then further down:

Ooops.

 I have confirmed that directory pg_multixact/members does not
 existing in the restored data directory.

 I can see this directory and the file if i restore a few days old
 backup. I have used WAL-E backups a number of time from this server
 but this is the first time I am running into this issue.

That sounds like a WAL-E problem rather than a postgres one then. I
think it's exceedingly unlikely - although I perhaps should be careful
with that given the recent history - that postgres contains a bug that
deletes an entire directory on the standby/while cloning.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-05-30 10:41:01 +1200, Thomas Munro wrote:
 On Sat, May 30, 2015 at 10:29 AM, Robert Haas robertmh...@gmail.com wrote:
  On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote:
  Just saw what looks like a report of this issue on 9.2.
 
  https://github.com/wal-e/wal-e/issues/177
 
  Urk.  That looks awfully similar, but I don't think any of the code
  that is affected here exists in 9.2, or that any of the fixes involved
  were back-patched to 9.2.  So that might be something else altogether.
 
 Not only that, the pg_multixact/members *directory* is reported
 missing, which is a different problem entirely.

I don't read that from the log, what does make you think that that's the
case. Afaics it's just the 0146 member file that is missing?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Sat, May 30, 2015 at 10:29 AM, Robert Haas robertmh...@gmail.com wrote:
 On Fri, May 29, 2015 at 5:14 PM, Josh Berkus j...@agliodbs.com wrote:
 Just saw what looks like a report of this issue on 9.2.

 https://github.com/wal-e/wal-e/issues/177

 Urk.  That looks awfully similar, but I don't think any of the code
 that is affected here exists in 9.2, or that any of the fixes involved
 were back-patched to 9.2.  So that might be something else altogether.

Not only that, the pg_multixact/members *directory* is reported
missing, which is a different problem entirely.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote:
 I think we need to step back and look at the brain power required to
 unravel the mess we have made regarding multi-xact and fixes.  (I bet
 few people can even remember which multi-xact fixes went into which
 releases --- I can't.)  Instead of working on actual features, we are
 having to do this complex diagnosis because we didn't do a thorough
 analysis at the time a pattern of multi-xact bugs started to appear. 
 Many projects deal with this compound bug debt regularly, but we have
 mostly avoided this fate.

What is the consequences of that observation you're imagining?


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 3:08 PM, Robert Haas robertmh...@gmail.com wrote:
 It won't fix the fact that pg_upgrade is putting
 a wrong value into everybody's datminmxid field, which should really
 be addressed too, but I've been working on this for about three days
 virtually non-stop and I don't have the energy to tackle it right now.
 If anyone feels the urge to step into that breech, I think what it
 needs to do is: when upgrading from a 9.3-or-later instance, copy over
 each database's datminmxid into the corresponding database in the new
 cluster.

Bruce was kind enough to spend some time on IM with me this afternoon,
and I think this may actually be OK.  What pg_upgrade does is:

1. First, put next-xid into the relminmxid for all tables, including
catalog tables.  This is the correct behavior for upgrades from a
pre-9.3 release, and is correct for catalog tables in general.

2. Next, restoring the schema dump will set the relminmxid values for
all non-catalog tables to the value dumped from the old cluster.  At
this point, everything is fine provided that we are coming from a
release 9.3 or newer.  But if the old cluster is pre-9.3, it will have
dumped *zero* values for all of its relminmxid values; so all of the
user tables go from the correct value they had after step 1 to an
incorrect value.

3. Finally, if the old cluster is pre-9.3, repeat step 1, undoing the
damage done in step 2.

This is a bit convoluted, but I don't know of a reason why it
shouldn't work.  Sorry for the false alarm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera

Andres Freund wrote:

 I considered for a second whether the solution for that could be to not
 truncate while inconsistent - but I think that doesn't solve anything as
 then we can end up with directories where every single offsets/member
 file exists.

Hang on a minute.  We don't need to scan any files to determine the
truncate point for offsets; we have the valid range for them in
pg_control, as nextMulti + oldestMulti.  And using those end points, we
can look for the offsets corresponding to each, and determine the member
files corresponding to the whole set; it doesn't matter what other files
exist, we just remove them all.  In other words, maybe we can get away
with considering truncation separately for offset and members on
recovery: do it like today for offsets (i.e. at each restartpoint), but
do it only in TrimMultiXact for members.

One argument against this idea is that we may not want to keep a full
set of member files on standbys (due to disk space usage), but that's
what will happen unless we truncate during replay.

 I think at least for 9.5+ we should a) invent proper truncation records
 for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
 The current hack of scanning the directories to get knowledge we should
 have is a pretty bad hack, and we should not continue using it forever.
 I think we might end up needing to do a) even in the backbranches.

Definitely agree with WAL-logging truncations; also +1 on backpatching
that to 9.3.  We already have experience with adding extra WAL records
on minor releases, and it didn't seem to have bitten too hard.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera

Bruce Momjian wrote:

 I think we need to step back and look at the brain power required to
 unravel the mess we have made regarding multi-xact and fixes.  (I bet
 few people can even remember which multi-xact fixes went into which
 releases --- I can't.)  Instead of working on actual features, we are
 having to do this complex diagnosis because we didn't do a thorough
 analysis at the time a pattern of multi-xact bugs started to appear. 
 Many projects deal with this compound bug debt regularly, but we have
 mostly avoided this fate.

Well, it's pretty obvious that if we had had a glimpse of the nature of
the issues back then, we wouldn't have committed the patch.  The number
of ends that we left loose we now know to be huge, but we didn't know
that back then.  (I, at least, certainly didn't.)

Simon told me when this last one showed up that what we need at this
point is a way to turn the whole thing off to stop it from affecting
users anymore.  I would love to be able to do that, because the whole
situation has become stressing, but I don't see a way.  Heck, if we
could implement Heikki's TED idea or something similar, I would be up
for back-patching it so that people can pg_upgrade from postgresql-9.3
to postgresql-ted-9.3, and just forget any further multixact pain.
Don't think that's really doable, though.  As far as I can see, for
branches 9.3 and 9.4 the best we can do is soldier on and get these bugs
fixed, hoping that this time they are really the last [serious] ones.

For 9.5, I concur with Andres that we'd do good to change the way
truncations are done by WAL-logging more stuff and keep more data in
pg_control, to avoid all these nasty games.  And for 9.6, find a better
representation of the data so that the durable data is stored separately
from the volatile data.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 9:46 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
 It seems pretty clear that we can't effectively determine anything
 about member wraparound until the cluster is consistent.

 I wonder if this doesn't actually hints at a bigger problem.  Currently,
 to determine where we need to truncate SlruScanDirectory() is
 used. That, afaics, could actually be a problem during recovery when
 we're not consistent.

I agree.  I actually meant to mention this in my previous email, but,
owing to exhaustion and burnout, didn't.

 I think at least for 9.5+ we should a) invent proper truncation records
 for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
 The current hack of scanning the directories to get knowledge we should
 have is a pretty bad hack, and we should not continue using it forever.
 I think we might end up needing to do a) even in the backbranches.

That may be the right thing to do.  I'm concerned that changing the
behavior of master too much will make it every subsequent fix twice as
hard, because we'll have to do one fix in master and another fix in
the back-branches.  I'm also concerned that it will create even more
convoluted failure scenarios. The failure-to-start problem discussed
on this thread requires a chain of four (maybe three) different
PostgreSQL versions in order to create it, and the more things we go
change, the harder it's going to be to reason about this stuff.

The diseased and rotting elephant in the room here is that clusters
with bogus relminmxid, datminmxid, and/or oldestMultiXid values may
exist in the wild and we really have no plan to get rid of them.
78db307bb may have helped somewhat - although I'm haven't grokked what
it's about well enough to be sure - but it's certainly not a complete
solution, as this bug report itself illustrates rather well.  Unless
we figure out some clever solution that is not now apparent to me, or
impose a hard pg_upgrade compatibility break at some point, we
basically can't count on pg_control's oldest multixact information
to be correct ever again.  We may be running into clusters 15 years
from now that have problems that are just holdovers from what was
fixed in 9.3.5.

One thing I think we should definitely do is add one or two additional
fields to pg_controldata that get filled in by pg_upgrade.  One of
them should be the oldest known catversion in the lineage of this
cluster and the other should be the most recent catverson in the
lineage of this cluster before this one.   Or maybe we should store
PG_VERSION_NUM values.  Or store both things.  I think that would make
troubleshooting this kind of problem a lot easier - just from the
pg_controldata output, you'd be able to tell whether the cluster had
been pg_upgraded, whether it had been pg_upgraded once or multiple
times, and at least some of the versions involved, without relying on
the user's memory of what they did and when.  Fortunately, Steve
Kellet had a pretty clear idea of what his history was, but not all
users know that kind of thing, and I've wanted it more than once while
troubleshooting.

Another thing I think we should do is add a field to pg_class that is
propagated by pg_upgrade and stores the most recent PG_VERSION_NUM
that is known to have performed a scan_all vacuum of the table.  This
would allow us to do things in the future like (a) force a full-table
vacuum of any table that hasn't been vacuumed since $BUGGYRELEASE or
(b) advise users to manually inspect the values and manually perform
said vacuum or (c) only believe that certain information about a table
is accurate if it's been full-scanned by a vacuum newer than
$BUGGYRELEASE.  It could also be used as part of a strategy for
reclaiming HEAP_MOVED_IN/HEAP_MOVED_OFF; e.g. you can't upgrade to
10.5, which repurposes those bits, unless you've done a scan_all
vacuum on every table with a release new enough to guarantee that
they're not used for their historical purpose.

 This problem isn't conflicting with most of the fixes you describe, so
 I'll continue with reviewing those.

Thank you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
 It seems pretty clear that we can't effectively determine anything
 about member wraparound until the cluster is consistent.

I wonder if this doesn't actually hints at a bigger problem.  Currently,
to determine where we need to truncate SlruScanDirectory() is
used. That, afaics, could actually be a problem during recovery when
we're not consistent.

Consider the scenario where a very large database is copied while
running. At the start of the backup we'll determine at which checkpoint
recovery will start and store it in the label. After that the copy will
start, copying everything slowly. That works because we expect recovery
to fix things up.  The problem I see WRT multixacts is that the copied
state of pg_multixact could be wildly different from the one at the
label's checkpoint. During recovery, before reaching the first
checkpoint, we'll create multixact files as used at the time of the
checkpoint. But the rest of pg_multixact may be ahead 2**31 xacts.

For clog and such that's not a problem because the truncation
etc. points are all stored in WAL - during recovery we just replay the
truncations that happened on the master, there's no need to look at the
data directory. And we won't access the clog before being consistent.

But for multixacts is different. To avoid ending up with
pg_multixact/*/* directories we have to do truncations during
recovery. As there's currently no truncation records we have to do that
scanning the data directory. But that state could be from the future.

I considered for a second whether the solution for that could be to not
truncate while inconsistent - but I think that doesn't solve anything as
then we can end up with directories where every single offsets/member
file exists.  We could possibly try to fix that by always truncating
away slru segments in offsets that we know to be too old to exist in a
valid database. But achieving the same for members fries my brain.  It
also seems awfully risky.

I think at least for 9.5+ we should a) invent proper truncation records
for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
The current hack of scanning the directories to get knowledge we should
have is a pretty bad hack, and we should not continue using it forever.
I think we might end up needing to do a) even in the backbranches.


Am I missing something?


This problem isn't conflicting with most of the fixes you describe, so
I'll continue with reviewing those.


Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Sat, May 30, 2015 at 1:46 PM, Andres Freund and...@anarazel.de wrote:
 On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
 It seems pretty clear that we can't effectively determine anything
 about member wraparound until the cluster is consistent.

 I wonder if this doesn't actually hints at a bigger problem.  Currently,
 to determine where we need to truncate SlruScanDirectory() is
 used. That, afaics, could actually be a problem during recovery when
 we're not consistent.

 Consider the scenario where a very large database is copied while
 running. At the start of the backup we'll determine at which checkpoint
 recovery will start and store it in the label. After that the copy will
 start, copying everything slowly. That works because we expect recovery
 to fix things up.  The problem I see WRT multixacts is that the copied
 state of pg_multixact could be wildly different from the one at the
 label's checkpoint. During recovery, before reaching the first
 checkpoint, we'll create multixact files as used at the time of the
 checkpoint. But the rest of pg_multixact may be ahead 2**31 xacts.

Yes, I think the code in TruncateMultiXact that scans for the earliest
multixact only works when the segment files span at most 2^31 of
multixact space. If they span more than that, MultiXactIdPrecedes is
no long able to provide a total ordering, so of the scan may be wrong,
depending on the order that it encounters the files.

Incidentally, your description of that scenario gave me an idea for
how to reproduce a base backup that 9.4.2 or master can't start.  I
tried this first:

1.  Set up with max_wal_senders = 1, wal_level = hot_standby, initdb
2.  Create enough multixacts to fill a couple of segments in
pg_multixacts/offsets using explode_mxact_members 99 1000 (create
foo table first)
3.  Start a base backup with logs, but break in
src/backend/replication/basebackup.c after
sendFileWithContent(BACKUP_LABEL_FILE, labelfile); and before sending
the contents of the data dir (including pg_multixacts)... (or just put
a big sleep in there)
4.  UPDATE pg_database SET datallowconn = true; vacuumdb --freeze
--all; CHECKPOINT;, see that offsets/ is now gone and
oldestMultiXid is 98001 in pg_control
5.  ... allow the server backend to continue; the basebackup completes.

Inspecting the new data directory, I see that offsets/ is not
present as expected, and pg_control contains the oldestMultiXid 98001.

Since pg_control was copied after pg_multixacts and my database didn't
move between those copies, it points to a valid multixact (unlike the
pg_upgrade scenario) and is able to start up, but does something
different again which may or may not be good, I'm not sure:

LOG:  database system was interrupted; last known up at 2015-05-30 14:30:23 NZST
LOG:  file pg_multixact/offsets/ doesn't exist, reading as zeroes
LOG:  redo starts at 0/728
LOG:  consistent recovery state reached at 0/70C8898
LOG:  redo done at 0/70C8898
LOG:  last completed transaction was at log time 2015-05-30 14:30:17.261436+12
LOG:  database system is ready to accept connections

My next theory about how to get a FATAL during startup is something
like this:  Break in basebackup.c in between copying pg_multixacts and
copying pg_control (simulating a very large/slow file copy, perhaps if
'base' happens to get copied after 'pg_multixacts', though I don't
know if that's possible), and while it's stopped, generate some
offsets segments, vacuum --freeze --all, checkpoint and then create a
few more multixacts, then checkpoint again (so that oldestMultiXact is
not equal to nextMultiXact).  Continue.  Now pg_control's
oldestMultiXactId now points at a segment file that didn't exist when
pg_multixacts was copied.  I haven't managed to get this to work (ie
produce a FATAL) and I'm out of time for a little while, but wanted to
share this idea in case it helps someone.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Fri, May 29, 2015 at 12:43 PM, Robert Haas robertmh...@gmail.com wrote:
 Working on that now.

OK, here's a patch.  Actually two patches, differing only in
whitespace, for 9.3 and for master (ha!).  I now think that the root
of the problem here is that DetermineSafeOldestOffset() and
SetMultiXactIdLimit() were largely ignorant of the possibility that
they might be called at points in time when the cluster was
inconsistent.  SetMultiXactIdLimit() bracketed certain parts of its
logic with if (!InRecovery), but those guards were ineffective because
it gets called before InRecovery is set in the first place.

It seems pretty clear that we can't effectively determine anything
about member wraparound until the cluster is consistent.  Before then,
there might be files missing from the offsets or members SLRUs which
get put back during replay.  There could even be gaps in the sequence
of files, with some things having made it to disk before the crash (or
having made it into the backup) and others not.  So all the work of
determining what the safe stop points and vacuum thresholds for
members are needs to be postponed until TrimMultiXact() time.  And
that's fine, because we don't need this information in recovery anyway
- it only affects behavior in normal running.

So this patch does the following:

1. Moves the call to DetermineSafeOldestOffset() that appears in
StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
until we're consistent.  Also, instead of passing
MultiXactState-oldestMultiXactId, pass the newer of that value and
the earliest offset that exists on disk.  That way, it won't try to
read data that's not there.  Note that the second call to
DetermineSafeOldestOffset() in TruncateMultiXact() doesn't need a
similar guard, because we already bail out of that function early if
the multixacts we're going to truncate away don't exist.

2. Adds a new flag MultiXactState-didTrimMultiXact indicate whether
we've finished TrimMultiXact(), and arranges for SetMultiXactIdLimit()
to use that rather than InRecovery to test whether it's safe to do
complicated things that might require that the cluster is consistent.
This is a slight behavior change, since formerly we would have tried
to do that stuff very early in the startup process, and now it won't
happen until somebody completes a vacuum operation.  If that's a
problem, we could consider doing it in TrimMultiXact(), but I don't
think it's safe the way it was.  The new flag also prevents
oldestOffset from being set while in recovery; I think it would be
safe to do that in recovery once we've reached consistency, but I
don't believe it's necessary.

3. Arranges for TrimMultiXact() to set oldestOffset.  This is
necessary because, previously, we relied on SetMultiXactIdLimit doing
that during early startup or during recovery, and that's no longer
true.  Here too we set oldestOffset keeping in mind that our notion of
the oldest multixact may point to something that doesn't exist; if so,
we use the oldest MXID that does.

4. Modifies TruncateMultiXact() so that it doesn't re-scan the SLRU
directory on every call to find the oldest file that exists.  Instead,
it arranges to remember the value from the first scan and then updates
it thereafter to reflect its own truncation activity.  This isn't
absolutely necessary, but because this oldest-file logic is used in
multiple places (TrimMultiXact, SetMultiXactIdLimit, and
TruncateMultiXact all need it directly or indirectly) caching the
value seems like a better idea than recomputing it frequently.

I have tested that this patch fixes Steve Kehlet's problem, or at
least what I believe to be Steve Kehlet's problem based on the
reproduction scenario I described upthread.  I believe it will also
fix the problems with starting up from a base backup with Alvaro
mentioned upthread.  It won't fix the fact that pg_upgrade is putting
a wrong value into everybody's datminmxid field, which should really
be addressed too, but I've been working on this for about three days
virtually non-stop and I don't have the energy to tackle it right now.
If anyone feels the urge to step into that breech, I think what it
needs to do is: when upgrading from a 9.3-or-later instance, copy over
each database's datminmxid into the corresponding database in the new
cluster.

Aside from that, it's very possible that despite my best efforts this
has serious bugs.  Review and testing would be very much appreciated.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 699497c..8d28a5c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -197,8 +197,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Josh Berkus

All,

Just saw what looks like a report of this issue on 9.2.

https://github.com/wal-e/wal-e/issues/177

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Steve Kehlet

On Fri, May 29, 2015 at 12:08 PM Robert Haas robertmh...@gmail.com wrote:

 OK, here's a patch.


I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific
patch. I rebuilt, installed, and postgres started up successfully!  I did a
bunch of checks, had our app run several thousand SQL queries against it,
had a colleague check it out, and it looks good. Looking at top and ps, I
don't see anything funny (e.g. no processes spinning cpu, etc), things look
normal. Let me know if I can provide anything else.

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Thomas Munro

On Fri, May 29, 2015 at 7:56 AM, Robert Haas robertmh...@gmail.com wrote:
 On Thu, May 28, 2015 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
 [ speculation ]

 [...]  However, since
 the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
 which will call SetMultiXactIdLimit, which will propagate the bogus
 datminmxid = 1 setting into shared memory.

Ah!

 [...]

 - There's a third possible problem related to boundary cases in
 SlruScanDirCbRemoveMembers, but I don't understand that one well
 enough to explain it.  Maybe Thomas can jump in here and explain the
 concern.

I noticed something in passing which is probably not harmful, and not
relevant to this bug report, it was just a bit confusing while
testing:  SlruScanDirCbRemoveMembers never deletes any files if
rangeStart == rangeEnd.  In practice, if you have an idle cluster with
a lot of multixact data and you VACUUM FREEZE all databases and then
CHECKPOINT, you might be surprised to see no member files going away
quite yet, but they'll eventually be truncated by a future checkpoint,
once rangeEnd has had a chance to advance to the next page due to more
multixacts being created.

If we want to fix this one day, maybe the right thing to do is to
treat the rangeStart == rangeEnd case the same way we treat rangeStart
 rangeEnd, that is, to assume that the range of pages isn't
wrapped/inverted in this case.  Although we don't have the actual
start and end offset values to compare here, we know that for them to
fall on the same page, the start offset index must be = the end
offset index (since we added the new error to prevent member space
wrapping, we never allow the end to get close enough to the start to
fall on the same page).  Like this (not tested):

diff --git a/src/backend/access/transam/multixact.c
b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char
*filename, int segpage,
  /* Recheck the deletion condition.  If it still holds, perform deletion */
  if ((range-rangeStart  range-rangeEnd 
  segpage  range-rangeEnd  segpage  range-rangeStart) ||
- (range-rangeStart  range-rangeEnd 
+ (range-rangeStart = range-rangeEnd 
  (segpage  range-rangeStart || segpage  range-rangeEnd)))
  SlruDeleteSegment(ctl, filename);

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera

Robert Haas wrote:
 On Thu, May 28, 2015 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
  [ speculation ]
 
 OK, I finally managed to reproduce this, after some off-list help from
 Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
 do it:

It's a long list of steps, but if you consider them carefully, it
becomes clear that they are natural steps that a normal production
system would go through -- essentially the only one that's really
time-critical is the decision to pg_upgrade with a version before 9.3.5.

 In the process of investigating this, we found a few other things that
 seem like they may also be bugs:
 
 - As noted upthread, replaying an older checkpoint after a newer
 checkpoint has already happened may lead to similar problems.  This
 may be possible when starting from an online base backup; or when
 restarting a standby that did not perform a restartpoint when
 replaying the last checkpoint before the shutdown.

I'm going through this one now, as it's closely related and caused
issues for us.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Joshua D. Drake



On 05/28/2015 12:56 PM, Robert Haas wrote:




FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Sincerely,

jD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing I'm offended is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, May 28, 2015 at 8:51 AM, Robert Haas robertmh...@gmail.com wrote:
 [ speculation ]

OK, I finally managed to reproduce this, after some off-list help from
Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
do it:

1. Install any pre-9.3 version of the server and generate enough
multixacts to create at least TWO new segments.  When you shut down
the server, all segments except for the most current one will be
removed.  At this point, the only thing in
$PGDATA/pg_multixact/offsets should be a single file, and the name of
that file should not be  or 0001.

2. Use pg_upgrade to upgrade to 9.3.4.  It is possible that versions 
9.3.4 will also work here, but you must not use 9.3.5 or higher,
because 9.3.5 includes Bruce's commit 3d2e18510, which arranged to
preserve relminmxid and datminmxid values.   At this point,
pg_controldata on the new cluster should show an oldestMultiXid value
greater than 1 (copied from the old cluster), but all the datminmxid
values are 1.  Also, initdb will have left behind a bogus  file in
pg_multixact/offsets.

3. Move to 9.3.5 (or 9.3.6), not via pg_upgrade, but just by dropping
in the new binaries.  Follow the instructions in the 9.3.5 release
notes; since you created at least TWO new segments in step one, there
will be no 0001 file, and the query there will say that you should
remove the bogus  file.  So do that, leaving just the good file in
pg_multixact/offsets.  At this point, pg_multixact/offsets is OK, and
pg_controldata still says that oldestMultiXid  1, so that is also OK.
The only problem is that we've got some bogus datminmxid values
floating around.  Our next step will be to convince vacuum to
propagate the bogus datminmxid values back into pg_controldata.

4. Consume at least one transaction ID (e.g. SELECT txid_current())
and then do this:

postgres=# set vacuum_freeze_min_age = 0;
SET
postgres=# set vacuum_freeze_table_age = 0;
SET
postgres=# vacuum;
VACUUM

Setting the GUCs forces full table scans, so that we advance
relfrozenxid.  But notice that we were careful not to just run VACUUM
FREEZE, which would have also advanced relminmxid, which, for purposes
of reproducing this bug, is not what we want to happen.  So relminmxid
is still (incorrectly) set to 1 for every database.  However, since
the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
which will call SetMultiXactIdLimit, which will propagate the bogus
datminmxid = 1 setting into shared memory.

(In my testing, this step doesn't work if performed on 9.3.4; you have
to do it on 9.3.5.  I think that's because of Tom's commit 78db307bb,
but I believe in a more complex test scenario you might be able to get
this to happen on 9.3.4 also.)

I believe it's the case that an autovacuum of even a single table can
substitute for this step if it happens to advance relfrozenxid but not
relminmxid.

5. The next checkpoint, or the shutdown checkpoint in any event, will
propagate the bogus value of 1 from shared memory back into the
control file.

6. Now try to start 9.3.7.  It will see the bogus oldestMultiXid = 1
value in the control file, attempt to read the corresponding offsets
file, and die.

In the process of investigating this, we found a few other things that
seem like they may also be bugs:

- As noted upthread, replaying an older checkpoint after a newer
checkpoint has already happened may lead to similar problems.  This
may be possible when starting from an online base backup; or when
restarting a standby that did not perform a restartpoint when
replaying the last checkpoint before the shutdown.

- pg_upgrade sets datminmxid =
old_cluster.controldata.chkpnt_nxtmulti, which is correct only if
there are ZERO multixacts in use at the time of the upgrade.  It would
be best, I think, to set this to the same value it had in the old
cluster, but if we're going to use a blanket value, I think it needs
to be chkpnt_oldstMulti.

- There's a third possible problem related to boundary cases in
SlruScanDirCbRemoveMembers, but I don't understand that one well
enough to explain it.  Maybe Thomas can jump in here and explain the
concern.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
 values which are equal to the next-mxid counter instead of the correct
 value; in other words, they are too new.

 [ discussion of how the control file's oldestMultiXid gets set ]

I'm talking about the datminmxid in pg_database.  You're talking about
the contents of pg_control.  Those are two different things.  The
relevant code is not what you quote, but rather this:

/* set pg_database.datminmxid */
PQclear(executeQueryOrDie(conn_template1,
  UPDATE
pg_catalog.pg_database 
  SET
datminmxid = '%u',

old_cluster.controldata.chkpnt_nxtmulti));

Tom previously observed this to be wrong, here:

http://www.postgresql.org/message-id/9879.1405877...@sss.pgh.pa.us

Although Tom was correct to note that it's wrong, nothing ever got fixed.  :-(

 A. Most obviously, we should fix pg_upgrade so that it installs
 chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
 that we stop creating new instances of this problem.  That won't get
 us out of the hole we've dug for ourselves, but we can at least try to
 stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
 wrong thing - anyone want to double-check me on that one?)

 I don't think there's anything that we need to fix here.

I see your followup now agreeing this is broken.  Since I wrote the
previous email, I've had two new ideas that I think are both better
than the above.

1. Figure out the oldest multixact offset that actually exists in
pg_multixacts/offsets, and use that value.  If any older MXIDs still
exist, they won't be able to be looked up anyway, so if they wrap
around, it doesn't matter.  The only value that needs to be reliable
in order to do this is pg_controldata's NextMultiXactId, which to the
best of my knowledge is not implicated in any of these bugs.
pg_upgrade can check that the offsets file containing that value
exists, and if not bail out.  Then, start stepping backwards a file at
a time.  When it hits a missing file, the first multixact in the next
file is a safe value of datfrozenxid for every database in the new
cluster.  If older MXIDs exist, they're unreadable anyway, so if they
wrap, nothing lost.  If the value is older than necessary, the first
vacuum in each database will fix it.  We have to be careful: if we
step back too many files, such that our proposed datfrozenxid might
wrap, then we've got a confusing situation and had better bail out -
or at least think really carefully about what to do.

2. When we're upgrading from a version 9.3 or higher, copy the EXACT
datminmxid from each old database to the corresponding new database.
This seems like it ought to be really simple.

 - In DetermineSafeOldestOffset, find_multixact_start() is used to set
 MultiXactState-offsetStopLimit.  If it fails here, we don't know when
 to refuse multixact creation to prevent wraparound.  Again, in
 recovery, that's fine.  If it happens in normal running, it's not
 clear what to do.  Refusing multixact creation is an awfully blunt
 instrument.  Maybe we can scan pg_multixact/offsets to determine a
 workable stop limit: the first file greater than the current file that
 exists, minus two segments, is a good stop point.  Perhaps we ought to
 use this mechanism here categorically, not just when
 find_multixact_start() fails.  It might be more robust than what we
 have now.

 Blunt instruments have the desirable property of being simple.  We don't
 want any more clockwork here, I think --- this stuff is pretty
 complicated already.  As far as I understand, if during normal running
 we see that find_multixact_start has failed, sufficient vacuuming should
 get it straight eventually with no loss of data.

Unfortunately, I don't believe that to be true.  If
find_multixact_start() fails, we have no idea how close we are to the
member wraparound point.  Sure, we can start vacuuming, but the user
can be creating new, large multixacts at top speed while we're doing
that, which could cause us to wrap around before we can finish
vacuuming.

Furthermore, if we adopted the blunt instrument, users who are in this
situation would update to 9.4.3 (or whenever these fixes get released)
and find that they can't create new MXIDs for a possibly very
protracted period of time.  That amounts to an outage for which users
won't thank us.

Looking at the files in the directory seems pretty simple in this
case, and quite a bit more fail-safe than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does.  With this approach, we can be certain that
there is a gap.  And that is a darned good thing to be certain about.

 C. I think we should also change

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake j...@commandprompt.com wrote:
 FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Thanks!  I really appreciate the kind words.

So, in thinking through this situation further, it seems to me that
the situation is pretty dire:

1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
or datminmxid values which are 1 instead of the correct value.
Setting the value to 1 was too far in the past if your MXID counter is
 2B, and too far in the future if your MXID counter is  2B.

2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are two new.

3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
have the first problem for tables in template databases, and the
second one for the rest. (See 866f3017a.)

4. Wrong relminmxid or datminmxid values can eventually propagate into
the control file, as demonstrated in my previous post.  Therefore, we
can't count on relminmxid to be correct, we can't count on datminmxid
to be correct, and we can't count on the control file to be correct.
That's a sack of sad.

5. If the values are too far in the past, then nothing really terrible
will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the
system will refuse to start.  Forcing a VACUUM FREEZE on every
database, including the unconnectable ones, should fix this and allow
you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2
fix a different set of multixact data loss bugs.

6. If the values are too far in the future, the system may fail to
prevent wraparound, leading to data loss.  I am not totally clear on
whether a VACUUM FREEZE will fix this problem.  It seems like the
chances are better if you are running at least 9.3.5+ or 9.4.X,
because of 78db307bb.  But I'm not sure how complete a fix that is.

So what do we do about this?  I have a few ideas:

A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
that we stop creating new instances of this problem.  That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)

B. We need to change find_multixact_start() to fail softly.  This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:

- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState-oldestOffset, which is used to determine how
aggressively to vacuum.  If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.

- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState-offsetStopLimit.  If it fails here, we don't know when
to refuse multixact creation to prevent wraparound.  Again, in
recovery, that's fine.  If it happens in normal running, it's not
clear what to do.  Refusing multixact creation is an awfully blunt
instrument.  Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point.  Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails.  It might be more robust than what we
have now.

- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU.  If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.

C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members.  As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes.  That seem like an
unnecessary risk.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera

Robert Haas wrote:

 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
 values which are equal to the next-mxid counter instead of the correct
 value; in other words, they are too new.

What you describe is what happens if you upgrade from 9.2 or earlier.
For this case we use this call:

exec_prog(UTILITY_LOG_FILE, NULL, true,
  \%s/pg_resetxlog\ -m %u,%u \%s\,
  new_cluster.bindir,
  old_cluster.controldata.chkpnt_nxtmulti + 1,
  old_cluster.controldata.chkpnt_nxtmulti,
  new_cluster.pgdata);

This uses the old cluster's nextMulti value as oldestMulti in the new
cluster, and that value+1 is used as nextMulti.  This is correct: we
don't want to preserve any of the multixact state from the previous
cluster; anything before that value can be truncated with no loss of
critical data.  In fact, there is no critical data before that value at
all.

If you upgrade from 9.3, this other call is used instead:

/*
 * we preserve all files and contents, so we must preserve both next
 * counters here and the oldest multi present on system.
 */
exec_prog(UTILITY_LOG_FILE, NULL, true,
  \%s/pg_resetxlog\ -O %u -m %u,%u \%s\,
  new_cluster.bindir,
  old_cluster.controldata.chkpnt_nxtmxoff,
  old_cluster.controldata.chkpnt_nxtmulti,
  old_cluster.controldata.chkpnt_oldstMulti,
  new_cluster.pgdata);

In this case we use the oldestMulti from the old cluster as oldestMulti
in the new cluster, which is also correct.


 A. Most obviously, we should fix pg_upgrade so that it installs
 chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so
 that we stop creating new instances of this problem.  That won't get
 us out of the hole we've dug for ourselves, but we can at least try to
 stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
 wrong thing - anyone want to double-check me on that one?)

I don't think there's anything that we need to fix here.


 B. We need to change find_multixact_start() to fail softly.  This is
 important because it's legitimate for it to fail in recovery, as
 discussed upthread, and also because we probably want to eliminate the
 fail-to-start hazard introduced in 9.4.2 and 9.3.7.
 find_multixact_start() is used in three places, and they each require
 separate handling:
 
 - In SetMultiXactIdLimit, find_multixact_start() is used to set
 MultiXactState-oldestOffset, which is used to determine how
 aggressively to vacuum.  If find_multixact_start() fails, we don't
 know how aggressively we need to vacuum to prevent members wraparound;
 it's probably best to decide to vacuum as aggressively as possible.
 Of course, if we're in recovery, we won't vacuum either way; the fact
 that it fails softly is good enough.

Sounds good.

 - In DetermineSafeOldestOffset, find_multixact_start() is used to set
 MultiXactState-offsetStopLimit.  If it fails here, we don't know when
 to refuse multixact creation to prevent wraparound.  Again, in
 recovery, that's fine.  If it happens in normal running, it's not
 clear what to do.  Refusing multixact creation is an awfully blunt
 instrument.  Maybe we can scan pg_multixact/offsets to determine a
 workable stop limit: the first file greater than the current file that
 exists, minus two segments, is a good stop point.  Perhaps we ought to
 use this mechanism here categorically, not just when
 find_multixact_start() fails.  It might be more robust than what we
 have now.

Blunt instruments have the desirable property of being simple.  We don't
want any more clockwork here, I think --- this stuff is pretty
complicated already.  As far as I understand, if during normal running
we see that find_multixact_start has failed, sufficient vacuuming should
get it straight eventually with no loss of data.

 - In TruncateMultiXact, find_multixact_start() is used to set the
 truncation point for the members SLRU.  If it fails here, I'm guessing
 the right solution is not to truncate anything - instead, rely on
 intense vacuuming to eventually advance oldestMXact to a value whose
 member data still exists; truncate then.

Fine.

 C. I think we should also change TruncateMultiXact() to truncate
 offsets first, and then members.  As things stand, if we truncate
 members first, we increase the risk of seeing an offset that will fail
 when passed to find_multixact_start(), because TruncateMultiXact()
 might get interrupted before it finishes.  That seem like an
 unnecessary risk.

Not sure about this point.  We did it the way you propose previously,
and found it to be a problem because sometimes we tried to read an
offset file that was no longer there.  Do we really read member files
anywhere?  I thought we only tried to read offset files.  If we remove
member files, what is it that we try to read and find not to be present?

--

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera

Alvaro Herrera wrote:
 Robert Haas wrote:
 
  2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
  values which are equal to the next-mxid counter instead of the correct
  value; in other words, they are too new.
 
 What you describe is what happens if you upgrade from 9.2 or earlier.

Oh, you're referring to pg_database values, not the ones in pg_control.
Ugh :-(  This invalidates my argument that there's nothing to fix,
obviously ... it's clearly broken as is.

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training  Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Re: [GENERAL] 9.4.1 - 9.4.2 problem: could not access status of transaction 1