Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Robert Haas wrote:
> 
> > So here's a patch taking a different approach.
> 
> I tried to apply this to 9.3 but it's messy because of pgindent.  Would
> anyone have a problem with me backpatching a pgindent run of multixact.c?

Done.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund  wrote:
> On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
>> + /*
>> +  * Do we need an emergency autovacuum?  If we're not sure, assume yes.
>> +  */
>> + return !oldestOffsetKnown ||
>> + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);
>
> I think without teaching autovac about those rules, this might just lead
> to lots of autovac processes starting without knowing they should do
> something? They know about autovacuum_multixact_freeze_age, but they
> know neither about !oldestOffsetKnown nor, afaics, about pending offset
> wraparounds?

You're right, but that's why the latest patch has changes in
MultiXactMemberFreezeThreshold.
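
Roughly, the idea is for the autovacuum-facing threshold to shrink as
member space fills up, and to collapse to zero when we cannot tell how
much is in use.  A simplified standalone sketch of that ramp -- the
constants, signature, and inputs here are stand-ins, not the actual
patch:

#include <stdint.h>
#include <stdio.h>

/* Stand-in thresholds; the real ones live in multixact.c. */
#define MEMBER_SAFE_THRESHOLD   (UINT32_C(1) << 30)	/* ~25% of member space */
#define MEMBER_DANGER_THRESHOLD (UINT32_C(1) << 31)	/* ~50% of member space */

static int	autovacuum_multixact_freeze_max_age = 400000000;

/*
 * As member usage climbs from the safe threshold toward the danger
 * threshold, ramp the effective freeze age down toward zero so that
 * autovacuum freezes multixacts increasingly aggressively.  If we don't
 * know the oldest offset, assume the worst and return zero.
 */
static int
member_freeze_threshold(uint32_t members, uint32_t multixacts,
						int oldest_offset_known)
{
	double		fraction;
	uint32_t	victim;

	if (!oldest_offset_known)
		return 0;
	if (members <= MEMBER_SAFE_THRESHOLD)
		return autovacuum_multixact_freeze_max_age;

	fraction = (double) (members - MEMBER_SAFE_THRESHOLD) /
		(MEMBER_DANGER_THRESHOLD - MEMBER_SAFE_THRESHOLD);
	victim = (uint32_t) (multixacts * fraction);

	return (victim >= multixacts) ? 0 : (int) (multixacts - victim);
}

int
main(void)
{
	/* Halfway between safe and danger: freeze the older half. */
	printf("%d\n", member_freeze_threshold(UINT32_C(3) << 29, 10000000, 1));
	return 0;
}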

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Andres Freund
Hi,

On 2015-06-04 12:57:42 -0400, Robert Haas wrote:
> + /*
> +  * Do we need an emergency autovacuum?  If we're not sure, assume yes.
> +  */
> + return !oldestOffsetKnown ||
> + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD);

I think without teaching autovac about those rules, this might just lead
to lots of autovac processes starting without knowing they should do
something? They know about autovacuum_multixact_freeze_age, but they
know neither about !oldestOffsetKnown nor, afaics, about pending offset
wraparounds?

> -static MultiXactOffset
> -find_multixact_start(MultiXactId multi)
> +static bool
> +find_multixact_start(MultiXactId multi, MultiXactOffset *result)
>  {
>   MultiXactOffset offset;
>   int pageno;
> @@ -2630,6 +2741,9 @@ find_multixact_start(MultiXactId multi)
>   pageno = MultiXactIdToOffsetPage(multi);
>   entryno = MultiXactIdToOffsetEntry(multi);
>  
> + if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
> + return false;
> +
>   /* lock is acquired by SimpleLruReadPage_ReadOnly */
>   slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
>   offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
> @@ -2642,25 +2756,31 @@ find_multixact_start(MultiXactId multi)
>

I think it'd be a good idea to also return false in case of an
InvalidMultiXactId - that'll be returned if the page has been zeroed.
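
Concretely, the extra guard could go right where the offset has been
read back -- a sketch against the quoted hunk (variable names taken
from the diff, not the committed fix):

	offset = *offptr;
	LWLockRelease(MultiXactOffsetControlLock);

	/*
	 * A zeroed page reads back as 0/InvalidMultiXactId here; treat that
	 * like a missing page rather than handing the caller a bogus start
	 * offset.
	 */
	if (offset == 0)
		return false;

	*result = offset;
	return true;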


Andres




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Robert Haas wrote:

> So here's a patch taking a different approach.

I tried to apply this to 9.3 but it's messy because of pgindent.  Would
anyone have a problem with me backpatching a pgindent run of multixact.c?

Also, you have a new function SlruPageExists, but we already have
SimpleLruDoesPhysicalPageExist.  No need for two copies of pretty much
the same code, I think.  Your code uses fstat() instead of
lseek(.., SEEK_END) but the exact technique used is probably not
relevant.
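
Either way the underlying test is just "does the segment file currently
extend past the requested page?".  A minimal standalone illustration of
the lseek(.., SEEK_END) variant (the constants are the usual defaults;
the function name is made up for the example):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192
#define SLRU_PAGES_PER_SEGMENT 32

/* Does page `pageno` physically exist in the given open segment file? */
static int
page_exists(int fd, int pageno)
{
	off_t		size = lseek(fd, 0, SEEK_END);
	int			rpageno = pageno % SLRU_PAGES_PER_SEGMENT;	/* page within segment */

	return size >= (off_t) (rpageno + 1) * BLCKSZ;
}

int
main(void)
{
	/* E.g. offset 147456 from the error earlier in the thread is page 18. */
	int			fd = open("pg_multixact/offsets/0046", O_RDONLY);

	if (fd >= 0)
	{
		printf("page 18 exists: %d\n", page_exists(fd, 18));
		close(fd);
	}
	return 0;
}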

I think I like this approach better than your other patch, FWIW --
mainly because it seems simpler.

Will review.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas  wrote:
> On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund  wrote:
>>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>>> > that's actually ok; because the value at the beginning of the segment
>>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>>> > filled the page with zeros.
>>> >
>>> > I think that should be harmless, the worst that can happen is that
>>> > oldestOffset erroneously is 0, which should be correct, even though
>>> > possibly overly conservative, in these cases.
>>>
>>> Uh oh.  That seems like a real bad problem for this approach.  What
>>> keeps that from being the opposite of too conservative?  There's no
>>> "safe" value in a circular numbering space.
>>
>> I think it *might* (I'm really jetlagged) be fine because that'll only
>> happen after an upgrade from < 9.3. And in that case we initialize
>> nextOffset to 0. That ought to save us?
>
> That's pretty much betting the farm on the bugs we know about today
> being the only ones there are.  That seems imprudent.

So here's a patch taking a different approach.  In this approach, if
the multixact whose members we want to look up doesn't exist, we don't
use a later one (that might or might not be valid).  Instead, we
attempt to cope with the unknown.  That means:

1. In TruncateMultiXact(), we don't truncate.

2. If setting the offset stop limit (the point where we refuse to
create new multixact space), we don't arm the stop point.  This means
that if you're in this situation, you run without member wraparound
protection until it's corrected.  A message gets logged once per
checkpoint telling you that you have this problem, and another message
gets logged when things get straightened out and the guards are
enabled.  (A sketch of this logic follows the list.)

3. If setting the vacuum force point, we assume that it's appropriate
to immediately force vacuum.
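
In code form, (2) amounts to something like this where the checkpointer
computes the stop point -- a sketch using the flag names from the diff
below, not the patch itself:

	if (oldestOffsetKnown)
	{
		/* Arm the member-wraparound guard as usual. */
		offsetStopLimit = oldestOffset -
			(oldestOffset % MULTIXACT_MEMBERS_PER_PAGE);
		offsetStopLimitKnown = true;
	}
	else
	{
		/* No safe stop point can be computed; run unguarded and say so. */
		offsetStopLimitKnown = false;
		ereport(LOG,
				(errmsg("MultiXact member wraparound protections are disabled because oldest checkpointed MultiXact %u does not exist on disk",
						oldestMultiXactId)));
	}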

I've only tested this very lightly - this is just to see what you and
Noah and others think of the approach.  As compared with the previous
approach, it has the advantage of making minimal assumptions about the
sanity of what's on disk.  It has the disadvantage that, for some
people, the member-wraparound guard won't be enabled at startup -- but
note that those people can't start 9.3.7/9.4.2 *at all*, so currently
they are either running without member wraparound protection anyway
(if they haven't upgraded to those releases) or they're down entirely.
Another disadvantage is that we'll be triggering what may be quite a
bit of autovacuum activity for some people, which could be painful.
On the plus side, they'll hopefully end up with sane relminmxid and
datminmxid guards afterwards.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4400fc5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -198,13 +198,24 @@ typedef struct MultiXactStateData
 	/* next-to-be-assigned offset */
 	MultiXactOffset nextOffset;
 
+	/* Have we completed multixact startup? */
+	bool		finishedStartup;
+
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that is still potentially referenced by a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
+
+	/*
+	 * Oldest multixact offset that is potentially referenced by a
+	 * multixact referenced by a relation.  We don't always know this value,
+	 * so there's a flag here to indicate whether or not we currently do.
+	 */
 	MultiXactOffset oldestOffset;
+	bool		oldestOffsetKnown;
 
 	/*
 	 * This is what the previous checkpoint stored as the truncate position.
@@ -221,6 +232,7 @@ typedef struct MultiXactStateData
 
 	/* support for members anti-wraparound measures */
 	MultiXactOffset offsetStopLimit;
+	bool offsetStopLimitKnown;
 
 	/*
 	 * Per-backend data starts here.  We have two arrays stored in the area
@@ -350,10 +362,11 @@ static bool MultiXactOffsetPrecedes(MultiXactOffset offset1,
 		MultiXactOffset offset2);
 static void ExtendMultiXactOffset(MultiXactId multi);
 static void ExtendMultiXactMember(MultiXactOffset offset, int nmembers);
-static void DetermineSafeOldestOffset(MultiXactId oldestMXact);
+static void DetermineSafeOldestOffset(MultiXactOffset oldestMXact);
 static bool MultiXactOffsetWouldWrap(MultiXactOffset boundary,
 		 MultiXactOffset start, uint32 distance);
-static MultiXactOffset find_multixact_start(MultiXactId multi);
+static bool SetOffsetVacuumLimit(bool finish_setup);
+static bool find_multixact_start(MultiXactId multi, MultiXactOffset *result);
 static void WriteMZero

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:

> > One idea I had was: what if the oldestMulti pointed to another multi
> > earlier in the same 0046 file, so that it is read-as-zeroes (and the
> > file is created), and then a subsequent multixact truncate tries to read
> > a later page in the file.  In SlruPhysicalReadPage() this would give a
> > chance for open() to not fail, and then read() can fail as above.
> > However, we would have an earlier LOG message about "reading as zeroes".
> > 
> > Really, the whole question of how this code goes past the open() failure
> > in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> > the file to be created ...
> 
> Wouldn't a previous WAL record zeroing another part of that segment
> explain this? A zero sized segment pretty much would lead to this error,
> right? Or were you able to check how things look after the failure?

But why would there be a previous WAL record zeroing another part of
that segment?  Note that this segment is very old -- hasn't been written
in quite a while, it's certainly not in slru buffers anymore.

> > 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> > 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "000173DD00AD" from archive
> > 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> > 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
> > 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> > 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure
> 
> From this it isn't entirely clear where this error was triggered from.

Well, reading code, it seems reasonable to assume that replay of
the checkpoint record I mentioned leads to that error message when the
file exists but is not long enough to contain the given offset.  There
are not MultiXact wal records in the segment.  Also note that there's no
other "restored log file" message after the "entering standby mode"
message.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> > That's not necessarily the case though, given how the code currently
> > works. In a bunch of places the SLRUs are accessed *before* having been
> > made consistent by WAL replay. Especially if several checkpoints/vacuums
> > happened during the base backup, the assumed state of the data directory
> > soon after the initial start (i.e. the state the mxact checkpoints refer
> > to) and the actual contents of pg_multixact/ won't necessarily match at all.
> 
> Well, the log extract is quite simple; it says that as soon as the
> standby starts replay WAL, it fetches exactly one WAL segment, which
> contains exactly one online checkpoint record; this record contains
> oldestMulti=4624559, which belongs in offset file 0046.  But the
> basebackup only contains files starting from 004B onwards.  The base
> backup takes a long time, so my guess of what happened is that the
> backup label points to the start of the WAL stream (obviously), then the
> data files are copied to the standby; during this long process, a
> multixact truncation happens.  So by the time the base backup reaches
> the pg_multixact directory, the 0046 file has been removed.

Yes. That's precisely what I was concerned about above and elsewhere in
the thread. It's simply not ok to access a SLRU while not yet
consistent...

> One idea I had was: what if the oldestMulti pointed to another multi
> earlier in the same 0046 file, so that it is read-as-zeroes (and the
> file is created), and then a subsequent multixact truncate tries to read
> a later page in the file.  In SlruPhysicalReadPage() this would give a
> chance for open() to not fail, and then read() can fail as above.
> However, we would have an earlier LOG message about "reading as zeroes".
> 
> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Wouldn't a previous WAL record zeroing another part of that segment
explain this? A zero sized segment pretty much would lead to this error,
right? Or were you able to check how things look after the failure?

> 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
> 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "000173DD00AD" from archive
> 2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
> 2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
> 2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
> 2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

From this it isn't entirely clear where this error was triggered from.

Greetings,

Andres Freund




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Really, the whole question of how this code goes past the open() failure
> in SlruPhysicalReadPage baffles me.  I don't see any possible way for
> the file to be created ...

Hmm, the checkpointer can call TruncateMultiXact when in recovery, on
restartpoints. I wonder if in recovery this could trigger file creation.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote:
> On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> > Thomas Munro wrote:
> > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote:
> > > > My guess is that the file existed, and perhaps had one or more pages,
> > > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > > back.  read() returns 0 in this case but doesn't set errno.
> > > >
> > > > I didn't find a way to set things so that the file exists but is of
> > > > shorter contents than oldestMulti by the time the checkpoint record is
> > > > replayed.
> > > 
> > > I'm just starting to learn about the recovery machinery, so forgive me
> > > if I'm missing something basic here, but I just don't get this.  As I
> > > understand it, offsets/0046 should either have been copied with that
> > > page present in it if it existed before the backup started (apparently
> > > not in this case), or extended to contain it by WAL records that come
> > > after the backup label but before the checkpoint record that
> > > references it (also apparently not in this case).
> 
> That's not necessarily the case though, given how the code currently
> works. In a bunch of places the SLRUs are accessed *before* having been
> made consistent by WAL replay. Especially if several checkpoints/vacuums
> happened during the base backup, the assumed state of the data directory
> soon after the initial start (i.e. the state the mxact checkpoints refer
> to) and the actual contents of pg_multixact/ won't necessarily match at all.

Well, the log extract is quite simple; it says that as soon as the
standby starts replay WAL, it fetches exactly one WAL segment, which
contains exactly one online checkpoint record; this record contains
oldestMulti=4624559, which belongs in offset file 0046.  But the
basebackup only contains files starting from 004B onwards.  The base
backup takes a long time, so my guess of what happened is that the
backup label points to the start of the WAL stream (obviously), then the
data files are copied to the standby; during this long process, a
multixact truncation happens.  So by the time the base backup reaches
the pg_multixact directory, the 0046 file has been removed.

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "000173DD00AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure
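
As an aside, the trailing ": Success" is what you get when a short read
is reported through strerror/%m while errno is still 0: read() returns 0
at end-of-file without setting errno.  A minimal standalone
demonstration of the effect (the file name is arbitrary):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	char		buf[8192];
	int			fd = open("short-segment", O_RDWR | O_CREAT, 0600);	/* zero length */
	ssize_t		n;

	if (fd < 0)
		return 1;

	errno = 0;
	n = pread(fd, buf, sizeof(buf), 147456);	/* read far past EOF */

	/*
	 * pread() returns 0 at EOF and leaves errno alone, so strerror(errno)
	 * yields "Success" -- exactly the confusing DETAIL above.
	 */
	if (n != (ssize_t) sizeof(buf))
		printf("could not read at offset 147456: %s\n", strerror(errno));
	close(fd);
	return 0;
}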

One idea I had was: what if the oldestMulti pointed to another multi
earlier in the same 0046 file, so that it is read-as-zeroes (and the
file is created), and then a subsequent multixact truncate tries to read
a later page in the file.  In SlruPhysicalReadPage() this would give a
chance for open() to not fail, and then read() can fail as above.
However, we would have an earlier LOG message about "reading as zeroes".

Really, the whole question of how this code goes past the open() failure
in SlruPhysicalReadPage baffles me.  I don't see any possible way for
the file to be created ...

> > Exactly --- that's the spot at which I am, also.  I have had this
> > spinning in my head for three days now, and tried every single variation
> > that I could think of, but like you I was unable to reproduce the issue.
> > However, our customer took a second base backup and it failed in exactly
> > the same way, modulo some changes to the counters (the file that
> > didn't exist was 004B rather than 0046).  I'm still at a loss as to what
> > the failure mode is.  We must be missing some crucial detail ...
> 
> I might have missed it in this already long thread. Could you share a
> bunch of details about the case? It'd be very interesting to see the
> contents of the backup label (to see where start/end are), the contents
> of the initial checkpoint (to see which mxacts we assume to exist at
> start) and what the initial contents of pg_multixact are (to match up).

pg_xlogdump says about the checkpoint record:

rmgr: XLOG    len (rec/tot): 72/104, tx: 0, lsn: 73DD/AD834470, prev 73DD/AD8343B8, bkp: ,
desc: checkpoint: redo 73DD/4003F648; tli 1; prev tli 1; fpw true; xid 1/3530575042;
oid 81303440; multi 14003656; offset 66311341; oldest xid 2530472730 in DB 17081243;
oldest multi 4624559 in DB 17081243; oldest running xid 3530551992; online

I don't have the backup label.  But a listing of the files in
pg_multixact/offsets in the master has this:

./offsets:
total 35816
drwx------ 2 postgres postgres   4096 May 27 16:25 .
drwx------ 4 postgres postgres   4096 Jan 3 03:32 ..
-rw------- 1 postgres postgres 262144 Apr 20 13:07 004

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote:
> Thomas Munro wrote:
> > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote:
> > > My guess is that the file existed, and perhaps had one or more pages,
> > > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > > back.  read() returns 0 in this case but doesn't set errno.
> > >
> > > I didn't find a way to set things so that the file exists but is of
> > > shorter contents than oldestMulti by the time the checkpoint record is
> > > replayed.
> > 
> > I'm just starting to learn about the recovery machinery, so forgive me
> > if I'm missing something basic here, but I just don't get this.  As I
> > understand it, offsets/0046 should either have been copied with that
> > page present in it if it existed before the backup started (apparently
> > not in this case), or extended to contain it by WAL records that come
> > after the backup label but before the checkpoint record that
> > references it (also apparently not in this case).

That's not necessarily the case though, given how the code currently
works. In a bunch of places the SLRUs are accessed *before* having been
made consistent by WAL replay. Especially if several checkpoints/vacuums
happened during the base backup, the assumed state of the data directory
soon after the initial start (i.e. the state the mxact checkpoints refer
to) and the actual contents of pg_multixact/ won't necessarily match at all.

> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss as to what
> the failure mode is.  We must be missing some crucial detail ...

I might have missed it in this already long thread. Could you share a
bunch of details about the case? It'd be very interesting to see the
contents of the backup label (to see where start/end are), the contents
of the initial checkpoint (to see which mxacts we assume to exist at
start) and what the initial contents of pg_multixact are (to match up).

Greetings,

Andres Freund




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Thomas Munro wrote:

> I have finally reproduced that error!  See attached repro shell script.
> 
> The conditions are:
> 
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
> 
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...

Thanks, this is pretty cool (as far as these things go), but it's not
the scenario I see, in which the complained-about segment is very old by
the time it's truncated away by a checkpoint after freeze.  Segment
requested by checkpoint.oldestMulti is 0046 (offset 140k something --
just to clarify it's not at a segment boundary), and the base backup
contains segments from 004B to 00D5.  My problem scenario has
oldestMulti close to 5 million and nextMulti close to 14 million.
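
Checking the arithmetic: a segment covers 2048 offsets/page * 32
pages/segment = 65536 multixacts, so 4624559 does land in 0046 and the
quoted nextMulti in 00D5.  A quick standalone verification:

#include <stdint.h>
#include <stdio.h>

#define MULTIXACT_OFFSETS_PER_PAGE 2048
#define SLRU_PAGES_PER_SEGMENT     32

int
main(void)
{
	uint32_t	multis[] = {4624559, 14003656};

	for (int i = 0; i < 2; i++)
	{
		uint32_t	segno = multis[i] /
			(MULTIXACT_OFFSETS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);

		/* prints "segment 0046" and "segment 00D5" */
		printf("multi %8u -> segment %04X\n", multis[i], segno);
	}
	return 0;
}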

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund  wrote:
>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
>> > the disk it'll always get one at a segment boundary, right? I'm not sure
>> > that's actually ok; because the value at the beginning of the segment
>> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
>> > filled the page with zeros.
>> >
>> > I think that should be harmless, the worst that can happen is that
>> > oldestOffset erroneously is 0, which should be correct, even though
>> > possibly overly conservative, in these cases.
>>
>> Uh oh.  That seems like a real bad problem for this approach.  What
>> keeps that from being the opposite of too conservative?  There's no
>> "safe" value in a circular numbering space.
>
> I think it *might* (I'm really jetlagged) be fine because that'll only
> happen after an upgrade from < 9.3. And in that case we initialize
> nextOffset to 0. That ought to save us?

That's pretty much betting the farm on the bugs we know about today
being the only ones there are.  That seems imprudent.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro wrote:
> On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera wrote:
>> Thomas Munro wrote:
>>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote:
>>> > My guess is that the file existed, and perhaps had one or more pages,
>>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>>> > back.  read() returns 0 in this case but doesn't set errno.
>>> >
>>> > I didn't find a way to set things so that the file exists but is of
>>> > shorter contents than oldestMulti by the time the checkpoint record is
>>> > replayed.
>>>
>>> I'm just starting to learn about the recovery machinery, so forgive me
>>> if I'm missing something basic here, but I just don't get this.  As I
>>> understand it, offsets/0046 should either have been copied with that
>>> page present in it if it existed before the backup started (apparently
>>> not in this case), or extended to contain it by WAL records that come
>>> after the backup label but before the checkpoint record that
>>> references it (also apparently not in this case).
>>
>> Exactly --- that's the spot at which I am, also.  I have had this
>> spinning in my head for three days now, and tried every single variation
>> that I could think of, but like you I was unable to reproduce the issue.
>> However, our customer took a second base backup and it failed in exactly
>> the same way, modulo some changes to the counters (the file that
>> didn't exist was 004B rather than 0046).  I'm still at a loss as to what
>> the failure mode is.  We must be missing some crucial detail ...
>
> I have finally reproduced that error!  See attached repro shell script.
>
> The conditions are:
>
> 1.  next multixact == oldest multixact (no active multixacts, pointing
> past the end)
> 2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
> 3.  the page must not be the first in a segment (or we'd get the
> read-zeroes case)
>
> That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
> vacuum followed by no multixact creations right before your backup
> checkpoint).  That seems like reasonably low odds... if it happened
> twice in a row, maybe I'm missing something here and there is some
> other way to get this...
>
> I realise now that this is actually a symptom of a problem spotted by
> Noah recently:
>
> http://www.postgresql.org/message-id/20150601045534.gb23...@tornado.leadboat.com
>
> He noticed the problem for segment boundaries, when not in recovery.
> In recovery, segment boundaries don't raise an error (the read-zeroes
> case applies), but page boundaries do.  The fix is probably to do
> nothing if they are the same, as we do elsewhere, like in the attached
> patch.

Actually, we still need to call DetermineSafeOldestOffset in that
case.  Otherwise, if someone goes from lots of multixacts to none, the
stop point won't advance.
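
That is, the early-exit path would need to look roughly like this (a
sketch with hypothetical local names, not the actual patch):

	if (oldestMXact == nextMXact)
	{
		/*
		 * Nothing to truncate, but the member-space stop point must
		 * still be recomputed, or it never advances once multixact
		 * usage drops to zero.
		 */
		DetermineSafeOldestOffset(oldestMXact);
		return;
	}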

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera  wrote:
> Thomas Munro wrote:
>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote:
>> > My guess is that the file existed, and perhaps had one or more pages,
>> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
>> > back.  read() returns 0 in this case but doesn't set errno.
>> >
>> > I didn't find a way to set things so that the file exists but is of
>> > shorter contents than oldestMulti by the time the checkpoint record is
>> > replayed.
>>
>> I'm just starting to learn about the recovery machinery, so forgive me
>> if I'm missing something basic here, but I just don't get this.  As I
>> understand it, offsets/0046 should either have been copied with that
>> page present in it if it existed before the backup started (apparently
>> not in this case), or extended to contain it by WAL records that come
>> after the backup label but before the checkpoint record that
>> references it (also apparently not in this case).
>
> Exactly --- that's the spot at which I am, also.  I have had this
> spinning in my head for three days now, and tried every single variation
> that I could think of, but like you I was unable to reproduce the issue.
> However, our customer took a second base backup and it failed in exactly
> the same way, modulo some changes to the counters (the file that
> didn't exist was 004B rather than 0046).  I'm still at a loss as to what
> the failure mode is.  We must be missing some crucial detail ...

I have finally reproduced that error!  See attached repro shell script.

The conditions are:

1.  next multixact == oldest multixact (no active multixacts, pointing
past the end)
2.  next multixact would be the first item on a new page (multixact % 2048 == 0)
3.  the page must not be the first in a segment (or we'd get the
read-zeroes case)

That gives you odds of 1/2048 * 31/32 * (probability of a wraparound
vacuum followed by no multixact creations right before your backup
checkpoint).  That seems like reasonably low odds... if it happened
twice in a row, maybe I'm missing something here and there is some
other way to get this...
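
For concreteness, conditions (2) and (3) in SLRU arithmetic -- 2048
offsets per 8kB page and 32 pages per segment, whence the 1/2048 * 31/32
odds (standalone, with the usual default constants):

#include <stdint.h>
#include <stdio.h>

#define MULTIXACT_OFFSETS_PER_PAGE 2048		/* BLCKSZ / sizeof(MultiXactOffset) */
#define SLRU_PAGES_PER_SEGMENT     32

int
main(void)
{
	uint32_t	multi = 67584;	/* example: next == oldest multixact */
	uint32_t	pageno = multi / MULTIXACT_OFFSETS_PER_PAGE;
	int			first_on_page = (multi % MULTIXACT_OFFSETS_PER_PAGE) == 0;
	int			first_in_segment = (pageno % SLRU_PAGES_PER_SEGMENT) == 0;

	printf("multi %u: page %u in segment %04X\n",
		   multi, pageno, pageno / SLRU_PAGES_PER_SEGMENT);
	/* Error case: a page boundary that is not also a segment boundary. */
	printf("hits the error case: %s\n",
		   (first_on_page && !first_in_segment) ? "yes" : "no");
	return 0;
}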

I realise now that this is actually a symptom of a problem spotted by
Noah recently:

http://www.postgresql.org/message-id/20150601045534.gb23...@tornado.leadboat.com

He noticed the problem for segment boundaries, when not in recovery.
In recovery, segment boundaries don't raise an error (the read-zeroes
case applies), but page boundaries do.  The fix is probably to do
nothing if they are the same, as we do elsewhere, like in the attached
patch.

-- 
Thomas Munro
http://www.enterprisedb.com


fix-truncate-none.patch
Description: Binary data


copy-page-boundary.sh
Description: Bourne shell script



Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Alvaro Herrera
Thomas Munro wrote:
> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote:
> > My guess is that the file existed, and perhaps had one or more pages,
> > but the wanted page doesn't exist, so we tried to read but got 0 bytes
> > back.  read() returns 0 in this case but doesn't set errno.
> >
> > I didn't find a way to set things so that the file exists but is of
> > shorter contents than oldestMulti by the time the checkpoint record is
> > replayed.
> 
> I'm just starting to learn about the recovery machinery, so forgive me
> if I'm missing something basic here, but I just don't get this.  As I
> understand it, offsets/0046 should either have been copied with that
> page present in it if it existed before the backup started (apparently
> not in this case), or extended to contain it by WAL records that come
> after the backup label but before the checkpoint record that
> references it (also apparently not in this case).

Exactly --- that's the spot at which I am, also.  I have had this
spinning in my head for three days now, and tried every single variation
that I could think of, but like you I was unable to reproduce the issue.
However, our customer took a second base backup and it failed in exactly
the same way, modulo some changes to the counters (the file that
didn't exist was 004B rather than 0046).  I'm still at a loss as to what
the failure mode is.  We must be missing some crucial detail ...


-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera  wrote:
> My guess is that the file existed, and perhaps had one or more pages,
> but the wanted page doesn't exist, so we tried to read but got 0 bytes
> back.  read() returns 0 in this case but doesn't set errno.
>
> I didn't find a way to set things so that the file exists but is of
> shorter contents than oldestMulti by the time the checkpoint record is
> replayed.

I'm just starting to learn about the recovery machinery, so forgive me
if I'm missing something basic here, but I just don't get this.  As I
understand it, offsets/0046 should either have been copied with that
page present in it if it existed before the backup started (apparently
not in this case), or extended to contain it by WAL records that come
after the backup label but before the checkpoint record that
references it (also apparently not in this case).  If neither of these
things happened then that is completely different from the
segment-does-not-exist case where we read zeroes if in recovery on the
assumption that later WAL records must be about to delete the file.
There is no way that future WAL records will make an existing segment
file shorter! So at this point don't we know that there is something
wrong with the backup itself?

Put another way, if you bring this up under 9.4.1, won't it also be
unable to access multixact 4624559 at this point?  Of course it won't
try to do so during recovery like 9.4.2 does, but I'm just trying to
understand how this is supposed to work for 9.4.1 if it needs to
access that multixact for other reasons once normal running is reached
(say you recover up to that checkpoint, and then run
pg_get_multixact_members, or a row has that xmax and its members need to
be looked up by a vacuum or any normal transaction).  In other words,
isn't this a base backup that is somehow broken, not at all like the
pg_upgrade corruption case which involved the specific case of
multixact 1 and an entirely missing segment file, and 9.4.2 just tells
you about it sooner than 9.4.1?

For what it's worth, I've also spent a lot of time trying to reproduce
basebackup problems with multixact creation, vacuums and checkpoints
injected at various points between copying backup label, pg_multixact,
and pg_control.  I have so far failed to produce anything more
interesting than the 'reading zeroes' case (see attached
"copy-after-truncation.sh") and a case where the control file points at
a segment that doesn't exist, but it doesn't matter because the backup
label points at a checkpoint from a time when it did and
oldestMultiXactId is updated from there, and then proceeds exactly as
it should (see "copy-before-truncation.sh").  I updated my scripts to
look a bit more like your nicely automated example (though mine use a
different trick to create small quantities of multixacts so they run
against unpatched master).  I have also been considering a scenario
where multixact ID wraparound occurs during basebackup with some
ordering that causes trouble, but I don't yet see why it would break
if you replay the WAL from the backup label checkpoint (and I think
the repro would take days/weeks to run...)

-- 
Thomas Munro
http://www.enterprisedb.com


copy-after-truncation.sh
Description: Bourne shell script


copy-before-truncation.sh
Description: Bourne shell script



Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> > the disk it'll always get one at a segment boundary, right? I'm not sure
> > that's actually ok; because the value at the beginning of the segment
> > can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> > filled the page with zeros.
> >
> > I think that should be harmless, the worst that can happen is that
> > oldestOffset erroneously is 0, which should be correct, even though
> > possibly overly conservative, in these cases.
> 
> Uh oh.  That seems like a real bad problem for this approach.  What
> keeps that from being the opposite of too conservative?  There's no
> "safe" value in a circular numbering space.

I think it *might* (I'm really jetlagged) be fine because that'll only
happen after an upgrade from < 9.3. And in that case we initialize
nextOffset to 0. That ought to save us?

Greetings,

Andres Freund




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund  wrote:
> I'm not really convinced tying things closer to having done trimming is
> easier to understand than tying things to recovery having finished.
>
> E.g.
> if (did_trim)
> oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
> imo is harder to understand than if (!InRecovery).
>
> Maybe we could just name it finishedStartup and rename the functions 
> accordingly?

Basing that particular call site on InRecovery doesn't work, because
InRecovery isn't set early enough.  But I'm fine to rename it to
whatever.

> Maybe it's worthwhile to add a 'NB: At this stage the data directory is
> not yet necessarily consistent' to StartupMultiXact's comments, to avoid
> reintroducing problems like this?

Sure.

>>   /*
>> +  * We can read this without a lock, because it only changes when nothing
>> +  * else is running.
>> +  */
>> + did_trim = MultiXactState->didTrimMultiXact;
>
> Err, Hot Standby? It might be ok to not lock, but the comment is
> definitely wrong. I'm inclined to simply use locking, this doesn't look
> sufficiently critical performancewise.

/me nods.  Good point.

> Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
> the disk it'll always get one at a segment boundary, right? I'm not sure
> that's actually ok; because the value at the beginning of the segment
> can very well end up being a 0, as MaybeExtendOffsetSlru() will have
> filled the page with zeros.
>
> I think that should be harmless, the worst that can happen is that
> oldestOffset erroneously is 0, which should be correct, even though
> possibly overly conservative, in these cases.

Uh oh.  That seems like a real bad problem for this approach.  What
keeps that from being the opposite of too conservative?  There's no
"safe" value in a circular numbering space.

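The comparisons involved are modular, in the style of
MultiXactOffsetPrecedes: the 32-bit difference is interpreted as signed,
so every value "precedes" exactly half of the space, and an erroneous 0
can land on either side of the horizon depending on where nextOffset
happens to be.  A standalone sketch of the comparison:

#include <stdint.h>
#include <stdio.h>

/* Modular comparison: a "precedes" b iff the signed difference is < 0. */
static int
precedes(uint32_t a, uint32_t b)
{
	return (int32_t) (a - b) < 0;
}

int
main(void)
{
	uint32_t	next = 3000000000u;	/* hypothetical nextOffset */

	/* 0 is more than 2^31 behind next, so it compares as *after* it: */
	printf("0 precedes next? %d\n", precedes(0, next));	/* prints 0 */
	printf("next precedes 0? %d\n", precedes(next, 0));	/* prints 1 */
	return 0;
}
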
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:

> commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
> Author: Robert Haas 
> Date:   Fri May 29 14:35:53 2015 -0400
> 
> foo

Hehe!

> diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
> index 9568ff1..aca829d 100644
> --- a/src/backend/access/transam/multixact.c
> +++ b/src/backend/access/transam/multixact.c
> @@ -199,8 +199,9 @@ typedef struct MultiXactStateData
>   MultiXactOffset nextOffset;
>  
>   /*
> -  * Oldest multixact that is still on disk.  Anything older than this
> -  * should not be consulted.  These values are updated by vacuum.
> +  * Oldest multixact that may still be referenced from a relation.
> +  * Anything older than this should not be consulted.  These values are
> +  * updated by vacuum.
>*/
>   MultiXactId oldestMultiXactId;
>   Oid oldestMultiXactDB;
> @@ -213,6 +214,18 @@ typedef struct MultiXactStateData
>*/
>   MultiXactId lastCheckpointedOldest;
>  
> + /*
> +  * This is the oldest file that actually exists on the disk.  This value
> +  * is initialized by scanning pg_multixact/offsets, and subsequently
> +  * updated each time we complete a truncation.  We need a flag to
> +  * indicate whether this has been initialized yet.
> +  */
> + MultiXactId oldestMultiXactOnDisk;
> + bool oldestMultiXactOnDiskValid;
> +
> + /* Has TrimMultiXact been called yet? */
> + bool didTrimMultiXact;

I'm not really convinced tying things closer to having done trimming is
easier to understand than tying things to recovery having finished.

E.g.
if (did_trim)
oldestOffset = GetOldestReferencedOffset(oldest_datminmxid);
imo is harder to understand than if (!InRecovery).

Maybe we could just name it finishedStartup and rename the functions 
accordingly?

> @@ -1956,12 +1971,6 @@ StartupMultiXact(void)
>*/
>   pageno = MXOffsetToMemberPage(offset);
>   MultiXactMemberCtl->shared->latest_page_number = pageno;
> -
> - /*
> -  * compute the oldest member we need to keep around to avoid old member
> -  * data overrun.
> -  */
> - DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
>  }

Maybe it's worthwhile to add a 'NB: At this stage the data directory is
not yet necessarily consistent' to StartupMultiXact's comments, to avoid
reintroducing problems like this?

>   /*
> +  * We can read this without a lock, because it only changes when nothing
> +  * else is running.
> +  */
> + did_trim = MultiXactState->didTrimMultiXact;

Err, Hot Standby? It might be ok to not lock, but the comment is
definitely wrong. I'm inclined to simply use locking, this doesn't look
sufficiently critical performancewise.

> +static MultiXactOffset
> +GetOldestReferencedOffset(MultiXactId oldestMXact)
> +{
> + MultiXactId earliest;
> + MultiXactOffset oldestOffset;
> +
> + /*
> +  * Because of bugs in early 9.3.X and 9.4.X releases (see comments in
> +  * TrimMultiXact for details), oldest_datminmxid might point to a
> +  * nonexistent multixact.  If so, use the oldest one that actually 
> +  * exists.  Anything before this can't be successfully used anyway.
> +  */
> + earliest = GetOldestMultiXactOnDisk();
> + if (MultiXactIdPrecedes(oldestMXact, earliest))
> + oldestMXact = earliest;

Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning
the disk it'll always get one at a segment boundary, right? I'm not sure
that's actually ok; because the value at the beginning of the segment
can very well end up being a 0, as MaybeExtendOffsetSlru() will have
filled the page with zeros.

I think that should be harmless, the worst that can happen is that
oldestOffset erroneously is 0, which should be correct, even though
possibly overly conservative, in these cases.

Greetings,

Andres Freund




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:49:56 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund  wrote:
> > On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> >> The exact circumstances under which we're willing to replace a
> >> relminmxid with a newly-computed one that differs are not altogether
> >> clear to me, but there's an "if" statement protecting that logic, so
> >> there are some circumstances in which we'll leave the existing value
> >> intact.
> >
> > I guess we'd have to change that then.
> 
> Yeah, but first we'd need to assess why it's like that.  Tom was the
> one who installed the current logic, but I haven't been able to fully
> understand it.

We're talking about:
	/* Similarly for relminmxid */
	if (MultiXactIdIsValid(minmulti) &&
		pgcform->relminmxid != minmulti &&
		(MultiXactIdPrecedes(pgcform->relminmxid, minmulti) ||
		 MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)))
	{
		pgcform->relminmxid = minmulti;
		dirty = true;
	}

right? Before that change (78db307bb/87f830e0ce) we only updated
relminmxid if the new value was newer than the old one. That's to avoid
values from going backwards, e.g. when a relation is first VACUUM
FREEZEd and then a normal VACUUM is performed (these values are
unfortunately based on the cutoff values instead of the observed
minima). The new thing is the
|| MultiXactIdPrecedes(ReadNextMultiXactId(), pgcform->relminmxid)
clause, which is there to recognize values from the future, e.g. the 1
erroneously left in place by pg_upgrade.

Makes sense?

> >> It would similarly do so when the oldest MXID reference in the
> >> relation is in fact 1, but that value can't be vacuumed away yet.
> >
> > I'd thought of just forcing consumption of one multixact in that
> > case. Not pretty, but imo acceptable.
> 
> What if multixact 1 still has living members?

I think we should just trigger the logic if 1 is below the multi
freezing horizon - in which case a) the 1 will automatically be
replaced, because the horizon is newer b) it can't have a living member
anyway.

- Andres




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund  wrote:
> On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
>> The exact circumstances under which we're willing to replace a
>> relminmxid with a newly-computed one that differs are not altogether
>> clear to me, but there's an "if" statement protecting that logic, so
>> there are some circumstances in which we'll leave the existing value
>> intact.
>
> I guess we'd have to change that then.

Yeah, but first we'd need to assess why it's like that.  Tom was the
one who installed the current logic, but I haven't been able to fully
understand it.

>> It would similarly do so when the oldest MXID reference in the
>> relation is in fact 1, but that value can't be vacuumed away yet.
>
> I'd thought of just forcing consumption of one multixact in that
> case. Not pretty, but imo acceptable.

What if multixact 1 still has living members?

>> Also, the database might be really big.  Even if it were true that a
>> full scan of every table would get us out of this state, describing
>> the time that it would take to do that as "relatively short" seems to
>> me to be considerably understating the impact of a full-cluster
>> VACUUM.
>
> Well. We're dealing with a corrupted cluster. Having a way out that's
> done automatically, even if it takes a long while, is pretty good
> already. In many cases the corruption (i.e. pg_upgrade) happened long
> ago, so the table's relminmxid will already have been recomputed.  I
> think that's acceptable.

I'm a long way from being convinced the automated recovery is
possible.  There are so many different scenarios here that it's very
difficult to reason generally about what the "right" thing to do is.
I agree it would be nice if we had it, though.

>> With regard to the more general question of WAL-logging this, are you
>> going to work on that?  Are you hoping Alvaro or I will work on that?
>> Should we draw straws?  It seems like somebody needs to do it.
>
> I'm willing to invest the time to develop an initial version, but I'll
> need help evaluating it. I don't have many testing resources available
> atm, and I'm not going to trust stuff I developed while travelling by
> just looking at the code.

I'm willing to help with that.  Hopefully I'm not the only one, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund  wrote:
>> That would be a departure from the behavior of every existing release
>> that includes this code based on, to my knowledge, zero trouble
>> reports.
>
> On the other hand we're now at about bug #5 attributable to the odd way
> truncation works for multixacts. It's obviously complex and hard to get
> right. It makes it harder to cope with the wrong values left in
> datminmxid etc. So I'm still wondering whether fixing this for good isn't
> the better approach.

It may well be.  But I think we should do something more surgical
first.  Perhaps we can justify the pain and risk of making changes to
the WAL format in the back-branches, but let's not do it in a rush.
If we can get this patch to a state where it undoes the damage
inflicted in 9.3.7/9.4.2, then we will be in a state where we have as
much reliability as we had in 9.3.6 plus the protections against
member-space wraparound added in 9.3.7 - which, like the patch I'm
proposing now, were directly motivated by multiple, independent bug
reports.  That seems like a good place to get to.  If nothing else, it
will buy us some time to figure out what else we want to do.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:37:02 -0400, Robert Haas wrote:
> The exact circumstances under which we're willing to replace a
> relminmxid with a newly-computed one that differs are not altogether
> clear to me, but there's an "if" statement protecting that logic, so
> there are some circumstances in which we'll leave the existing value
> intact.

I guess we'd have to change that then.

> It would similarly do so when the oldest MXID reference in the
> relation is in fact 1, but that value can't be vacuumed away yet.

I'd thought of just forcing consumption of one multixact in that
case. Not pretty, but imo acceptable.

> Also, the database might be really big.  Even if it were true that a
> full scan of every table would get us out of this state, describing
> the time that it would take to do that as "relatively short" seems to
> me to be considerably understating the impact of a full-cluster
> VACUUM.

Well. We're dealing with a corrupted cluster. Having a way out that's
done automatically, even if it takes a long while, is pretty good
already. In many cases the corruption (i.e. pg_upgrade) happened long
ago, so the table's relminmxid will already have been recomputed.  I
think that's acceptable.

> With regard to the more general question of WAL-logging this, are you
> going to work on that?  Are you hoping Alvaro or I will work on that?
> Should we draw straws?  It seems like somebody needs to do it.

I'm willing to invest the time to develop an initial version, but I'll
need help evaluating it. I don't have many testing resources available
atm, and I'm not going to trust stuff I developed while travelling by
just looking at the code.

Greetings,

Andres Freund




Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund  wrote:
> On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
>> I'm having trouble figuring out what to do about this.  I mean, the
>> essential principle of this patch is that if we can't count on
>> relminmxid, datminmxid, or the control file to be accurate, we can at
>> least look at what is present on the disk.  If we also cannot count on
>> that to be accurate, we are left without any reliable source of
>> information.  Consider a hypothetical cluster where all our stored
>> minmxids of whatever form are corrupted (say, all change to 1) and in
>> addition there are stray files in pg_multixact.  I don't think there's
>> really any way to get ourselves out of trouble in that scenario.
>
> If we were to truncate after vacuum, and only on the primary (via WAL
> logging), we could, afaics, just rely on all the values to be
> recomputed. I mean relminmxid will be recomputed after a vacuum, and
> thus, after some time, will datminmxid and the control file value.  We
> could just force a value of 1 to always trigger anti-wraparound vacuums
> (or wait for that to happen implicitly, to delay the impact?). That
> should then fix the problem in a relatively short amount of time?

The exact circumstances under which we're willing to replace a
relminmxid with a newly-computed one that differs are not altogether
clear to me, but there's an "if" statement protecting that logic, so
there are some circumstances in which we'll leave the existing value
intact.  If we force non-stop vacuuming in that scenario, autovacuum
will just run like crazy without accomplishing anything, which
wouldn't be good.  It would similarly do so when the oldest MXID
reference in the relation is in fact 1, but that value can't be
vacuumed away yet.

Also, the database might be really big.  Even if it were true that a
full scan of every table would get us out of this state, describing
the time that it would take to do that as "relatively short" seems to
me to be considerably understating the impact of a full-cluster
VACUUM.

With regard to the more general question of WAL-logging this, are you
going to work on that?  Are you hoping Alvaro or I will work on that?
Should we draw straws?  It seems like somebody needs to do it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:29:24 -0400, Robert Haas wrote:
> On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund  wrote:
> > But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> > this scenario now (since a couple weeks ago) does a
> > SimpleLruReadPage_ReadOnly() in the members slru via
> > find_multixact_start(). That just won't work acceptably when we're not
> > yet consistent. There very well could not be a valid members segment at
> > that point?  Am I missing something?
> 
> Yes: that code isn't new.

Good point.

> TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
> 9.3.0 and every subsequent release until 9.3.7/9.4.2.

But back then TruncateMultiXact() wasn't called during recovery. But
you're right in that it doesn't seem to have produced attributable
bug reports since 9.3.2, where vacuuming during recovery was
introduced. So it indeed doesn't seem as urgent as fixing the new
call sites.

> That would be a departure from the behavior of every existing release
> that includes this code based on, to my knowledge, zero trouble
> reports.

On the other hand we're now at about bug #5 attributable to the odd way
truncation works for multixacts. It's obviously complex and hard to get
right. It makes it harder to cope with the wrong values left in
datminmxid etc. So I'm still wondering whether fixing this for good isn't
the better approach.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund  wrote:
> But what *definitely* looks wrong to me is that a TruncateMultiXact() in
> this scenario now (since a couple weeks ago) does a
> SimpleLruReadPage_ReadOnly() in the members slru via
> find_multixact_start(). That just won't work acceptably when we're not
> yet consistent. There very well could not be a valid members segment at
> that point?  Am I missing something?

Yes: that code isn't new.

TruncateMultiXact() called SimpleLruReadPage_ReadOnly() directly in
9.3.0 and every subsequent release until 9.3.7/9.4.2.  The only thing
that's changed is that we've moved that logic into a function called
find_multixact_start() instead of having it directly inside that
function.  We did that because we needed to use the same logic in some
other places.  The reason why 9.3.7/9.4.2 are causing problems for
people that they didn't have previously is because those new,
additional call sites were poorly chosen and didn't include adequate
protection against calling that function with an invalid input value.
What this patch is about is getting back to the situation that we were
in from 9.3.0 - 9.3.6 and 9.4.0 - 9.4.1, where TruncateMultiXact() did
the thing that you're complaining about here but no one else did.

From my point of view, I think that you are absolutely right to
question what's going on in TruncateMultiXact().  It's kooky, and
there may well be bugs buried there.  But I don't think fixing that
should be the priority right now, because we have zero reports of
problems attributable to that logic.  I think the priority should be
on undoing the damage that we did in 9.3.7/9.4.2, when we made other
places do the same thing.  We started getting trouble reports
attributable to those changes *almost immediately*, which means that
whether or not TruncateMultiXact() is broken, these new call sites
definitely are.  I think we really need to fix those new places ASAP.

> I think at the very least we'll have to skip this step while not yet
> consistent. That really sucks, because we'll possibly end up with
> multixacts that are completely filled by the time we've reached
> consistency.

That would be a departure from the behavior of every existing release
that includes this code based on, to my knowledge, zero trouble
reports.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:16:22 -0400, Robert Haas wrote:
> I'm having trouble figuring out what to do about this.  I mean, the
> essential principle of this patch is that if we can't count on
> relminmxid, datminmxid, or the control file to be accurate, we can at
> least look at what is present on the disk.  If we also cannot count on
> that to be accurate, we are left without any reliable source of
> information.  Consider a hypothetical cluster where all our stored
> minmxids of whatever form are corrupted (say, all change to 1) and in
> addition there are stray files in pg_multixact.  I don't think there's
> really any way to get ourselves out of trouble in that scenario.

If we were to truncate after vacuum, and only on the primary (via WAL
logging), we could, afaics, just rely on all the values to be
recomputed. I mean relminmxid will be recomputed after a vacuum, and
thus, after some time, will datminmxid and the control file value.  We
could just force a value of 1 to always trigger anti-wraparound vacuums
(or wait for that to happen implicitly, to delay the impact?). That
should then fix the problem in a relatively short amount of time?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund  wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
> 
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust.  It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made to work without problems. But I
think robust is something different. Having to look at slrus, during
recovery, to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up erroneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
> 
> OK, but I am interested in discussing the other thing too.  I just
> can't piece together the scenario myself - there may well be one.  The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup.  But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure if it's reproducibly borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   inbetween
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something this will lead to a crash recovery startup
where the first call to TruncateMultiXact() will have
MultiXactState->lastCheckpointedOldest wildly inconsistent with the
GetOldestMultiXactOnDisk() return value, possibly leading to truncation
being skipped erroneously.  Whether that's a problem I'm not yet
entirely sure.

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent. There very well could not be a valid members segment at
that point?  Am I missing something?

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
> 
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk.  That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll
be called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other. Or
in themselves for that matter; the copy could very well have caught an
slru page in the middle of a write.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote:

> Anyway here's a quick script to almost-reproduce the problem.

Meh.  Really attached now.

I also wanted to post the error messages we got:

2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode
2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: restored log file "000173DD00AD" from archive
2015-05-27 16:15:18 UTC [4782]: [5-1] user=,db= FATAL: could not access status of transaction 4624559
2015-05-27 16:15:18 UTC [4782]: [6-1] user=,db= DETAIL: Could not read from file "pg_multixact/offsets/0046" at offset 147456: Success.
2015-05-27 16:15:18 UTC [4778]: [4-1] user=,db= LOG: startup process (PID 4782) exited with exit code 1
2015-05-27 16:15:18 UTC [4778]: [5-1] user=,db= LOG: aborting startup due to startup process failure

We pg_xlogdumped the offending segment and see that there is a
checkpoint record with oldestMulti=4624559.  Curiously, there are no
other records with rmgr=MultiXact in that segment.

Note that the file exists (otherwise the error would say "could not
open" rather than "could not read"); this is also why errno says
"Success" rather than some actual error; this is the code:

errno = 0;
if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
{
    slru_errcause = SLRU_READ_FAILED;
    slru_errno = errno;
    CloseTransientFile(fd);
    return false;
}

My guess is that the file existed, and perhaps had one or more pages,
but the wanted page didn't exist, so we tried to read it but got 0 bytes
back.  read() returns 0 in this case but doesn't set errno.
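
That read() behavior is easy to confirm in isolation with a minimal
standalone C sketch (the file path here is made up for the demo):

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
    char        buf[8192];
    ssize_t     nread;
    int         fd = open("/tmp/slru_demo", O_CREAT | O_RDWR | O_TRUNC, 0600);

    write(fd, "x", 1);          /* file is only 1 byte long */

    /* position at a page that lies entirely beyond EOF, as slru.c would */
    lseek(fd, 8192, SEEK_SET);

    errno = 0;
    nread = read(fd, buf, sizeof(buf));

    /* prints "read returned 0, errno 0 (Success)" on glibc */
    printf("read returned %zd, errno %d (%s)\n",
           nread, errno, strerror(errno));
    close(fd);
    return 0;
}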

I didn't find a way to set things so that the file exists but is of
shorter contents than oldestMulti by the time the checkpoint record is
replayed.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


repro-chkpt-replay-failure.sh
Description: Bourne shell script

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Robert Haas wrote:

> > In the process of investigating this, we found a few other things that
> > seem like they may also be bugs:
> > 
> > - As noted upthread, replaying an older checkpoint after a newer
> > checkpoint has already happened may lead to similar problems.  This
> > may be possible when starting from an online base backup; or when
> > restarting a standby that did not perform a restartpoint when
> > replaying the last checkpoint before the shutdown.
> 
> I'm going through this one now, as it's closely related and caused
> issues for us.

FWIW I've spent a rather long while trying to reproduce the issue, but
haven't been able to figure out how.  Thomas already commented on it: the
server notices that a file is missing and, instead of failing, it "reads
the file as zeroes".  This is because of this hack in slru.c, which is
there to cover for a pg_clog replay consideration:

/*
 * In a crash-and-restart situation, it's possible for us to receive
 * commands to set the commit status of transactions whose bits are in
 * already-truncated segments of the commit log (see notes in
 * SlruPhysicalWritePage).  Hence, if we are InRecovery, allow the case
 * where the file doesn't exist, and return zeroes instead.
 */
fd = OpenTransientFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
if (fd < 0)
{
    if (errno != ENOENT || !InRecovery)
    {
        slru_errcause = SLRU_OPEN_FAILED;
        slru_errno = errno;
        return false;
    }

    ereport(LOG,
            (errmsg("file \"%s\" doesn't exist, reading as zeroes",
                    path)));
    MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
    return true;
}

I was able to cause an actual problem by #ifdefing out the "if
errno/InRecovery" line, of course, but how can I be sure that fixing that
would also fix my customer's problem?  That's no good.

Anyway here's a quick script to almost-reproduce the problem.  I
constructed many variations of this, trying to find the failing one, to
no avail.  Note I'm using the pg_burn_multixact() function to create
multixacts quickly (this is cheating in itself, but it seems to me that
if I'm not able to reproduce the problem with this, I'm not able to do
so without it either.)  This script takes an exclusive backup
(cp -pr) excluding pg_multixact, then does some multixact stuff
(including truncation), then copies pg_multixact.  Evidently I'm missing
some critical element but I can't see what it is.

Another notable variant was to copy pg_multixact first, then do stuff
(create lots of multixacts, then mxact-freeze/checkpoint),
then copy the rest of the data dir.  No joy either: when replay starts,
the missing multixacts are created on the standby and so by the time we
reach the checkpoint record the pg_multixact/offset file has already
grown to the necessary size.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Thomas Munro wrote:

> > - There's a third possible problem related to boundary cases in
> > SlruScanDirCbRemoveMembers, but I don't understand that one well
> > enough to explain it.  Maybe Thomas can jump in here and explain the
> > concern.
> 
> I noticed something in passing which is probably not harmful, and not
> relevant to this bug report, it was just a bit confusing while
> testing: 

Another thing I noticed in passing is that
SimpleLruDoesPhysicalPageExist uses SLRU_OPEN_FAILED in slru_errcause for
the case where lseek() fails, rather than SLRU_SEEK_FAILED.  This is
probably completely harmless, because that's lseek(..., SEEK_END), but
seems rather odd anyway.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund  wrote:
>> I'm probably biased here, but I think we should finish reviewing,
>> testing, and committing my patch before we embark on designing this.
>
> Probably, yes. I am wondering whether doing this immediately won't end
> up making some things simpler and more robust though.

I'm open to being convinced of that, but as of this moment I'm not
seeing any clear-cut evidence that we need to go so far.

>> So far we have no reports of trouble attributable to the lack of the
>> WAL-logging support discussed here, as opposed to several reports of
>> trouble from the status quo within days of release.
>
> The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> era, where we didn't do any truncation during recovery...

Right, but now we're piggybacking on the checkpoint records, and I
don't have any evidence that this approach can't be made robust.  It's
possible that it can't be made robust, but that's not currently clear.

>> By the time we've reached the minimum recovery point, they will have
>> been recreated by the same WAL records that created them in the first
>> place.
>
> I'm not sure that's true. I think we could end up erroneously removing
> files that were included in the base backup. Anyway, let's focus on your
> patch for now.

OK, but I am interested in discussing the other thing too.  I just
can't piece together the scenario myself - there may well be one.  The
base backup will begin replay from the checkpoint caused by
pg_start_backup() and remove anything that wasn't there at the start
of the backup.  But all of that stuff should get recreated by the time
we reach the minimum recovery point (end of backup).

>> If, in the previous
>> replay, we had wrapped all the way around, some of the stuff we keep
>> may actually already have been overwritten by future WAL records, but
>> they'll be overwritten again.  Now, that could mess up our
>> determination of which members to remove, I guess, but I'm not clear
>> that actually matters either: if the offsets space has wrapped around,
>> the members space will certainly have wrapped around as well, so we
>> can remove anything we like at this stage and we're still OK.  I agree
>> this is ugly the way it is, but where is the actual bug?
>
> I'm more worried about the cases where we didn't ever actually "badly
> wrap around" (i.e. overwrite needed data); but where that's not clear on
> the standby because the base backup isn't in a consistent state.

I agree. The current patch tries to make it so that we never call
find_multixact_start() while in recovery, but it doesn't quite
succeed: the call in TruncateMultiXact still happens during recovery,
but only once we're sure that the mxact we plan to call it on actually
exists on disk.  That won't be called until we replay the first
checkpoint, but that might still be prior to consistency.

Since I forgot to attach the revised patch with fixes for the points
Noah mentioned to that email, here it is attached to this one.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9
Author: Robert Haas 
Date:   Fri May 29 14:35:53 2015 -0400

foo

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..aca829d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -199,8 +199,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Oldest multixact that may still be referenced from a relation.
+	 * Anything older than this should not be consulted.  These values are
+	 * updated by vacuum.
 	 */
 	MultiXactId oldestMultiXactId;
 	Oid			oldestMultiXactDB;
@@ -213,6 +214,18 @@ typedef struct MultiXactStateData
 	 */
 	MultiXactId lastCheckpointedOldest;
 
+	/*
+	 * This is the oldest file that actually exists on the disk.  This value
+	 * is initialized by scanning pg_multixact/offsets, and subsequently
+	 * updated each time we complete a truncation.  We need a flag to
+	 * indicate whether this has been initialized yet.
+	 */
+	MultiXactId oldestMultiXactOnDisk;
+	bool		oldestMultiXactOnDiskValid;
+
+	/* Has TrimMultiXact been called yet? */
+	bool		didTrimMultiXact;
+
 	/* support for anti-wraparound measures */
 	MultiXactId multiVacLimit;
 	MultiXactId multiWarnLimit;
@@ -344,6 +357,8 @@ static char *mxstatus_to_string(MultiXactStatus status);
 /* management of SLRU infrastructure */
 static int	ZeroMultiXactOffsetPage(int pageno, bool writeXlog);
 static int	ZeroMultiXactMemberPage(int pageno, bool writeXlog);
+static MultiXactOffset GetOldestMultiXactOnDisk(void);
+static MultiXactOffset GetOldestReferencedOffset(MultiXactId oldestMXact);
 static bool MultiXactOffsetPagePrecedes(int page1, int page2);
 static 

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Andres Freund
On 2015-05-31 07:51:59 -0400, Robert Haas wrote:
> > 1) We continue determining the oldest 
> > SlruScanDirectory(SlruScanDirCbFindEarliest)
> >on the master to find the oldest offsets segment to
> >truncate. Alternatively, if we determine it to be safe, we could use
> >oldestMulti to find that.
> > 2) SlruScanDirCbRemoveMembers is changed to return the range of members
> >to remove, instead of doing itself
> > 3) We wal log [oldest offset segment guaranteed to not be alive,
> >nextmulti) for offsets, and [oldest members segment guaranteed to not be 
> > alive,
> >nextmultioff), and redo truncations for the entire range during
> >recovery.
> >
> > I'm pretty tired right now, but this sounds doable.
> 
> I'm probably biased here, but I think we should finish reviewing,
> testing, and committing my patch before we embark on designing this.

Probably, yes. I am wondering whether doing this immediately won't end
up making some things simpler and more robust though.

> So far we have no reports of trouble attributable to the lack of the
> WAL-logging support discussed here, as opposed to several reports of
> trouble from the status quo within days of release.

The lack of WAL logging actually has caused problems in the 9.3.3 (?)
era, where we didn't do any truncation during recovery...

> By the time we've reached the minimum recovery point, they will have
> been recreated by the same WAL records that created them in the first
> place.

I'm not sure that's true. I think we could end up erroneously removing
files that were included in the base backup. Anyway, let's focus on your
patch for now.

> If, in the previous
> replay, we had wrapped all the way around, some of the stuff we keep
> may actually already have been overwritten by future WAL records, but
> they'll be overwritten again.  Now, that could mess up our
> determination of which members to remove, I guess, but I'm not clear
> that actually matters either: if the offsets space has wrapped around,
> the members space will certainly have wrapped around as well, so we
> can remove anything we like at this stage and we're still OK.  I agree
> this is ugly the way it is, but where is the actual bug?

I'm more worried about the cases where we didn't ever actually "badly
wrap around" (i.e. overwrite needed data); but where that's not clear on
the standby because the base backup isn't in a consistent state.

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Robert Haas
On Sat, May 30, 2015 at 8:55 PM, Andres Freund  wrote:
> Is oldestMulti, nextMulti - 1 really suitable for this? Are both
> actually guaranteed to exist in the offsets slru and be valid?  Hm. I
> guess you intend to simply truncate everything else, but just in
> offsets?

oldestMulti in theory is the right thing, I think, but in actuality we
know that some people have 1 here instead of the correct value.

>> One argument against this idea is that we may not want to keep a full
>> set of member files on standbys (due to disk space usage), but that's
>> what will happen unless we truncate during replay.
>
> I think that argument is pretty much the death-knell.

Yes.  Truncating on the standby is really not optional.

>> > I think at least for 9.5+ we should a) invent proper truncation records
>> > for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
>> > The current hack of scanning the directories to get knowledge we should
>> > have is a pretty bad hack, and we should not continue using it forever.
>> > I think we might end up needing to do a) even in the backbranches.
>>
>> Definitely agree with WAL-logging truncations; also +1 on backpatching
>> that to 9.3.  We already have experience with adding extra WAL records
>> on minor releases, and it didn't seem to have bitten too hard.
>
> I'm inclined to agree. My only problem is that I'm not sure whether we
> can find a way of doing all this without adding a pg_control field. Let
> me try to sketch this out:
>
> 1) We continue determining the oldest 
> SlruScanDirectory(SlruScanDirCbFindEarliest)
>on the master to find the oldest offsets segment to
>truncate. Alternatively, if we determine it to be safe, we could use
>oldestMulti to find that.
> 2) SlruScanDirCbRemoveMembers is changed to return the range of members
>to remove, instead of doing itself
> 3) We wal log [oldest offset segment guaranteed to not be alive,
>nextmulti) for offsets, and [oldest members segment guaranteed to not be 
> alive,
>nextmultioff), and redo truncations for the entire range during
>recovery.
>
> I'm pretty tired right now, but this sounds doable.

I'm probably biased here, but I think we should finish reviewing,
testing, and committing my patch before we embark on designing this.
So far we have no reports of trouble attributable to the lack of the
WAL-logging support discussed here, as opposed to several reports of
trouble from the status quo within days of release.

I'm having trouble reconstructing the series of events where what
you're worried about here really becomes a problem, and I think we
ought to have a very clear idea about that before back-patching
changes of this type.  It's true that if the state of the SLRU
directory is in the future, because recovery was restarted from an
earlier checkpoint, we might replay a checkpoint and remove some of
those files from the future.  But so what?  By the time we've reached
the minimum recovery point, they will have been recreated by the same
WAL records that created them in the first place.  If, in the previous
replay, we had wrapped all the way around, some of the stuff we keep
may actually already have been overwritten by future WAL records, but
they'll be overwritten again.  Now, that could mess up our
determination of which members to remove, I guess, but I'm not clear
that actually matters either: if the offsets space has wrapped around,
the members space will certainly have wrapped around as well, so we
can remove anything we like at this stage and we're still OK.  I agree
this is ugly the way it is, but where is the actual bug?

As far as your actual outline goes, I think if we do this, we need to
be very careful about step #2.  Right now, we decide what we need to
keep and then remove everything else, but that's kind of wonky because
new stuff may be getting created at the same time, so we keep
adjusting our idea of exactly what needs to be removed.  It would be
far better to invert that logic: decide what needs to be removed -
presumably, everything from the oldest member that now exists up until
some later point - and then remove precisely that stuff and nothing
else.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-30 Thread Andres Freund
On 2015-05-30 00:52:37 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
>
> > I considered for a second whether the solution for that could be to not
> > truncate while inconsistent - but I think that doesn't solve anything as
> > then we can end up with directories where every single offsets/member
> > file exists.
>
> Hang on a minute.  We don't need to scan any files to determine the
> truncate point for offsets; we have the valid range for them in
> pg_control, as nextMulti + oldestMulti.  And using those end points, we
> can look for the offsets corresponding to each, and determine the member
> files corresponding to the whole set; it doesn't matter what other files
> exist, we just remove them all.  In other words, maybe we can get away
> with considering truncation separately for offset and members on
> recovery: do it like today for offsets (i.e. at each restartpoint), but
> do it only in TrimMultiXact for members.

Is oldestMulti, nextMulti - 1 really suitable for this? Are both
actually guaranteed to exist in the offsets slru and be valid?  Hm. I
guess you intend to simply truncate everything else, but just in
offsets?

> One argument against this idea is that we may not want to keep a full
> set of member files on standbys (due to disk space usage), but that's
> what will happen unless we truncate during replay.

I think that argument is pretty much the death-knell.

> > I think at least for 9.5+ we should a) invent proper truncation records
> > for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> > The current hack of scanning the directories to get knowledge we should
> > have is a pretty bad hack, and we should not continue using it forever.
> > I think we might end up needing to do a) even in the backbranches.
>
> Definitely agree with WAL-logging truncations; also +1 on backpatching
> that to 9.3.  We already have experience with adding extra WAL records
> on minor releases, and it didn't seem to have bitten too hard.

I'm inclined to agree. My only problem is that I'm not sure whether we
can find a way of doing all this without adding a pg_control field. Let
me try to sketch this out:

1) We continue determining the oldest 
SlruScanDirectory(SlruScanDirCbFindEarliest)
   on the master to find the oldest offsets segment to
   truncate. Alternatively, if we determine it to be safe, we could use
   oldestMulti to find that.
2) SlruScanDirCbRemoveMembers is changed to return the range of members
   to remove, instead of doing itself
3) We wal log [oldest offset segment guaranteed to not be alive,
   nextmulti) for offsets, and [oldest members segment guaranteed to not be 
alive,
   nextmultioff), and redo truncations for the entire range during
   recovery.

I'm pretty tired right now, but this sounds doable.
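
To make 3) concrete, the truncation record might carry something like
the following (a hypothetical layout for illustration only, not a
committed design; the type and field names here are made up):

#include <stdint.h>

typedef uint32_t MultiXactId;       /* uint32 in the backend's c.h */
typedef uint32_t MultiXactOffset;

/*
 * Hypothetical multixact truncation WAL record: redo removes exactly
 * the [start, end) ranges below, with no directory scan needed on the
 * standby.
 */
typedef struct xl_multixact_truncate
{
    /* [start, end) range of offsets space to remove */
    MultiXactId     startTruncOff;
    MultiXactId     endTruncOff;    /* i.e. nextMulti at truncation time */

    /* [start, end) range of members space to remove */
    MultiXactOffset startTruncMemb;
    MultiXactOffset endTruncMemb;   /* i.e. nextOffset at truncation time */
} xl_multixact_truncate;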

Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Bruce Momjian wrote:

> I think we need to step back and look at the brain power required to
> unravel the mess we have made regarding multi-xact and fixes.  (I bet
> few people can even remember which multi-xact fixes went into which
> releases --- I can't.)  Instead of working on actual features, we are
> having to do this complex diagnosis because we didn't do a thorough
> analysis at the time a pattern of multi-xact bugs started to appear. 
> Many projects deal with this compound bug debt regularly, but we have
> mostly avoided this fate.

Well, it's pretty obvious that if we had had a glimpse of the nature of
the issues back then, we wouldn't have committed the patch.  The number
of ends that we left loose we now know to be huge, but we didn't know
that back then.  (I, at least, certainly didn't.)

Simon told me when this last one showed up that what we need at this
point is a way to turn the whole thing off to stop it from affecting
users anymore.  I would love to be able to do that, because the whole
situation has become stressing, but I don't see a way.  Heck, if we
could implement Heikki's TED idea or something similar, I would be up
for back-patching it so that people can pg_upgrade from postgresql-9.3
to postgresql-ted-9.3, and just forget any further multixact pain.
Don't think that's really doable, though.  As far as I can see, for
branches 9.3 and 9.4 the best we can do is soldier on and get these bugs
fixed, hoping that this time they are really the last [serious] ones.

For 9.5, I concur with Andres that we'd do well to change the way
truncations are done by WAL-logging more stuff and keeping more data in
pg_control, to avoid all these nasty games.  And for 9.6, find a better
representation of the data so that the durable data is stored separately
from the volatile data.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Andres Freund wrote:

> I considered for a second whether the solution for that could be to not
> truncate while inconsistent - but I think that doesn't solve anything as
> then we can end up with directories where every single offsets/member
> file exists.

Hang on a minute.  We don't need to scan any files to determine the
truncate point for offsets; we have the valid range for them in
pg_control, as nextMulti + oldestMulti.  And using those end points, we
can look for the offsets corresponding to each, and determine the member
files corresponding to the whole set; it doesn't matter what other files
exist, we just remove them all.  In other words, maybe we can get away
with considering truncation separately for offset and members on
recovery: do it like today for offsets (i.e. at each restartpoint), but
do it only in TrimMultiXact for members.
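
For what it's worth, mapping a multixact ID to its offsets segment is
plain arithmetic, so both end points are cheap to locate.  A standalone
sketch using the usual constants (8192-byte pages, 4-byte offset
entries, 32 pages per SLRU segment); note that multi 4624559 lands in
segment 0046, matching the failure report quoted elsewhere in this
thread:

#include <stdint.h>
#include <stdio.h>

#define BLCKSZ                      8192
#define MULTIXACT_OFFSETS_PER_PAGE  (BLCKSZ / sizeof(uint32_t))     /* 2048 */
#define SLRU_PAGES_PER_SEGMENT      32

int
main(void)
{
    uint32_t    multis[] = {1, 4624559, 3000000000U};
    int         i;

    for (i = 0; i < 3; i++)
    {
        unsigned    pageno = (unsigned) (multis[i] / MULTIXACT_OFFSETS_PER_PAGE);
        unsigned    segno = pageno / SLRU_PAGES_PER_SEGMENT;

        /* SLRU segment files are named with four uppercase hex digits */
        printf("multi %u -> page %u -> pg_multixact/offsets/%04X\n",
               (unsigned) multis[i], pageno, segno);
    }
    return 0;
}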

One argument against this idea is that we may not want to keep a full
set of member files on standbys (due to disk space usage), but that's
what will happen unless we truncate during replay.

> I think at least for 9.5+ we should a) invent proper truncation records
> for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> The current hack of scanning the directories to get knowledge we should
> have is a pretty bad hack, and we should not continue using it forever.
> I think we might end up needing to do a) even in the backbranches.

Definitely agree with WAL-logging truncations; also +1 on backpatching
that to 9.3.  We already have experience with adding extra WAL records
on minor releases, and it didn't seem to have bitten too hard.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 1:46 PM, Andres Freund  wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hint at a bigger problem.  Currently,
> to determine where we need to truncate SlruScanDirectory() is
> used. That, afaics, could actually be a problem during recovery when
> we're not consistent.
>
> Consider the scenario where a very large database is copied while
> running. At the start of the backup we'll determine at which checkpoint
> recovery will start and store it in the label. After that the copy will
> start, copying everything slowly. That works because we expect recovery
> to fix things up.  The problem I see WRT multixacts is that the copied
> state of pg_multixact could be wildly different from the one at the
> label's checkpoint. During recovery, before reaching the first
> checkpoint, we'll create multixact files as used at the time of the
> checkpoint. But the rest of pg_multixact may be ahead by 2**31 xacts.

Yes, I think the code in TruncateMultiXact that scans for the earliest
multixact only works when the segment files span at most 2^31 of
multixact space.  If they span more than that, MultiXactIdPrecedes is
no longer able to provide a total ordering, so the result of the scan
may be wrong, depending on the order in which it encounters the files.

Incidentally, your description of that scenario gave me an idea for
how to reproduce a base backup that 9.4.2 or master can't start.  I
tried this first:

1.  Set up with max_wal_senders = 1, wal_level = hot_standby, initdb
2.  Create enough multixacts to fill a couple of segments in
pg_multixact/offsets using "explode_mxact_members 99 1000" (create
foo table first)
3.  Start a base backup with logs, but break in
src/backend/replication/basebackup.c after
sendFileWithContent(BACKUP_LABEL_FILE, labelfile); and before sending
the contents of the data dir (including pg_multixacts)... (or just put
a big sleep in there)
4.  UPDATE pg_database SET datallowconn = true; vacuumdb --freeze
--all; CHECKPOINT; see that offsets/ is now gone and
oldestMultiXid is 98001 in pg_control
5.  ... allow the server backend to continue; the basebackup completes.

Inspecting the new data directory, I see that offsets/ is not
present as expected, and pg_control contains the oldestMultiXid 98001.

Since pg_control was copied after pg_multixacts and my database didn't
move between those copies, it points to a valid multixact (unlike the
pg_upgrade scenario) and is able to start up, but does something
different again which may or may not be good, I'm not sure:

LOG:  database system was interrupted; last known up at 2015-05-30 14:30:23 NZST
LOG:  file "pg_multixact/offsets/" doesn't exist, reading as zeroes
LOG:  redo starts at 0/728
LOG:  consistent recovery state reached at 0/70C8898
LOG:  redo done at 0/70C8898
LOG:  last completed transaction was at log time 2015-05-30 14:30:17.261436+12
LOG:  database system is ready to accept connections

My next theory about how to get a FATAL during startup is something
like this:  Break in basebackup.c in between copying pg_multixact and
copying pg_control (simulating a very large/slow file copy, perhaps if
'base' happens to get copied after 'pg_multixact', though I don't
know if that's possible), and while it's stopped, generate some
offsets segments, vacuum --freeze --all, checkpoint and then create a
few more multixacts, then checkpoint again (so that oldestMultiXact is
not equal to nextMultiXact).  Continue.  Now pg_control's
oldestMultiXactId points at a segment file that didn't exist when
pg_multixact was copied.  I haven't managed to get this to work (ie
produce a FATAL) and I'm out of time for a little while, but wanted to
share this idea in case it helps someone.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 9:46 PM, Andres Freund  wrote:
> On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
>> It seems pretty clear that we can't effectively determine anything
>> about member wraparound until the cluster is consistent.
>
> I wonder if this doesn't actually hint at a bigger problem.  Currently,
> to determine where we need to truncate SlruScanDirectory() is
> used. That, afaics, could actually be a problem during recovery when
> we're not consistent.

I agree.  I actually meant to mention this in my previous email, but,
owing to exhaustion and burnout, didn't.

> I think at least for 9.5+ we should a) invent proper truncation records
> for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
> The current hack of scanning the directories to get knowledge we should
> have is a pretty bad hack, and we should not continue using it forever.
> I think we might end up needing to do a) even in the backbranches.

That may be the right thing to do.  I'm concerned that changing the
behavior of master too much will make every subsequent fix twice as
hard, because we'll have to do one fix in master and another fix in
the back-branches.  I'm also concerned that it will create even more
convoluted failure scenarios. The failure-to-start problem discussed
on this thread requires a chain of four (maybe three) different
PostgreSQL versions in order to create it, and the more things we go
change, the harder it's going to be to reason about this stuff.

The diseased and rotting elephant in the room here is that clusters
with bogus relminmxid, datminmxid, and/or oldestMultiXid values may
exist in the wild and we really have no plan to get rid of them.
78db307bb may have helped somewhat - although I haven't grokked what
it's about well enough to be sure - but it's certainly not a complete
solution, as this bug report itself illustrates rather well.  Unless
we figure out some clever solution that is not now apparent to me, or
impose a hard pg_upgrade compatibility break at some point, we
basically can't count on pg_control's "oldest multixact" information
to be correct ever again.  We may be running into clusters 15 years
from now that have problems that are just holdovers from what was
fixed in 9.3.5.

One thing I think we should definitely do is add one or two additional
fields to pg_controldata that get filled in by pg_upgrade.  One of
them should be "the oldest known catversion in the lineage of this
cluster" and the other should be "the most recent catverson in the
lineage of this cluster before this one".   Or maybe we should store
PG_VERSION_NUM values.  Or store both things.  I think that would make
troubleshooting this kind of problem a lot easier - just from the
pg_controldata output, you'd be able to tell whether the cluster had
been pg_upgraded, whether it had been pg_upgraded once or multiple
times, and at least some of the versions involved, without relying on
the user's memory of what they did and when.  Fortunately, Steve
Kehlet had a pretty clear idea of what his history was, but not all
users know that kind of thing, and I've wanted it more than once while
troubleshooting.

Another thing I think we should do is add a field to pg_class that is
propagated by pg_upgrade and stores the most recent PG_VERSION_NUM
that is known to have performed a scan_all vacuum of the table.  This
would allow us to do things in the future like (a) force a full-table
vacuum of any table that hasn't been vacuumed since $BUGGYRELEASE or
(b) advise users to manually inspect the values and manually perform
said vacuum or (c) only believe that certain information about a table
is accurate if it's been full-scanned by a vacuum newer than
$BUGGYRELEASE.  It could also be used as part of a strategy for
reclaiming HEAP_MOVED_IN/HEAP_MOVED_OFF; e.g. you can't upgrade to
10.5, which repurposes those bits, unless you've done a scan_all
vacuum on every table with a release new enough to guarantee that
they're not used for their historical purpose.

> This problem isn't conflicting with most of the fixes you describe, so
> I'll continue with reviewing those.

Thank you.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 3:08 PM, Robert Haas  wrote:
> It won't fix the fact that pg_upgrade is putting
> a wrong value into everybody's datminmxid field, which should really
> be addressed too, but I've been working on this for about three days
> virtually non-stop and I don't have the energy to tackle it right now.
> If anyone feels the urge to step into that breach, I think what it
> needs to do is: when upgrading from a 9.3-or-later instance, copy over
> each database's datminmxid into the corresponding database in the new
> cluster.

Bruce was kind enough to spend some time on IM with me this afternoon,
and I think this may actually be OK.  What pg_upgrade does is:

1. First, put next-mxid into the relminmxid for all tables, including
catalog tables.  This is the correct behavior for upgrades from a
pre-9.3 release, and is correct for catalog tables in general.

2. Next, restoring the schema dump will set the relminmxid values for
all non-catalog tables to the value dumped from the old cluster.  At
this point, everything is fine provided that we are coming from a
release 9.3 or newer.  But if the old cluster is pre-9.3, it will have
dumped *zero* values for all of its relminmxid values; so all of the
user tables go from the correct value they had after step 1 to an
incorrect value.

3. Finally, if the old cluster is pre-9.3, repeat step 1, undoing the
damage done in step 2.

This is a bit convoluted, but I don't know of a reason why it
shouldn't work.  Sorry for the false alarm.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
> It seems pretty clear that we can't effectively determine anything
> about member wraparound until the cluster is consistent.

I wonder if this doesn't actually hint at a bigger problem.  Currently,
to determine where we need to truncate SlruScanDirectory() is
used. That, afaics, could actually be a problem during recovery when
we're not consistent.

Consider the scenario where a very large database is copied while
running. At the start of the backup we'll determine at which checkpoint
recovery will start and store it in the label. After that the copy will
start, copying everything slowly. That works because we expect recovery
to fix things up.  The problem I see WRT multixacts is that the copied
state of pg_multixact could be wildly different from the one at the
label's checkpoint. During recovery, before reaching the first
checkpoint, we'll create multixact files as used at the time of the
checkpoint. But the rest of pg_multixact may be ahead by 2**31 xacts.

For clog and such that's not a problem because the truncation
etc. points are all stored in WAL - during recovery we just replay the
truncations that happened on the master, there's no need to look at the
data directory. And we won't access the clog before being consistent.

But for multixacts it's different. To avoid ending up with overfull
pg_multixact/*/* directories we have to do truncations during
recovery. As there are currently no truncation records, we have to do that
by scanning the data directory. But that state could be "from the future".

I considered for a second whether the solution for that could be to not
truncate while inconsistent - but I think that doesn't solve anything as
then we can end up with directories where every single offsets/member
file exists.  We could possibly try to fix that by always truncating
away slru segments in offsets that we know to be too old to exist in a
valid database. But achieving the same for members fries my brain.  It
also seems awfully risky.

I think at least for 9.5+ we should a) invent proper truncation records
for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
The current hack of scanning the directories to get knowledge we should
have is a pretty bad hack, and we should not continue using it forever.
I think we might end up needing to do a) even in the backbranches.


Am I missing something?


This problem isn't conflicting with most of the fixes you describe, so
I'll continue with reviewing those.


Greetings,

Andres Freund


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote:
> I think we need to step back and look at the brain power required to
> unravel the mess we have made regarding multi-xact and fixes.  (I bet
> few people can even remember which multi-xact fixes went into which
> releases --- I can't.)  Instead of working on actual features, we are
> having to do this complex diagnosis because we didn't do a thorough
> analysis at the time a pattern of multi-xact bugs started to appear. 
> Many projects deal with this compound bug debt regularly, but we have
> mostly avoided this fate.

What are the consequences of that observation you're imagining?


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Steve Kehlet
On Fri, May 29, 2015 at 12:08 PM Robert Haas  wrote:

> OK, here's a patch.
>

I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific
patch. I rebuilt, installed, and postgres started up successfully!  I did a
bunch of checks, had our app run several thousand SQL queries against it,
had a colleague check it out, and it looks good. Looking at top and ps, I
don't see anything funny (e.g. no processes spinning cpu, etc), things look
normal. Let me know if I can provide anything else.


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Bruce Momjian
On Thu, May 28, 2015 at 07:24:26PM -0400, Robert Haas wrote:
> On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake  wrote:
> > FTR: Robert, you have been a Samurai on this issue. Our many thanks.
> 
> Thanks!  I really appreciate the kind words.
> 
> So, in thinking through this situation further, it seems to me that
> the situation is pretty dire:
> 
> 1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
> or datminmxid values which are 1 instead of the correct value.
> Setting the value to 1 was too far in the past if your MXID counter is
> < 2B, and too far in the future if your MXID counter is > 2B.
> 
> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
> values which are equal to the next-mxid counter instead of the correct
> value; in other words, they are too new.
> 
> 3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
> have the first problem for tables in template databases, and the
> second one for the rest. (See 866f3017a.)

I think we need to step back and look at the brain power required to
unravel the mess we have made regarding multi-xact and fixes.  (I bet
few people can even remember which multi-xact fixes went into which
releases --- I can't.)  Instead of working on actual features, we are
having to do this complex diagnosis because we didn't do a thorough
analysis at the time a pattern of multi-xact bugs started to appear. 
Many projects deal with this compound bug debt regularly, but we have
mostly avoided this fate.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

  + Everyone has their own god. +


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 12:43 PM, Robert Haas  wrote:
> Working on that now.

OK, here's a patch.  Actually two patches, differing only in
whitespace, for 9.3 and for master (ha!).  I now think that the root
of the problem here is that DetermineSafeOldestOffset() and
SetMultiXactIdLimit() were largely ignorant of the possibility that
they might be called at points in time when the cluster was
inconsistent.  SetMultiXactIdLimit() bracketed certain parts of its
logic with if (!InRecovery), but those guards were ineffective because
it gets called before InRecovery is set in the first place.

It seems pretty clear that we can't effectively determine anything
about member wraparound until the cluster is consistent.  Before then,
there might be files missing from the offsets or members SLRUs which
get put back during replay.  There could even be gaps in the sequence
of files, with some things having made it to disk before the crash (or
having made it into the backup) and others not.  So all the work of
determining what the safe stop points and vacuum thresholds for
members are needs to be postponed until TrimMultiXact() time.  And
that's fine, because we don't need this information in recovery anyway
- it only affects behavior in normal running.

So this patch does the following:

1. Moves the call to DetermineSafeOldestOffset() that appears in
StartupMultiXact() to TrimMultiXact(), so that we don't try to do this
until we're consistent.  Also, instead of passing
MultiXactState->oldestMultiXactId, pass the newer of that value and
the earliest offset that exists on disk.  That way, it won't try to
read data that's not there.  Note that the second call to
DetermineSafeOldestOffset() in TruncateMultiXact() doesn't need a
similar guard, because we already bail out of that function early if
the multixacts we're going to truncate away don't exist.

2. Adds a new flag MultiXactState->didTrimMultiXact to indicate whether
we've finished TrimMultiXact(), and arranges for SetMultiXactIdLimit()
to use that rather than InRecovery to test whether it's safe to do
complicated things that might require that the cluster is consistent.
This is a slight behavior change, since formerly we would have tried
to do that stuff very early in the startup process, and now it won't
happen until somebody completes a vacuum operation.  If that's a
problem, we could consider doing it in TrimMultiXact(), but I don't
think it's safe the way it was.  The new flag also prevents
oldestOffset from being set while in recovery; I think it would be
safe to do that in recovery once we've reached consistency, but I
don't believe it's necessary.

3. Arranges for TrimMultiXact() to set oldestOffset.  This is
necessary because, previously, we relied on SetMultiXactIdLimit doing
that during early startup or during recovery, and that's no longer
true.  Here too we set oldestOffset keeping in mind that our notion of
the oldest multixact may point to something that doesn't exist; if so,
we use the oldest MXID that does.

4. Modifies TruncateMultiXact() so that it doesn't re-scan the SLRU
directory on every call to find the oldest file that exists.  Instead,
it arranges to remember the value from the first scan and then updates
it thereafter to reflect its own truncation activity.  This isn't
absolutely necessary, but because this oldest-file logic is used in
multiple places (TrimMultiXact, SetMultiXactIdLimit, and
TruncateMultiXact all need it directly or indirectly) caching the
value seems like a better idea than recomputing it frequently.
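
To make point 4 concrete, here's a minimal standalone sketch of the
scan-once-and-cache idea -- not the patch itself; the function name and
the static cache are simplified stand-ins for what the server would
keep in shared state:

#include <dirent.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* one-shot cache; the real patch remembers this across calls instead */
static bool oldest_segment_known = false;
static long oldest_segment = -1;

/*
 * Return the lowest-numbered segment file in an SLRU directory
 * (segment names are hex numbers, e.g. "0000", "000B"), scanning the
 * directory only on the first call and serving the cached answer
 * afterwards.  Returns -1 if the directory can't be read or is empty.
 */
static long
oldest_slru_segment(const char *dir)
{
    DIR        *d;
    struct dirent *de;

    if (oldest_segment_known)
        return oldest_segment;

    if ((d = opendir(dir)) == NULL)
        return -1;

    while ((de = readdir(d)) != NULL)
    {
        char   *end;
        long    seg = strtol(de->d_name, &end, 16);

        /* ignore ".", "..", and anything that isn't all hex digits */
        if (end == de->d_name || *end != '\0')
            continue;
        if (oldest_segment < 0 || seg < oldest_segment)
            oldest_segment = seg;
    }
    closedir(d);

    oldest_segment_known = true;
    return oldest_segment;
}

int
main(void)
{
    printf("oldest offsets segment: %ld\n",
           oldest_slru_segment("pg_multixact/offsets"));
    return 0;
}

Truncation then just updates the cached value as it unlinks segments,
instead of forcing another directory scan.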

I have tested that this patch fixes Steve Kehlet's problem, or at
least what I believe to be Steve Kehlet's problem based on the
reproduction scenario I described upthread.  I believe it will also
fix the problems with starting up from a base backup with Alvaro
mentioned upthread.  It won't fix the fact that pg_upgrade is putting
a wrong value into everybody's datminmxid field, which should really
be addressed too, but I've been working on this for about three days
virtually non-stop and I don't have the energy to tackle it right now.
If anyone feels the urge to step into that breach, I think what it
needs to do is: when upgrading from a 9.3-or-later instance, copy over
each database's datminmxid into the corresponding database in the new
cluster.
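
For concreteness, a rough sketch of that fix, written in the style of
pg_upgrade's existing executeQueryOrDie() calls (the current blanket
UPDATE is quoted later in this thread); the helper name, connection
arguments, and the lack of proper identifier escaping are
simplifications, not the actual patch:

/*
 * Hypothetical sketch: when the old cluster is 9.3 or later, copy each
 * database's datminmxid verbatim instead of overwriting every row with
 * a single blanket value.  Real code would need to escape datname and
 * run early enough that the new cluster's values are still unused.
 */
static void
copy_old_datminmxid(PGconn *old_conn, PGconn *new_conn)
{
    PGresult   *res;
    int         i;

    /* fetch the correct value for every database in the old cluster */
    res = executeQueryOrDie(old_conn,
                            "SELECT datname, datminmxid "
                            "FROM pg_catalog.pg_database");

    for (i = 0; i < PQntuples(res); i++)
        PQclear(executeQueryOrDie(new_conn,
                                  "UPDATE pg_catalog.pg_database "
                                  "SET datminmxid = '%s' "
                                  "WHERE datname = '%s'",
                                  PQgetvalue(res, i, 1),
                                  PQgetvalue(res, i, 0)));

    PQclear(res);
}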

Aside from that, it's very possible that despite my best efforts this
has serious bugs.  Review and testing would be very much appreciated.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 699497c..8d28a5c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -197,8 +197,9 @@ typedef struct MultiXactStateData
 	MultiXactOffset nextOffset;
 
 	/*
-	 * Oldest multixact that is still on disk.  Anything older than this
-	 * should not be consulted.  These values are updated by vacuum.
+	 * Old

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 10:17 AM, Tom Lane  wrote:
> Thomas Munro  writes:
>> On Fri, May 29, 2015 at 11:24 AM, Robert Haas  wrote:
>>> B. We need to change find_multixact_start() to fail softly.
>
>> Here is an experimental WIP patch that changes StartupMultiXact and
>> SetMultiXactIdLimit to find the oldest multixact that exists on disk
>> (by scanning the directory), and uses that if it is more recent than
>> the oldestMultiXactId from shmem,
>
> Not sure about the details of this patch, but I was planning to propose
> what I think is the same thing: the way to make find_multixact_start()
> fail softly is to have it find the oldest actually existing file if the
> one that should be there isn't.

Working on that now.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Tom Lane
Thomas Munro  writes:
> On Fri, May 29, 2015 at 11:24 AM, Robert Haas  wrote:
>> B. We need to change find_multixact_start() to fail softly.

> Here is an experimental WIP patch that changes StartupMultiXact and
> SetMultiXactIdLimit to find the oldest multixact that exists on disk
> (by scanning the directory), and uses that if it is more recent than
> the oldestMultiXactId from shmem,

Not sure about the details of this patch, but I was planning to propose
what I think is the same thing: the way to make find_multixact_start()
fail softly is to have it find the oldest actually existing file if the
one that should be there isn't.

This would preserve the idea that we aren't trusting the multixact
tracking data to be completely right, which was the point of 78db307bb
and which evidently is still essential.  It would also obviate the need
for other places to contain similar logic.

regards, tom lane


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Fri, May 29, 2015 at 11:24 AM, Robert Haas  wrote:
> A. Most obviously, we should fix pg_upgrade so that it installs
> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
> that we stop creating new instances of this problem.  That won't get
> us out of the hole we've dug for ourselves, but we can at least try to
> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
> wrong thing - anyone want to double-check me on that one?)

Yes, it seems like this could lead to truncation of multixacts still
referenced by tuples, leading to errors when updating, locking,
vacuuming.  Why don't we have reports of that?

> B. We need to change find_multixact_start() to fail softly.  This is
> important because it's legitimate for it to fail in recovery, as
> discussed upthread, and also because we probably want to eliminate the
> fail-to-start hazard introduced in 9.4.2 and 9.3.7.
> find_multixact_start() is used in three places, and they each require
> separate handling:

Here is an experimental WIP patch that changes StartupMultiXact and
SetMultiXactIdLimit to find the oldest multixact that exists on disk
(by scanning the directory), and uses that if it is more recent than
the oldestMultiXactId from shmem, when calling
DetermineSafeOldestOffset.  I'm not all that happy with it, see below,
but let me know what you think.

Using unpatched master, I reproduced the startup error with a bit of a
short cut:

1.  initdb, generate enough multixacts to get more than one offsets file
2.  ALTER DATABASE template0 ALLOW_CONNECTIONS = true;, vacuumdb
--freeze --all, CHECKPOINT
3.  verify that pg_control now holds a large oldestMultiXactId, and
note NextMultiXactId
4.  shutdown, pg_resetxlog -m (NextMultiXactId from 3),1 pg_data
5.  start up: fails

Apply this patch, and it starts up successfully.

What are the repro steps for the replay problem?  Is a basebackup of a
large database undergoing truncation and some good timing needed?

> - In SetMultiXactIdLimit, find_multixact_start() is used to set
> MultiXactState->oldestOffset, which is used to determine how
> aggressively to vacuum.  If find_multixact_start() fails, we don't
> know how aggressively we need to vacuum to prevent members wraparound;
> it's probably best to decide to vacuum as aggressively as possible.
> Of course, if we're in recovery, we won't vacuum either way; the fact
> that it fails softly is good enough.

Isn't it enough to use the start offset for the most recent of the
oldest multixact ID and the oldest multixact found by scanning
pg_multixact/offsets?  In this patch, it does that, but I'm not happy
with the time the work is done, it just doesn't seem right for
SetMultiXactIdLimit to be scanning that directory.  The result of that
operation should only change when files have been truncated anyway,
and the truncate code was already doing a filesystem scan.  Maybe the
truncate code should store the earliest multixact ID found on disk in
shared memory, so that SetMultiXactIdLimit can use it for free.  I
tried to get that working but couldn't figure out where it should be
initialised -- StartupMultiXact is too late (StartupXLOG calls
SetMultiXactIdLimit before that), but BootstrapMultiXact and
MultiXactShmemInit didn't seem like the right places either.

> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
> to refuse multixact creation to prevent wraparound.  Again, in
> recovery, that's fine.  If it happens in normal running, it's not
> clear what to do.  Refusing multixact creation is an awfully blunt
> instrument.  Maybe we can scan pg_multixact/offsets to determine a
> workable stop limit: the first file greater than the current file that
> exists, minus two segments, is a good stop point.  Perhaps we ought to
> use this mechanism here categorically, not just when
> find_multixact_start() fails.  It might be more robust than what we
> have now.

Done in this patch -- the truncate code calls
DetermineSafeOldestOffset with the earliest SLRU found by scanning if
that's more recent than the shmem value, and then
DetermineSafeOldestOffset applies the step-back-one-whole-segment
logic to that as before.
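
The step-back arithmetic itself is small enough to show standalone.  A
sketch, assuming the stock 8kB block size (409 member groups of 4
members per page, 32 pages per segment); the constants and function
name are illustrative, not the server code:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactOffset;

/* members/ layout with 8kB pages: 409 groups x 4 members, 32 pages/seg */
#define MULTIXACT_MEMBERS_PER_PAGE  (409 * 4)
#define SLRU_PAGES_PER_SEGMENT      32
#define MEMBERS_PER_SEGMENT \
    ((MultiXactOffset) MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)

/*
 * Round the oldest offset we must keep down to its segment boundary,
 * then step back one whole segment: new member creation must stop
 * there, so a full segment always separates the head of the member
 * space from the data we still need.  Unsigned arithmetic makes the
 * wraparound case fall out naturally.
 */
static MultiXactOffset
offset_stop_limit(MultiXactOffset oldestOffset)
{
    MultiXactOffset seg_start = oldestOffset - (oldestOffset % MEMBERS_PER_SEGMENT);

    return seg_start - MEMBERS_PER_SEGMENT;
}

int
main(void)
{
    MultiXactOffset oldest = 3 * MEMBERS_PER_SEGMENT + 12345;

    printf("stop at %u (oldest needed: %u)\n",
           offset_stop_limit(oldest), oldest);
    return 0;
}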

> - In TruncateMultiXact, find_multixact_start() is used to set the
> truncation point for the members SLRU.  If it fails here, I'm guessing
> the right solution is not to truncate anything - instead, rely on
> intense vacuuming to eventually advance oldestMXact to a value whose
> member data still exists; truncate then.

TruncateMultiXact already contained logic to do nothing at all if
oldestMXact is older than the earliest it can find on disk.  I moved
that code into find_earliest_multixact_on_disk() to be able to use it
elsewhere too, in this patch.

> C. I think we should also change TruncateMultiXact() to truncate
> offsets first, and then members.  As things stand, if we truncate
> members first, we increase the risk of seeing an offset that will fail
> when passed to find_multixact_start(), because TruncateMultiXact()
> might get interrupted before it finishes.  That seems like an
> unnecessary risk.

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera
 wrote:
>> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
>> values which are equal to the next-mxid counter instead of the correct
>> value; in other words, they are too new.
>
> [ discussion of how the control file's oldestMultiXid gets set ]

I'm talking about the datminmxid in pg_database.  You're talking about
the contents of pg_control.  Those are two different things.  The
relevant code is not what you quote, but rather this:

/* set pg_database.datminmxid */
PQclear(executeQueryOrDie(conn_template1,
                          "UPDATE pg_catalog.pg_database "
                          "SET datminmxid = '%u'",
                          old_cluster.controldata.chkpnt_nxtmulti));

Tom previously observed this to be wrong, here:

http://www.postgresql.org/message-id/9879.1405877...@sss.pgh.pa.us

Although Tom was correct to note that it's wrong, nothing ever got fixed.  :-(

>> A. Most obviously, we should fix pg_upgrade so that it installs
>> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
>> that we stop creating new instances of this problem.  That won't get
>> us out of the hole we've dug for ourselves, but we can at least try to
>> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
>> wrong thing - anyone want to double-check me on that one?)
>
> I don't think there's anything that we need to fix here.

I see your followup now agreeing this is broken.  Since I wrote the
previous email, I've had two new ideas that I think are both better
than the above.

1. Figure out the oldest multixact offset that actually exists in
pg_multixact/offsets, and use that value.  If any older MXIDs still
exist, they won't be able to be looked up anyway, so if they wrap
around, it doesn't matter.  The only value that needs to be reliable
in order to do this is pg_controldata's NextMultiXactId, which to the
best of my knowledge is not implicated in any of these bugs.
pg_upgrade can check that the offsets file containing that value
exists, and if not bail out.  Then, start stepping backwards a file at
a time.  When it hits a missing file, the first multixact in the next
file is a safe value of datminmxid for every database in the new
cluster.  If older MXIDs exist, they're unreadable anyway, so if they
wrap, nothing lost.  If the value is older than necessary, the first
vacuum in each database will fix it.  We have to be careful: if we
step back too many files, such that our proposed datminmxid might
wrap, then we've got a confusing situation and had better bail out -
or at least think really carefully about what to do.  (A sketch of
this backwards scan appears below, after idea 2.)

2. When we're upgrading from a version 9.3 or higher, copy the EXACT
datminmxid from each old database to the corresponding new database.
This seems like it ought to be really simple.
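
Here's the sketch promised above: idea 1's backwards scan as a
standalone program.  The layout constants assume stock 8kB pages
(2048 offsets per page, 32 pages per segment), the file naming and
wraparound handling are simplified, and none of this is the eventual
pg_upgrade code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/stat.h>

typedef uint32_t MultiXactId;

/* offsets/ layout with 8kB pages: 2048 entries/page, 32 pages/segment */
#define MULTIXACTS_PER_SEGMENT  (2048 * 32)

static bool
offsets_segment_exists(long seg)
{
    char        path[64];
    struct stat st;

    snprintf(path, sizeof(path), "pg_multixact/offsets/%04lX", seg);
    return stat(path, &st) == 0;
}

/*
 * Starting from the segment holding NextMultiXactId (whose existence
 * pg_upgrade should verify first, bailing out if it's missing), step
 * backwards one file at a time; the first multixact of the segment
 * just after the gap is a safe datminmxid for every database.  Real
 * code must also bail out if it steps back so far that the proposed
 * value could wrap.
 */
static MultiXactId
safe_datminmxid(MultiXactId next_mxid)
{
    long        seg = next_mxid / MULTIXACTS_PER_SEGMENT;

    while (seg > 0 && offsets_segment_exists(seg - 1))
        seg--;
    return (MultiXactId) (seg * MULTIXACTS_PER_SEGMENT);
}

int
main(void)
{
    printf("safe datminmxid: %u\n", safe_datminmxid(123456789));
    return 0;
}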

>> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
>> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
>> to refuse multixact creation to prevent wraparound.  Again, in
>> recovery, that's fine.  If it happens in normal running, it's not
>> clear what to do.  Refusing multixact creation is an awfully blunt
>> instrument.  Maybe we can scan pg_multixact/offsets to determine a
>> workable stop limit: the first file greater than the current file that
>> exists, minus two segments, is a good stop point.  Perhaps we ought to
>> use this mechanism here categorically, not just when
>> find_multixact_start() fails.  It might be more robust than what we
>> have now.
>
> Blunt instruments have the desirable property of being simple.  We don't
> want any more clockwork here, I think --- this stuff is pretty
> complicated already.  As far as I understand, if during normal running
> we see that find_multixact_start has failed, sufficient vacuuming should
> get it straight eventually with no loss of data.

Unfortunately, I don't believe that to be true.  If
find_multixact_start() fails, we have no idea how close we are to the
member wraparound point.  Sure, we can start vacuuming, but the user
can be creating new, large multixacts at top speed while we're doing
that, which could cause us to wrap around before we can finish
vacuuming.

Furthermore, if we adopted the blunt instrument, users who are in this
situation would update to 9.4.3 (or whenever these fixes get released)
and find that they can't create new MXIDs for a possibly very
protracted period of time.  That amounts to an outage for which users
won't thank us.

Looking at the files in the directory seems pretty simple in this
case, and quite a bit more fail-safe than what we're doing right now.
The current logic purports to leave a one-file gap in the member
space, but there's no guarantee that the gap really exists on disk the
way we think it does.  With this approach, we can be certain that
there is a gap.  And that is a darned good thing to be certain about.
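
A sketch of that certain-gap scan over pg_multixact/members: walk
forward from the segment currently being written to the next file that
actually exists, then back off two segments.  The names are
illustrative, the max-segment figure in the example assumes the stock
8kB layout, and real code would wrap around the segment space instead
of giving up at the end of the range:

#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

static bool
members_segment_exists(long seg)
{
    char        path[64];
    struct stat st;

    snprintf(path, sizeof(path), "pg_multixact/members/%04lX", seg);
    return stat(path, &st) == 0;
}

/*
 * Find the first segment file beyond the one we're currently writing
 * that still exists on disk; two segments short of it is a stop point
 * whose gap is guaranteed to exist, whatever the bookkeeping says.
 * Returns -1 if no later segment exists (nothing to avoid yet).  Wrap
 * handling, including "seg - 2" running off the low end, is omitted
 * here but required in a real implementation.
 */
static long
members_stop_segment(long current_seg, long max_seg)
{
    long        seg;

    for (seg = current_seg + 1; seg <= max_seg; seg++)
    {
        if (members_segment_exists(seg))
            return seg - 2;
    }
    return -1;
}

int
main(void)
{
    /* e.g. currently writing segment 7, member space ends at 0x14078 */
    printf("stop segment: %ld\n", members_stop_segment(7, 0x14078));
    return 0;
}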

>> C. 

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Alvaro Herrera wrote:
> Robert Haas wrote:
> 
> > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
> > values which are equal to the next-mxid counter instead of the correct
> > value; in other words, they are too new.
> 
> What you describe is what happens if you upgrade from 9.2 or earlier.

Oh, you're referring to pg_database values, not the ones in pg_control.
Ugh :-(  This invalidates my argument that there's nothing to fix,
obviously ... it's clearly broken as is.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote:

> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
> values which are equal to the next-mxid counter instead of the correct
> value; in other words, they are too new.

What you describe is what happens if you upgrade from 9.2 or earlier.
For this case we use this call:

exec_prog(UTILITY_LOG_FILE, NULL, true,
  "\"%s/pg_resetxlog\" -m %u,%u \"%s\"",
  new_cluster.bindir,
  old_cluster.controldata.chkpnt_nxtmulti + 1,
  old_cluster.controldata.chkpnt_nxtmulti,
  new_cluster.pgdata);

This uses the old cluster's nextMulti value as oldestMulti in the new
cluster, and that value+1 is used as nextMulti.  This is correct: we
don't want to preserve any of the multixact state from the previous
cluster; anything before that value can be truncated with no loss of
critical data.  In fact, there is no critical data before that value at
all.

If you upgrade from 9.3, this other call is used instead:

/*
 * we preserve all files and contents, so we must preserve both "next"
 * counters here and the oldest multi present on system.
 */
exec_prog(UTILITY_LOG_FILE, NULL, true,
  "\"%s/pg_resetxlog\" -O %u -m %u,%u \"%s\"",
  new_cluster.bindir,
  old_cluster.controldata.chkpnt_nxtmxoff,
  old_cluster.controldata.chkpnt_nxtmulti,
  old_cluster.controldata.chkpnt_oldstMulti,
  new_cluster.pgdata);

In this case we use the oldestMulti from the old cluster as oldestMulti
in the new cluster, which is also correct.


> A. Most obviously, we should fix pg_upgrade so that it installs
> chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
> that we stop creating new instances of this problem.  That won't get
> us out of the hole we've dug for ourselves, but we can at least try to
> stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
> wrong thing - anyone want to double-check me on that one?)

I don't think there's anything that we need to fix here.


> B. We need to change find_multixact_start() to fail softly.  This is
> important because it's legitimate for it to fail in recovery, as
> discussed upthread, and also because we probably want to eliminate the
> fail-to-start hazard introduced in 9.4.2 and 9.3.7.
> find_multixact_start() is used in three places, and they each require
> separate handling:
> 
> - In SetMultiXactIdLimit, find_multixact_start() is used to set
> MultiXactState->oldestOffset, which is used to determine how
> aggressively to vacuum.  If find_multixact_start() fails, we don't
> know how aggressively we need to vacuum to prevent members wraparound;
> it's probably best to decide to vacuum as aggressively as possible.
> Of course, if we're in recovery, we won't vacuum either way; the fact
> that it fails softly is good enough.

Sounds good.

> - In DetermineSafeOldestOffset, find_multixact_start() is used to set
> MultiXactState->offsetStopLimit.  If it fails here, we don't know when
> to refuse multixact creation to prevent wraparound.  Again, in
> recovery, that's fine.  If it happens in normal running, it's not
> clear what to do.  Refusing multixact creation is an awfully blunt
> instrument.  Maybe we can scan pg_multixact/offsets to determine a
> workable stop limit: the first file greater than the current file that
> exists, minus two segments, is a good stop point.  Perhaps we ought to
> use this mechanism here categorically, not just when
> find_multixact_start() fails.  It might be more robust than what we
> have now.

Blunt instruments have the desirable property of being simple.  We don't
want any more clockwork here, I think --- this stuff is pretty
complicated already.  As far as I understand, if during normal running
we see that find_multixact_start has failed, sufficient vacuuming should
get it straight eventually with no loss of data.

> - In TruncateMultiXact, find_multixact_start() is used to set the
> truncation point for the members SLRU.  If it fails here, I'm guessing
> the right solution is not to truncate anything - instead, rely on
> intense vacuuming to eventually advance oldestMXact to a value whose
> member data still exists; truncate then.

Fine.

> C. I think we should also change TruncateMultiXact() to truncate
> offsets first, and then members.  As things stand, if we truncate
> members first, we increase the risk of seeing an offset that will fail
> when passed to find_multixact_start(), because TruncateMultiXact()
> might get interrupted before it finishes.  That seems like an
> unnecessary risk.

Not sure about this point.  We did it the way you propose previously,
and found it to be a problem because sometimes we tried to read an
offset file that was no longer there.  Do we really read member files
anywhere?  I thought we only tried to read offset files.  If we remove
member files, wha

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake  wrote:
> FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Thanks!  I really appreciate the kind words.

So, in thinking through this situation further, it seems to me that
the situation is pretty dire:

1. If you pg_upgrade to 9.3 before 9.3.5, then you may have relminmxid
or datminmxid values which are 1 instead of the correct value.
Setting the value to 1 was too far in the past if your MXID counter is
< 2B, and too far in the future if your MXID counter is > 2B.

2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid
values which are equal to the next-mxid counter instead of the correct
value; in other words, they are too new.

3. If you pg_upgrade to 9.3.5, 9.3.6, 9.4.0, or 9.4.1, then you will
have the first problem for tables in template databases, and the
second one for the rest. (See 866f3017a.)

4. Wrong relminmxid or datminmxid values can eventually propagate into
the control file, as demonstrated in my previous post.  Therefore, we
can't count on relminmxid to be correct, we can't count on datminmxid
to be correct, and we can't count on the control file to be correct.
That's a sack of sad.

5. If the values are too far in the past, then nothing really terrible
will happen unless you upgrade to 9.3.7 or 9.4.2, at which point the
system will refuse to start.  Forcing a VACUUM FREEZE on every
database, including the unconnectable ones, should fix this and allow
you to upgrade safely - which you want to do, because 9.3.7 and 9.4.2
fix a different set of multixact data loss bugs.

6. If the values are too far in the future, the system may fail to
prevent wraparound, leading to data loss.  I am not totally clear on
whether a VACUUM FREEZE will fix this problem.  It seems like the
chances are better if you are running at least 9.3.5+ or 9.4.X,
because of 78db307bb.  But I'm not sure how complete a fix that is.

So what do we do about this?  I have a few ideas:

A. Most obviously, we should fix pg_upgrade so that it installs
chkpnt_oldstMulti instead of chkpnt_nxtmulti into datminmxid, so
that we stop creating new instances of this problem.  That won't get
us out of the hole we've dug for ourselves, but we can at least try to
stop digging.  (This is assuming I'm right that chkpnt_nxtmulti is the
wrong thing - anyone want to double-check me on that one?)

B. We need to change find_multixact_start() to fail softly.  This is
important because it's legitimate for it to fail in recovery, as
discussed upthread, and also because we probably want to eliminate the
fail-to-start hazard introduced in 9.4.2 and 9.3.7.
find_multixact_start() is used in three places, and they each require
separate handling:

- In SetMultiXactIdLimit, find_multixact_start() is used to set
MultiXactState->oldestOffset, which is used to determine how
aggressively to vacuum.  If find_multixact_start() fails, we don't
know how aggressively we need to vacuum to prevent members wraparound;
it's probably best to decide to vacuum as aggressively as possible.
Of course, if we're in recovery, we won't vacuum either way; the fact
that it fails softly is good enough.

- In DetermineSafeOldestOffset, find_multixact_start() is used to set
MultiXactState->offsetStopLimit.  If it fails here, we don't know when
to refuse multixact creation to prevent wraparound.  Again, in
recovery, that's fine.  If it happens in normal running, it's not
clear what to do.  Refusing multixact creation is an awfully blunt
instrument.  Maybe we can scan pg_multixact/offsets to determine a
workable stop limit: the first file greater than the current file that
exists, minus two segments, is a good stop point.  Perhaps we ought to
use this mechanism here categorically, not just when
find_multixact_start() fails.  It might be more robust than what we
have now.

- In TruncateMultiXact, find_multixact_start() is used to set the
truncation point for the members SLRU.  If it fails here, I'm guessing
the right solution is not to truncate anything - instead, rely on
intense vacuuming to eventually advance oldestMXact to a value whose
member data still exists; truncate then.

C. I think we should also change TruncateMultiXact() to truncate
offsets first, and then members.  As things stand, if we truncate
members first, we increase the risk of seeing an offset that will fail
when passed to find_multixact_start(), because TruncateMultiXact()
might get interrupted before it finishes.  That seems like an
unnecessary risk.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Thomas Munro
On Fri, May 29, 2015 at 7:56 AM, Robert Haas  wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas  wrote:
>> [ speculation ]
>
> [...]  However, since
> the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
> which will call SetMultiXactIdLimit, which will propagate the bogus
> datminmxid = 1 setting into shared memory.

Ah!

> [...]
>
> - There's a third possible problem related to boundary cases in
> SlruScanDirCbRemoveMembers, but I don't understand that one well
> enough to explain it.  Maybe Thomas can jump in here and explain the
> concern.

I noticed something in passing which is probably not harmful, and not
relevant to this bug report, it was just a bit confusing while
testing:  SlruScanDirCbRemoveMembers never deletes any files if
rangeStart == rangeEnd.  In practice, if you have an idle cluster with
a lot of multixact data and you VACUUM FREEZE all databases and then
CHECKPOINT, you might be surprised to see no member files going away
quite yet, but they'll eventually be truncated by a future checkpoint,
once rangeEnd has had a chance to advance to the next page due to more
multixacts being created.

If we want to fix this one day, maybe the right thing to do is to
treat the rangeStart == rangeEnd case the same way we treat rangeStart
< rangeEnd, that is, to assume that the range of pages isn't
wrapped/inverted in this case.  Although we don't have the actual
start and end offset values to compare here, we know that for them to
fall on the same page, the start offset index must be <= the end
offset index (since we added the new error to prevent member space
wrapping, we never allow the end to get close enough to the start to
fall on the same page).  Like this (not tested):

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 9568ff1..4d0bcc4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2755,7 +2755,7 @@ SlruScanDirCbRemoveMembers(SlruCtl ctl, char *filename, int segpage,
  /* Recheck the deletion condition.  If it still holds, perform deletion */
  if ((range->rangeStart > range->rangeEnd &&
  segpage > range->rangeEnd && segpage < range->rangeStart) ||
- (range->rangeStart < range->rangeEnd &&
+ (range->rangeStart <= range->rangeEnd &&
  (segpage < range->rangeStart || segpage > range->rangeEnd)))
  SlruDeleteSegment(ctl, filename);
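
For anyone wanting to convince themselves about the rangeStart ==
rangeEnd case, the predicate is small enough to test standalone; this
is a simplified rendering with the proposed "<=", not the server code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Deletion predicate from SlruScanDirCbRemoveMembers, simplified, with
 * "<=" so that a non-inverted range whose start and end fall on the
 * same page still deletes everything outside it.
 */
static bool
segment_is_deletable(int segpage, int rangeStart, int rangeEnd)
{
    if (rangeStart > rangeEnd)          /* wrapped (inverted) range */
        return segpage > rangeEnd && segpage < rangeStart;
    /* non-wrapped range, including rangeStart == rangeEnd */
    return segpage < rangeStart || segpage > rangeEnd;
}

int
main(void)
{
    /* with rangeStart == rangeEnd == 10, page 5 is now deletable */
    printf("%d\n", segment_is_deletable(5, 10, 10));    /* prints 1 */
    printf("%d\n", segment_is_deletable(10, 10, 10));   /* prints 0 */
    return 0;
}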

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote:
> On Thu, May 28, 2015 at 8:51 AM, Robert Haas  wrote:
> > [ speculation ]
> 
> OK, I finally managed to reproduce this, after some off-list help from
> Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
> do it:

It's a long list of steps, but if you consider them carefully, it
becomes clear that they are natural steps that a normal production
system would go through -- essentially the only one that's really
time-critical is the decision to pg_upgrade with a version before 9.3.5.

> In the process of investigating this, we found a few other things that
> seem like they may also be bugs:
> 
> - As noted upthread, replaying an older checkpoint after a newer
> checkpoint has already happened may lead to similar problems.  This
> may be possible when starting from an online base backup; or when
> restarting a standby that did not perform a restartpoint when
> replaying the last checkpoint before the shutdown.

I'm going through this one now, as it's closely related and caused
issues for us.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Joshua D. Drake


On 05/28/2015 12:56 PM, Robert Haas wrote:




FTR: Robert, you have been a Samurai on this issue. Our many thanks.

Sincerely,

jD


--
Command Prompt, Inc. - http://www.commandprompt.com/  503-667-4564
PostgreSQL Centered full stack support, consulting and development.
Announcing "I'm offended" is basically telling the world you can't
control your own emotions, so everyone else should do it for you.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:51 AM, Robert Haas  wrote:
> [ speculation ]

OK, I finally managed to reproduce this, after some off-list help from
Steve Kehlet (the reporter), Alvaro, and Thomas Munro.  Here's how to
do it:

1. Install any pre-9.3 version of the server and generate enough
multixacts to create at least TWO new segments.  When you shut down
the server, all segments except for the most current one will be
removed.  At this point, the only thing in
$PGDATA/pg_multixact/offsets should be a single file, and the name of
that file should not be 0000 or 0001.

2. Use pg_upgrade to upgrade to 9.3.4.  It is possible that versions <
9.3.4 will also work here, but you must not use 9.3.5 or higher,
because 9.3.5 includes Bruce's commit 3d2e18510, which arranged to
preserve relminmxid and datminmxid values.   At this point,
pg_controldata on the new cluster should show an oldestMultiXid value
greater than 1 (copied from the old cluster), but all the datminmxid
values are 1.  Also, initdb will have left behind a bogus 0000 file in
pg_multixact/offsets.

3. Move to 9.3.5 (or 9.3.6), not via pg_upgrade, but just by dropping
in the new binaries.  Follow the instructions in the 9.3.5 release
notes; since you created at least TWO new segments in step one, there
will be no 0001 file, and the query there will say that you should
remove the bogus 0000 file, leaving just the good file in
pg_multixact/offsets.  At this point, pg_multixact/offsets is OK, and
pg_controldata still says that oldestMultiXid > 1, so that is also OK.
The only problem is that we've got some bogus datminmxid values
floating around.  Our next step will be to convince vacuum to
propagate the bogus datminmxid values back into pg_controldata.

4. Consume at least one transaction ID (e.g. SELECT txid_current())
and then do this:

postgres=# set vacuum_freeze_min_age = 0;
SET
postgres=# set vacuum_freeze_table_age = 0;
SET
postgres=# vacuum;
VACUUM

Setting the GUCs forces full table scans, so that we advance
relfrozenxid.  But notice that we were careful not to just run VACUUM
FREEZE, which would have also advanced relminmxid, which, for purposes
of reproducing this bug, is not what we want to happen.  So relminmxid
is still (incorrectly) set to 1 for every database.  However, since
the vacuum did advance relfrozenxid, it will call vac_truncate_clog,
which will call SetMultiXactIdLimit, which will propagate the bogus
datminmxid = 1 setting into shared memory.

(In my testing, this step doesn't work if performed on 9.3.4; you have
to do it on 9.3.5.  I think that's because of Tom's commit 78db307bb,
but I believe in a more complex test scenario you might be able to get
this to happen on 9.3.4 also.)

I believe it's the case that an autovacuum of even a single table can
substitute for this step if it happens to advance relfrozenxid but not
relminmxid.

5. The next checkpoint, or the shutdown checkpoint in any event, will
propagate the bogus value of 1 from shared memory back into the
control file.

6. Now try to start 9.3.7.  It will see the bogus oldestMultiXid = 1
value in the control file, attempt to read the corresponding offsets
file, and die.

In the process of investigating this, we found a few other things that
seem like they may also be bugs:

- As noted upthread, replaying an older checkpoint after a newer
checkpoint has already happened may lead to similar problems.  This
may be possible when starting from an online base backup; or when
restarting a standby that did not perform a restartpoint when
replaying the last checkpoint before the shutdown.

- pg_upgrade sets datminmxid =
old_cluster.controldata.chkpnt_nxtmulti, which is correct only if
there are ZERO multixacts in use at the time of the upgrade.  It would
be best, I think, to set this to the same value it had in the old
cluster, but if we're going to use a blanket value, I think it needs
to be chkpnt_oldstMulti.

- There's a third possible problem related to boundary cases in
SlruScanDirCbRemoveMembers, but I don't understand that one well
enough to explain it.  Maybe Thomas can jump in here and explain the
concern.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:03 AM, Robert Haas  wrote:
>> Steve, is there any chance we can get your pg_controldata output and a
>> list of all the files in pg_clog?
>
> Err, make that pg_multixact/members, which I assume is at issue here.
> You didn't show us the DETAIL line from this message, which would
> presumably clarify:
>
> FATAL:  could not access status of transaction 1

And I'm still wrong, probably.  The new code in 9.4.2 cares about
being able to look at an *offsets* file to find the corresponding
member offset.  So most likely it is an offsets file that is missing
here.  The question is, how are we ending up with an offsets file that
is referenced by the control file but not actually present on disk?

It seems like it would be good to compare the pg_controldata output to
what is actually present in pg_multixact/offsets (hopefully that's the
right directory, now that I'm on my third try) and try to understand
what is going on here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:01 AM, Robert Haas  wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
>  wrote:
>> Steve Kehlet wrote:
>>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>>> just dropped new binaries in place) but it wouldn't start up. I found this
>>> in the logs:
>>>
>>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>>  database system was shut down at 2015-05-27 13:12:55 PDT
>>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>>> starting up
>>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>>> transaction 1
>>
>> I am debugging today a problem currently that looks very similar to
>> this.  AFAICT the problem is that WAL replay of an online checkpoint in
>> which multixact files are removed fails because replay tries to read a
>> file that has already been removed.
>
> Wait a minute, wait a minute.  There's a serious problem with this
> theory, at least in Steve's scenario.  This message:
>
> 2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
> down at 2015-05-27
>
> That message implies a *clean shutdown*.  If he had performed an
> immediate shutdown or just pulled the plug, it would have said
> "database system was interrupted" or some such.
>
> There may be bugs in redo, also, but they don't explain what happened to 
> Steve.
>
> Steve, is there any chance we can get your pg_controldata output and a
> list of all the files in pg_clog?

Err, make that pg_multixact/members, which I assume is at issue here.
You didn't show us the DETAIL line from this message, which would
presumably clarify:

FATAL:  could not access status of transaction 1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
 wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Wait a minute, wait a minute.  There's a serious problem with this
theory, at least in Steve's scenario.  This message:

2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: database system was shut
down at 2015-05-27

That message implies a *clean shutdown*.  If he had performed an
immediate shutdown or just pulled the plug, it would have said
"database system was interrupted" or some such.

There may be bugs in redo, also, but they don't explain what happened to Steve.

Steve, is there any chance we can get your pg_controldata output and a
list of all the files in pg_clog?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
 wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Steve: Can you tell us more about how you shut down the old cluster?
Did you by any chance perform an immediate shutdown?  Do you have the
actual log messages that were written when the system was shut down
for the upgrade?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera
 wrote:
> Well I'm not very clear on what the problematic case is.  The scenario I
> actually saw this first reported was a pg_basebackup taken on a very
> large database, so the master could have truncated multixact and the
> standby receives a truncated directory but actually tries to apply a
> checkpoint that is much older than what the master currently has
> transmitted as pg_multixact contents.

OK, that makes sense.

>> That might be an OK fix, but this implementation doesn't seem very
>> clean.  If we're going to remove the invariant that
>> MultiXactState->oldestOffset will always be valid after replaying a
>> checkpoint, then we should be explicit about that and add a flag
>> indicating whether or not it's currently valid.  Shoving nextOffset in
>> there and hoping that's good enough seems like a bad idea to me.
>>
>> I think we should modify the API for find_multixact_start.  Let's have
>> it return a Boolean and return oldestOffset via an out parameter.  If
>> !InRecovery, it will always return true and set the out parameter; but
>> if in recovery, it is allowed to return false without setting the out
>> parameter.  Both values can get stored in MultiXactState, and we can
>> adjust the logic elsewhere to disregard oldestOffset when the
>> accompanying flag is false.
>
> Sounds good.  I think I prefer that multixact creation is rejected
> altogether if the new flag is false.  Is that what you mean when you say
> "adjust the logic"?

No.  I'm not sure quite what you mean here.  We can't reject multixact
creation during normal running, and during recovery, we won't create
any really new mulitxacts, but we must replay the creation of
multixacts.  What I meant was stuff like this:

if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
    (MultiXactState->nextOffset - MultiXactState->oldestOffset >
     MULTIXACT_MEMBER_SAFE_THRESHOLD))

I meant that we'd change the second prong of the test to check
MultiXactState->oldestOffsetValid && MultiXactState->nextOffset -
MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD.  And
likewise change anything else that relies on oldestOffset.  Or else we
guarantee that we can't reach those points until the oldestOffset is
valid, and then check that it is with an Assert() or elog().
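
Spelled out, the guarded prong might look like the fragment below;
oldestOffsetValid is a hypothetical name for the flag proposed
upthread, and the surrounding code is elided:

/* sketch only: trust the distance check only when the flag is set */
if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) ||
    (MultiXactState->oldestOffsetValid &&
     MultiXactState->nextOffset - MultiXactState->oldestOffset >
     MULTIXACT_MEMBER_SAFE_THRESHOLD))
{
    /* ... signal autovacuum, as the existing code does ... */
}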

>> This still leaves open an ugly possibility: can we reach normal
>> running without a valid oldestOffset?  If so, until the next
>> checkpoint happens, autovacuum has no clue whether it needs to worry.
>> There's got to be a fix for that, but it escapes me at the moment.
>
> I think the fix to that issue is to set the oldest offset on
> TrimMultiXact.  That way, once WAL replay finished we're certain that we
> have a valid oldest offset to create new multixacts with.
>
> I'm also wondering whether the call to DetermineSafeOldestOffset on
> StartupMultiXact is good.  At that point, we haven't replayed any WAL
> yet, so the oldest multi might be pointing at a file that has already
> been removed -- again considering the pg_basebackup scenario where the
> multixact files are copied much later than pg_control, so the checkpoint
> to replay is old but the pg_multixact contents have already been
> truncated in the master and are copied truncated.

Moving the call from StartupMultiXact() to TrimMultiXact() seems like
a good idea.  I'm not sure why we didn't do that before.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Robert Haas wrote:
> On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
>  wrote:
> > Steve Kehlet wrote:
> >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
> >> just dropped new binaries in place) but it wouldn't start up. I found this
> >> in the logs:
> >>
> >> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
> >>  database system was shut down at 2015-05-27 13:12:55 PDT
> >> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
> >> starting up
> >> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
> >> transaction 1
> >
> > I am debugging today a problem currently that looks very similar to
> > this.  AFAICT the problem is that WAL replay of an online checkpoint in
> > which multixact files are removed fails because replay tries to read a
> > file that has already been removed.
> 
> Hmm, so what exactly is the sequence of events here?  It's possible
> that I'm not thinking clearly just now, but it seems to me that if
> we're replaying the same checkpoint we replayed previously, the offset
> of the oldest multixact will be the first file that we didn't remove.

Well I'm not very clear on what the problematic case is.  The scenario I
actually saw this first reported was a pg_basebackup taken on a very
large database, so the master could have truncated multixact and the
standby receives a truncated directory but actually tries to apply a
checkpoint that is much older than what the master currently has
transmitted as pg_multixact contents.

> > I think the fix to this is to verify whether the file exists on disk
> > before reading it; if it doesn't, assume the truncation has already
> > happened and that it's not necessary to remove it.
> 
> That might be an OK fix, but this implementation doesn't seem very
> clean.  If we're going to remove the invariant that
> MultiXactState->oldestOffset will always be valid after replaying a
> checkpoint, then we should be explicit about that and add a flag
> indicating whether or not it's currently valid.  Shoving nextOffset in
> there and hoping that's good enough seems like a bad idea to me.
> 
> I think we should modify the API for find_multixact_start.  Let's have
> it return a Boolean and return oldestOffset via an out parameter.  If
> !InRecovery, it will always return true and set the out parameter; but
> if in recovery, it is allowed to return false without setting the out
> parameter.  Both values can get stored in MultiXactState, and we can
> adjust the logic elsewhere to disregard oldestOffset when the
> accompanying flag is false.

Sounds good.  I think I prefer that multixact creation is rejected
altogether if the new flag is false.  Is that what you mean when you say
"adjust the logic"?

> This still leaves open an ugly possibility: can we reach normal
> running without a valid oldestOffset?  If so, until the next
> checkpoint happens, autovacuum has no clue whether it needs to worry.
> There's got to be a fix for that, but it escapes me at the moment.

I think the fix to that issue is to set the oldest offset on
TrimMultiXact.  That way, once WAL replay finished we're certain that we
have a valid oldest offset to create new multixacts with.

I'm also wondering whether the call to DetermineSafeOldestOffset on
StartupMultiXact is good.  At that point, we haven't replayed any WAL
yet, so the oldest multi might be pointing at a file that has already
been removed -- again considering the pg_basebackup scenario where the
multixact files are copied much later than pg_control, so the checkpoint
to replay is old but the pg_multixact contents have already been
truncated in the master and are copied truncated.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera
 wrote:
> Steve Kehlet wrote:
>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we
>> just dropped new binaries in place) but it wouldn't start up. I found this
>> in the logs:
>>
>> waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG:
>>  database system was shut down at 2015-05-27 13:12:55 PDT
>> 2015-05-27 13:13:00 PDT [27342]: [1-1] FATAL:  the database system is
>> starting up
>> .2015-05-27 13:13:00 PDT [27341]: [2-1] FATAL:  could not access status of
>> transaction 1
>
> I am debugging today a problem currently that looks very similar to
> this.  AFAICT the problem is that WAL replay of an online checkpoint in
> which multixact files are removed fails because replay tries to read a
> file that has already been removed.

Hmm, so what exactly is the sequence of events here?  It's possible
that I'm not thinking clearly just now, but it seems to me that if
we're replaying the same checkpoint we replayed previously, the offset
of the oldest multixact will be the first file that we didn't remove.
However, I can see that there could be a problem if we try to replay
an older checkpoint after having already replayed a new one - for
example, if a standby replays checkpoint A truncating the members
multixact and performs a restart point, and then replays checkpoint B
truncating the members multixact again but without performing a
restartpoint, and then is shut down, it will resume replay from
checkpoint A, and trouble will ensue.  Is that the scenario, or is
there something else?

> I think the fix to this is to verify whether the file exists on disk
> before reading it; if it doesn't, assume the truncation has already
> happened and that it's not necessary to remove it.

That might be an OK fix, but this implementation doesn't seem very
clean.  If we're going to remove the invariant that
MultiXactState->oldestOffset will always be valid after replaying a
checkpoint, then we should be explicit about that and add a flag
indicating whether or not it's currently valid.  Shoving nextOffset in
there and hoping that's good enough seems like a bad idea to me.

I think we should modify the API for find_multixact_start.  Let's have
it return a Boolean and return oldestOffset via an out parameter.  If
!InRecovery, it will always return true and set the out parameter; but
if in recovery, it is allowed to return false without setting the out
parameter.  Both values can get stored in MultiXactState, and we can
adjust the logic elsewhere to disregard oldestOffset when the
accompanying flag is false.
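
In other words, something along these lines -- a sketch of the
reworked function, where SimpleLruDoesPhysicalPageExist() is the
existing slru.c check and read_offset_from_slru() is a placeholder for
the page-reading logic the function already has:

static bool
find_multixact_start(MultiXactId multi, MultiXactOffset *result)
{
    int         pageno = MultiXactIdToOffsetPage(multi);

    /* in recovery the segment may be legitimately absent: fail softly */
    if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
        return false;

    /* otherwise read the offset as the current code does */
    *result = read_offset_from_slru(multi);     /* placeholder for the
                                                 * existing lookup */
    return true;
}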

This still leaves open an ugly possibility: can we reach normal
running without a valid oldestOffset?  If so, until the next
checkpoint happens, autovacuum has no clue whether it needs to worry.
There's got to be a fix for that, but it escapes me at the moment.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general