[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Thomas Munro
On Wed, Jun 17, 2015 at 6:58 AM, Alvaro Herrera wrote: > Thomas Munro wrote: > >> Thanks. As mentioned elsewhere in the thread, I discovered that the >> same problem exists for page boundaries, with a different error >> message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1 >> an

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-16 Thread Alvaro Herrera
Thomas Munro wrote: > Thanks. As mentioned elsewhere in the thread, I discovered that the > same problem exists for page boundaries, with a different error > message. I've tried the attached repro scripts on 9.3.0, 9.3.5, 9.4.1 > and master with the same results: > > FATAL: could not access s

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Alvaro Herrera
Robert Haas wrote: > On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: > > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: > >> Here's a new version with some more fixes and improvements: > > > > I read through this version and found nothing to change. I encourage other > > hackers t

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Robert Haas
On Fri, Jun 5, 2015 at 2:20 AM, Noah Misch wrote: > On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: >> Here's a new version with some more fixes and improvements: > > I read through this version and found nothing to change. I encourage other > hackers to study the patch, though. The

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-05 Thread Thomas Munro
On Fri, Jun 5, 2015 at 1:47 PM, Thomas Munro wrote: > On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro > wrote: >> On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas wrote: >>> Here's a new version with some more fixes and improvements: >>> [...] >> >> With this patch, when I run the script >> "checkpoint

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Noah Misch
On Thu, Jun 04, 2015 at 05:29:51PM -0400, Robert Haas wrote: > Here's a new version with some more fixes and improvements: I read through this version and found nothing to change. I encourage other hackers to study the patch, though. The surrounding code is challenging. > With this version, I'm

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 11:47 AM, Thomas Munro wrote: > On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas wrote: >> Here's a new version with some more fixes and improvements: >> >> - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset >> when the oldest offset became known if the now-k

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Thomas Munro
On Fri, Jun 5, 2015 at 9:29 AM, Robert Haas wrote: > Here's a new version with some more fixes and improvements: > > - SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset > when the oldest offset became known if the now-known value happened to > be zero. Fixed. > > - SetOffsetVac

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 5:29 PM, Robert Haas wrote: > - Forces aggressive autovacuuming when the control file's > oldestMultiXid doesn't point to a valid MultiXact and enables member > wraparound at the next checkpoint following the correction of that > problem. Err, enables member wraparound *pro

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas wrote: > On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas wrote: >> Thanks for the review. > > Here's a new version. I've fixed the things Alvaro and Noah noted, > and some compiler warnings about set but unused variables. > > I also tested it, and it does

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Alvaro Herrera
Alvaro Herrera wrote: > Robert Haas wrote: > > > So here's a patch taking a different approach. > > I tried to apply this to 9.3 but it's messy because of pgindent. Anyone > would have a problem with me backpatching a pgindent run of multixact.c? Done. -- Álvaro Herrera http://

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 1:27 PM, Andres Freund wrote: > On 2015-06-04 12:57:42 -0400, Robert Haas wrote: >> + /* >> + * Do we need an emergency autovacuum? If we're not sure, assume yes. >> + */ >> + return !oldestOffsetKnown || >> + (nextOffset - oldestOffset > MULTI

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Andres Freund
Hi, On 2015-06-04 12:57:42 -0400, Robert Haas wrote: > + /* > + * Do we need an emergency autovacuum? If we're not sure, assume yes. > + */ > + return !oldestOffsetKnown || > + (nextOffset - oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD); I think without teaching a
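For readers following the quoted hunk, here is a minimal standalone sketch of why the subtraction in that check still works after the member offset counter wraps. It is not the patch itself: the function name and the threshold macro below are illustrative, and the threshold value (half of the 32-bit offset space) is an assumption that should be checked against multixact.c.

```c
#include <stdbool.h>
#include <stdint.h>

/* Member offsets are 32-bit counters that wrap around. */
typedef uint32_t MultiXactOffset;

/*
 * Assumed for illustration: half of the 32-bit offset space. Check
 * multixact.c for the definition your version actually uses.
 */
#define MEMBER_SAFE_THRESHOLD (UINT32_MAX / 2)

/*
 * Sketch of the quoted check. Unsigned subtraction is modulo 2^32, so
 * next_offset - oldest_offset is the amount of member space in use even
 * when next_offset has already wrapped past zero.
 */
static bool
needs_emergency_autovacuum(bool oldest_offset_known,
                           MultiXactOffset oldest_offset,
                           MultiXactOffset next_offset)
{
    /* If we're not sure where the oldest offset is, assume yes. */
    if (!oldest_offset_known)
        return true;

    return (MultiXactOffset) (next_offset - oldest_offset) > MEMBER_SAFE_THRESHOLD;
}
```

With oldest_offset = 0xFFFFFF00 and next_offset = 0x100 the difference is 0x200, well under the threshold, so the wrapped case is not mistaken for an emergency.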

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas wrote: > Thanks for the review. Here's a new version. I've fixed the things Alvaro and Noah noted, and some compiler warnings about set but unused variables. I also tested it, and it doesn't quite work as hoped. If started on a cluster where oldestMu

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-04 Thread Robert Haas
On Thu, Jun 4, 2015 at 2:42 AM, Noah Misch wrote: > I like that change a lot. It's much easier to seek forgiveness for wasting <= > 28 GiB of disk than for deleting visibility information wrongly. I'm glad you like it. I concur. >> 2. If setting the offset stop limit (the point where we refuse

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Noah Misch
On Wed, Jun 03, 2015 at 04:53:46PM -0400, Robert Haas wrote: > So here's a patch taking a different approach. In this approach, if > the multixact whose members we want to look up doesn't exist, we don't > use a later one (that might or might not be valid). Instead, we > attempt to cope with the

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Mon, Jun 1, 2015 at 4:55 PM, Noah Misch wrote: > While testing this (with inconsistent-multixact-fix-master.patch applied, > FWIW), I noticed a nearby bug with a similar symptom. TruncateMultiXact() > omits the nextMXact==oldestMXact special case found in each other > find_multixact_start() ca

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Robert Haas wrote: > So here's a patch taking a different approach. I tried to apply this to 9.3 but it's messy because of pgindent. Anyone would have a problem with me backpatching a pgindent run of multixact.c? Also, you have a new function SlruPageExists, but we already have SimpleLruDoesPhy

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 8:24 AM, Robert Haas wrote: > On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund wrote: >>> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning >>> > the disk it'll always get one at a segment boundary, right? I'm not sure >>> > that's actually ok; because th

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote: > On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: > > One idea I had was: what if the oldestMulti pointed to another multi > > earlier in the same 0046 file, so that it is read-as-zeroes (and the > > file is created), and then a subsequent multixact truncate tries to read

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 15:01:46 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > That's not necessarily the case though, given how the code currently > > works. In a bunch of places the SLRUs are accessed *before* having been > > made consistent by WAL replay. Especially if several checkpoints/vacuum

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Alvaro Herrera wrote: > Really, the whole question of how this code goes past the open() failure > in SlruPhysicalReadPage baffles me. I don't see any possible way for > the file to be created ... Hmm, the checkpointer can call TruncateMultiXact when in recovery, on restartpoints. I wonder if in

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Andres Freund wrote: > On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote: > > Thomas Munro wrote: > > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera > > > wrote: > > > > My guess is that the file existed, and perhaps had one or more pages, > > > > but the wanted page doesn't exist, so we tried

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Andres Freund
On 2015-06-03 00:42:55 -0300, Alvaro Herrera wrote: > Thomas Munro wrote: > > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera > > wrote: > > > My guess is that the file existed, and perhaps had one or more pages, > > > but the wanted page doesn't exist, so we tried to read but got 0 bytes > > > ba

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Alvaro Herrera
Thomas Munro wrote: > I have finally reproduced that error! See attached repro shell script. > > The conditions are: > > 1. next multixact == oldest multixact (no active multixacts, pointing > past the end) > 2. next multixact would be the first item on a new page (multixact % 2048 == > 0) >
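The two conditions visible in this preview can be written down as a small predicate. This is only a sketch: the helper names are hypothetical, the preview is cut off so further conditions may apply, and the page geometry (8192-byte pages, 4-byte offsets) is assumed because it yields the 2048 entries per page that the message refers to.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/*
 * Assumed layout: 8192-byte pages holding 4-byte offsets, which gives the
 * 2048 entries per page behind the "% 2048" in the message.
 */
#define PAGE_SIZE_BYTES   8192
#define OFFSET_SIZE_BYTES 4
#define OFFSETS_PER_PAGE  (PAGE_SIZE_BYTES / OFFSET_SIZE_BYTES)

/* True if this multixact would be the first entry on a new offsets page. */
static bool
is_first_entry_on_page(MultiXactId multi)
{
    return multi % OFFSETS_PER_PAGE == 0;
}

/*
 * Conditions 1 and 2 as far as the preview shows them: no active
 * multixacts (next == oldest) and next sits on a page boundary.
 */
static bool
matches_visible_repro_conditions(MultiXactId oldest_multi,
                                 MultiXactId next_multi)
{
    return next_multi == oldest_multi && is_first_entry_on_page(next_multi);
}
```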

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Tue, Jun 2, 2015 at 5:22 PM, Andres Freund wrote: >> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning >> > the disk it'll always get one at a segment boundary, right? I'm not sure >> > that's actually ok; because the value at the beginning of the segment >> > can very wel

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Robert Haas
On Wed, Jun 3, 2015 at 4:48 AM, Thomas Munro wrote: > On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera > wrote: >> Thomas Munro wrote: >>> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera >>> wrote: >>> > My guess is that the file existed, and perhaps had one or more pages, >>> > but the wanted pa

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-03 Thread Thomas Munro
On Wed, Jun 3, 2015 at 3:42 PM, Alvaro Herrera wrote: > Thomas Munro wrote: >> On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera >> wrote: >> > My guess is that the file existed, and perhaps had one or more pages, >> > but the wanted page doesn't exist, so we tried to read but got 0 bytes >> > back

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Alvaro Herrera
Thomas Munro wrote: > On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera > wrote: > > My guess is that the file existed, and perhaps had one or more pages, > > but the wanted page doesn't exist, so we tried to read but got 0 bytes > > back. read() returns 0 in this case but doesn't set errno. > > >

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Thomas Munro
On Tue, Jun 2, 2015 at 9:30 AM, Alvaro Herrera wrote: > My guess is that the file existed, and perhaps had one or more pages, > but the wanted page doesn't exist, so we tried to read but got 0 bytes > back. read() returns 0 in this case but doesn't set errno. > > I didn't find a way to set things
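The read() behaviour described in the quoted guess is easy to check in isolation. The following is a sketch using plain POSIX calls, not PostgreSQL's SLRU code; the scratch file name and helper names are made up for the demonstration.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read `len` bytes at `offset` from `fd`. Returns read()'s byte count, or
 * -1 on error. errno is cleared before the read, and *saved_errno
 * receives its value right after the call.
 */
static ssize_t
read_at(int fd, off_t offset, char *buf, size_t len, int *saved_errno)
{
    ssize_t n;

    if (lseek(fd, offset, SEEK_SET) < 0)
    {
        *saved_errno = errno;
        return -1;
    }
    errno = 0;
    n = read(fd, buf, len);
    *saved_errno = errno;
    return n;
}

/*
 * Create a short scratch file and read a "page" that lies past its end.
 * Returns 1 if read() yields 0 bytes with errno still 0, 0 if it behaves
 * differently, and -1 if the scratch file could not be set up.
 */
static int
eof_read_is_silent(void)
{
    const char *path = "eof_read_demo.tmp";     /* hypothetical scratch file */
    char        buf[16];
    int         saved_errno = 0;
    ssize_t     n;
    int         fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    unlink(path);               /* file disappears once fd is closed */
    if (write(fd, "page0", 5) != 5)
    {
        close(fd);
        return -1;
    }
    n = read_at(fd, 8192, buf, sizeof buf, &saved_errno);
    close(fd);
    return (n == 0 && saved_errno == 0) ? 1 : 0;
}
```

A caller that treats the short read as a failure and then formats errno will report whatever value errno happened to hold, which is one way an error message can end up looking unrelated to the real cause.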

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
> > Hm. If GetOldestMultiXactOnDisk() gets the starting point by scanning > > the disk it'll always get one at a segment boundary, right? I'm not sure > > that's actually ok; because the value at the beginning of the segment > > can very well end up being a 0, as MaybeExtendOffsetSlru() will have >
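To make the segment-boundary concern concrete, here is a rough sketch of how a segment file name maps to the first multixact ID it can hold. The helper is hypothetical and the constants are assumptions (2048 offsets per page, 32 pages per segment, four-hex-digit file names such as the "0046" mentioned elsewhere in this thread); it is not the code under review.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t MultiXactId;

/* Assumed geometry of the offsets SLRU; verify against the source tree. */
#define OFFSETS_PER_PAGE    2048
#define PAGES_PER_SEGMENT   32
#define OFFSETS_PER_SEGMENT (OFFSETS_PER_PAGE * PAGES_PER_SEGMENT)

/*
 * Map a hex segment file name to the first multixact ID stored in that
 * segment. Returns 0 on success, -1 if the name is not a valid segment
 * number.
 */
static int
first_multi_in_segment(const char *segment_name, MultiXactId *first_multi)
{
    char           *end;
    unsigned long   segno = strtoul(segment_name, &end, 16);

    if (end == segment_name || *end != '\0' || segno > 0xFFFF)
        return -1;

    *first_multi = (MultiXactId) segno * OFFSETS_PER_SEGMENT;
    return 0;
}
```

Under these assumptions segment "0046" starts at multixact 4587520. A directory scan can only ever produce values of this form, so it cannot tell whether the true oldest multixact lies further into the segment, which is the situation the message is worried about.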

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 4:19 PM, Andres Freund wrote: > I'm not really convinced tying things closer to having done trimming is > easier to understand than tying things to recovery having finished. > > E.g. > if (did_trim) > oldestOffset = GetOldestReferencedOffset(oldest_da

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote: > commit d33b4eb0167f465edb00bd6c0e1bcaa67dd69fe9 > Author: Robert Haas > Date: Fri May 29 14:35:53 2015 -0400 > > foo Hehe! > diff --git a/src/backend/access/transam/multixact.c > b/src/backend/access/transam/multixact.c > index 9568ff1.

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:49:56 -0400, Robert Haas wrote: > On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund wrote: > > On 2015-06-02 11:37:02 -0400, Robert Haas wrote: > >> The exact circumstances under which we're willing to replace a > >> relminmxid with a newly-computed one that differs are not altogethe

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Noah Misch
On Tue, Jun 02, 2015 at 11:16:22AM -0400, Robert Haas wrote: > On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch wrote: > > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: > > Granted. Would it be better to update both functions at the same time, and > > perhaps to make that a master-only

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:44 AM, Andres Freund wrote: > On 2015-06-02 11:37:02 -0400, Robert Haas wrote: >> The exact circumstances under which we're willing to replace a >> relminmxid with a newly-computed one that differs are not altogether >> clear to me, but there's an "if" statement protectin

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:36 AM, Andres Freund wrote: >> That would be a departure from the behavior of every existing release >> that includes this code based on, to my knowledge, zero trouble >> reports. > > On the other hand we're now at about bug #5 attributable to the odd way > truncation wo

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:37:02 -0400, Robert Haas wrote: > The exact circumstances under which we're willing to replace a > relminmxid with a newly-computed one that differs are not altogether > clear to me, but there's an "if" statement protecting that logic, so > there are some circumstances in which we'

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 11:27 AM, Andres Freund wrote: > On 2015-06-02 11:16:22 -0400, Robert Haas wrote: >> I'm having trouble figuring out what to do about this. I mean, the >> essential principle of this patch is that if we can't count on >> relminmxid, datminmxid, or the control file to be acc

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:29:24 -0400, Robert Haas wrote: > On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund wrote: > > But what *definitely* looks wrong to me is that a TruncateMultiXact() in > > this scenario now (since a couple weeks ago) does a > > SimpleLruReadPage_ReadOnly() in the members slru via > >

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 8:56 AM, Andres Freund wrote: > But what *definitely* looks wrong to me is that a TruncateMultiXact() in > this scenario now (since a couple weeks ago) does a > SimpleLruReadPage_ReadOnly() in the members slru via > find_multixact_start(). That just won't work acceptably whe

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-02 11:16:22 -0400, Robert Haas wrote: > I'm having trouble figuring out what to do about this. I mean, the > essential principle of this patch is that if we can't count on > relminmxid, datminmxid, or the control file to be accurate, we can at > least look at what is present on the disk

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Robert Haas
On Tue, Jun 2, 2015 at 1:21 AM, Noah Misch wrote: > On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: >> On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch wrote: >> > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: >> >> SetMultiXactIdLimit() bracketed certain parts of its >> >>

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-02 Thread Andres Freund
On 2015-06-01 14:22:32 -0400, Robert Haas wrote: > On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund wrote: > > The lack of WAL logging actually has caused problems in the 9.3.3 (?) > > era, where we didn't do any truncation during recovery... > > Right, but now we're piggybacking on the checkpoint r

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Noah Misch
On Mon, Jun 01, 2015 at 02:06:05PM -0400, Robert Haas wrote: > On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch wrote: > > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: > >> SetMultiXactIdLimit() bracketed certain parts of its > >> logic with if (!InRecovery), but those guards were ineff

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote: > Anyway here's a quick script to almost-reproduce the problem. Meh. Really attached now. I also wanted to post the error messages we got: 2015-05-27 16:15:17 UTC [4782]: [3-1] user=,db= LOG: entering standby mode 2015-05-27 16:15:18 UTC [4782]: [4-1] user=,db= LOG: resto

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Alvaro Herrera wrote: > Robert Haas wrote: > > In the process of investigating this, we found a few other things that > > seem like they may also be bugs: > > > > - As noted upthread, replaying an older checkpoint after a newer > > checkpoint has already happened may lead to similar problems. Th

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Alvaro Herrera
Thomas Munro wrote: > > - There's a third possible problem related to boundary cases in > > SlruScanDirCbRemoveMembers, but I don't understand that one well > > enough to explain it. Maybe Thomas can jump in here and explain the > > concern. > > I noticed something in passing which is probably n

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund wrote: >> I'm probably biased here, but I think we should finish reviewing, >> testing, and committing my patch before we embark on designing this. > > Probably, yes. I am wondering whether doing this immediately won't end > up making some things simpl

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Robert Haas
On Mon, Jun 1, 2015 at 12:46 AM, Noah Misch wrote: > Incomplete review, done in a relative rush: Thanks. > On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: >> OK, here's a patch. Actually two patches, differing only in >> whitespace, for 9.3 and for master (ha!). I now think that t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-06-01 Thread Andres Freund
On 2015-05-31 07:51:59 -0400, Robert Haas wrote: > > 1) We continue determining the oldest > > SlruScanDirectory(SlruScanDirCbFindEarliest) > >on the master to find the oldest offsets segment to > >truncate. Alternatively, if we determine it to be safe, we could use > >oldestMulti to f

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch
On Fri, May 29, 2015 at 10:37:57AM +1200, Thomas Munro wrote: > On Fri, May 29, 2015 at 7:56 AM, Robert Haas wrote: > > - There's a third possible problem related to boundary cases in > > SlruScanDirCbRemoveMembers, but I don't understand that one well > > enough to explain it. Maybe Thomas can j

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Noah Misch
Incomplete review, done in a relative rush: On Fri, May 29, 2015 at 03:08:11PM -0400, Robert Haas wrote: > OK, here's a patch. Actually two patches, differing only in > whitespace, for 9.3 and for master (ha!). I now think that the root > of the problem here is that DetermineSafeOldestOffset() a

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-31 Thread Robert Haas
On Sat, May 30, 2015 at 8:55 PM, Andres Freund wrote: > Is oldestMulti, nextMulti - 1 really suitable for this? Are both > actually guaranteed to exist in the offsets slru and be valid? Hm. I > guess you intend to simply truncate everything else, but just in > offsets? oldestMulti in theory is t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-30 Thread Andres Freund
On 2015-05-30 00:52:37 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > I considered for a second whether the solution for that could be to not > > truncate while inconsistent - but I think that doesn't solve anything as > > then we can end up with directories where every single offsets/me

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Bruce Momjian wrote: > I think we need to step back and look at the brain power required to > unravel the mess we have made regarding multi-xact and fixes. (I bet > few people can even remember which multi-xact fixes went into which > releases --- I can't.) Instead of working on actual features,

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Alvaro Herrera
Andres Freund wrote: > I considered for a second whether the solution for that could be to not > truncate while inconsistent - but I think that doesn't solve anything as > then we can end up with directories where every single offsets/member > file exists. Hang on a minute. We don't need to scan

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 1:46 PM, Andres Freund wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hint at a bigger problem

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 9:46 PM, Andres Freund wrote: > On 2015-05-29 15:08:11 -0400, Robert Haas wrote: >> It seems pretty clear that we can't effectively determine anything >> about member wraparound until the cluster is consistent. > > I wonder if this doesn't actually hint at a bigger problem

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 3:08 PM, Robert Haas wrote: > It won't fix the fact that pg_upgrade is putting > a wrong value into everybody's datminmxid field, which should really > be addressed too, but I've been working on this for about three days > virtually non-stop and I don't have the energy to t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:08:11 -0400, Robert Haas wrote: > It seems pretty clear that we can't effectively determine anything > about member wraparound until the cluster is consistent. I wonder if this doesn't actually hint at a bigger problem. Currently, to determine where we need to truncate SlruScanD

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-29 15:49:53 -0400, Bruce Momjian wrote: > I think we need to step back and look at the brain power required to > unravel the mess we have made regarding multi-xact and fixes. (I bet > few people can even remember which multi-xact fixes went into which > releases --- I can't.) Instead o

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-30 10:55:30 +1200, Thomas Munro wrote: > That's the error message, but then further down: Ooops. > "I have confirmed that directory "pg_multixact/members" does not > existing in the restored data directory. > > I can see this directory and the file if i restore a few days old > backup.

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 10:48 AM, Andres Freund wrote: > On 2015-05-30 10:41:01 +1200, Thomas Munro wrote: >> On Sat, May 30, 2015 at 10:29 AM, Robert Haas wrote: >> > On Fri, May 29, 2015 at 5:14 PM, Josh Berkus wrote: >> >> Just saw what looks like a report of this issue on 9.2. >> >> >> >> ht

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Andres Freund
On 2015-05-30 10:41:01 +1200, Thomas Munro wrote: > On Sat, May 30, 2015 at 10:29 AM, Robert Haas wrote: > > On Fri, May 29, 2015 at 5:14 PM, Josh Berkus wrote: > >> Just saw what looks like a report of this issue on 9.2. > >> > >> https://github.com/wal-e/wal-e/issues/177 > > > > Urk. That look

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Sat, May 30, 2015 at 10:29 AM, Robert Haas wrote: > On Fri, May 29, 2015 at 5:14 PM, Josh Berkus wrote: >> Just saw what looks like a report of this issue on 9.2. >> >> https://github.com/wal-e/wal-e/issues/177 > > Urk. That looks awfully similar, but I don't think any of the code > that is a

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 5:14 PM, Josh Berkus wrote: > Just saw what looks like a report of this issue on 9.2. > > https://github.com/wal-e/wal-e/issues/177 Urk. That looks awfully similar, but I don't think any of the code that is affected here exists in 9.2, or that any of the fixes involved we

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Steve Kehlet
On Fri, May 29, 2015 at 12:08 PM Robert Haas wrote: > OK, here's a patch. > I grabbed branch REL9_4_STABLE from git, and Robert got me a 9.4-specific patch. I rebuilt, installed, and postgres started up successfully! I did a bunch of checks, had our app run several thousand SQL queries against

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Josh Berkus
All, Just saw what looks like a report of this issue on 9.2. https://github.com/wal-e/wal-e/issues/177 -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Bruce Momjian
On Thu, May 28, 2015 at 07:24:26PM -0400, Robert Haas wrote: > On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake > wrote: > > FTR: Robert, you have been a Samurai on this issue. Our many thanks. > > Thanks! I really appreciate the kind words. > > So, in thinking through this situation further,

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 12:43 PM, Robert Haas wrote: > Working on that now. OK, here's a patch. Actually two patches, differing only in whitespace, for 9.3 and for master (ha!). I now think that the root of the problem here is that DetermineSafeOldestOffset() and SetMultiXactIdLimit() were larg

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Robert Haas
On Fri, May 29, 2015 at 10:17 AM, Tom Lane wrote: > Thomas Munro writes: >> On Fri, May 29, 2015 at 11:24 AM, Robert Haas wrote: >>> B. We need to change find_multixact_start() to fail softly. > >> Here is an experimental WIP patch that changes StartupMultiXact and >> SetMultiXactIdLimit to find

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Tom Lane
Thomas Munro writes: > On Fri, May 29, 2015 at 11:24 AM, Robert Haas wrote: >> B. We need to change find_multixact_start() to fail softly. > Here is an experimental WIP patch that changes StartupMultiXact and > SetMultiXactIdLimit to find the oldest multixact that exists on disk > (by scanning t

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Christoph Berg
Re: Robert Haas 2015-05-29 > > FTR: Robert, you have been a Samurai on this issue. Our many thanks. > > Thanks! I really appreciate the kind words. I'm still watching with admiration. This list of steps-to-reproduce is the longest and at the same time best I've ever seen. If anyone ever asks

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-29 Thread Thomas Munro
On Fri, May 29, 2015 at 11:24 AM, Robert Haas wrote: > A. Most obviously, we should fix pg_upgrade so that it installs > chkpnt_oldstMulti instead of chkpnt_nxtmulti into datfrozenxid, so > that we stop creating new instances of this problem. That won't get > us out of the hole we've dug for ours

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 10:41 PM, Alvaro Herrera wrote: >> 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid >> values which are equal to the next-mxid counter instead of the correct >> value; in other words, they are too new. > > [ discussion of how the control file's oldestMul

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Alvaro Herrera wrote: > Robert Haas wrote: > > > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > > values which are equal to the next-mxid counter instead of the correct > > value; in other words, they are too new. > > What you describe is what happens if you upgrade from 9

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote: > 2. If you pg_upgrade to 9.3.7 or 9.4.2, then you may have datminmxid > values which are equal to the next-mxid counter instead of the correct > value; in other words, they are too new. What you describe is what happens if you upgrade from 9.2 or earlier. For this case we use

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 4:06 PM, Joshua D. Drake wrote: > FTR: Robert, you have been a Samurai on this issue. Our many thanks. Thanks! I really appreciate the kind words. So, in thinking through this situation further, it seems to me that the situation is pretty dire: 1. If you pg_upgrade to 9

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Thomas Munro
On Fri, May 29, 2015 at 7:56 AM, Robert Haas wrote: > On Thu, May 28, 2015 at 8:51 AM, Robert Haas wrote: >> [ speculation ] > > [...] However, since > the vacuum did advance relfrozenxid, it will call vac_truncate_clog, > which will call SetMultiXactIdLimit, which will propagate the bogus > dat

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Alvaro Herrera
Robert Haas wrote: > On Thu, May 28, 2015 at 8:51 AM, Robert Haas wrote: > > [ speculation ] > > OK, I finally managed to reproduce this, after some off-list help from > Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to > do it: It's a long list of steps, but if you consider

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Joshua D. Drake
On 05/28/2015 12:56 PM, Robert Haas wrote: FTR: Robert, you have been a Samurai on this issue. Our many thanks. Sincerely, jD

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:51 AM, Robert Haas wrote: > [ speculation ] OK, I finally managed to reproduce this, after some off-list help from Steve Kehlet (the reporter), Alvaro, and Thomas Munro. Here's how to do it: 1. Install any pre-9.3 version of the server and generate enough multixacts to

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:03 AM, Robert Haas wrote: >> Steve, is there any chance we can get your pg_controldata output and a >> list of all the files in pg_clog? > > Err, make that pg_multixact/members, which I assume is at issue here. > You didn't show us the DETAIL line from this message, which

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Thu, May 28, 2015 at 8:01 AM, Robert Haas wrote: > On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera > wrote: >> Steve Kehlet wrote: >>> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we >>> just dropped new binaries in place) but it wouldn't start up. I found this >>> i

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote: > Steve Kehlet wrote: >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we >> just dropped new binaries in place) but it wouldn't start up. I found this >> in the logs: >> >> waiting for server to start2015-05-27 1

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-28 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote: > Steve Kehlet wrote: >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we >> just dropped new binaries in place) but it wouldn't start up. I found this >> in the logs: >> >> waiting for server to start2015-05-27 1

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 10:14 PM, Alvaro Herrera wrote: > Well I'm not very clear on what's the problematic case. The scenario I > actually saw this first reported was a pg_basebackup taken on a very > large database, so the master could have truncated multixact and the > standby receives a trunc

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Robert Haas wrote: > On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera > wrote: > > Steve Kehlet wrote: > >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we > >> just dropped new binaries in place) but it wouldn't start up. I found this > >> in the logs: > >> > >> waiting

Re: [HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Robert Haas
On Wed, May 27, 2015 at 6:21 PM, Alvaro Herrera wrote: > Steve Kehlet wrote: >> I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we >> just dropped new binaries in place) but it wouldn't start up. I found this >> in the logs: >> >> waiting for server to start2015-05-27 1

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Steve Kehlet wrote: > On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera > wrote: > > > I think a patch like this should be able to fix it ... not tested yet. > > > > Thanks Alvaro. I got a compile error, so looked for other uses of > SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Steve Kehlet
On Wed, May 27, 2015 at 3:21 PM Alvaro Herrera wrote: > I think a patch like this should be able to fix it ... not tested yet. Thanks Alvaro. I got a compile error, so looked for other uses of SimpleLruDoesPhysicalPageExist and added MultiXactOffsetCtl, does this look right? + (!InRecovery |

[HACKERS] Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1

2015-05-27 Thread Alvaro Herrera
Steve Kehlet wrote: > I have a database that was upgraded from 9.4.1 to 9.4.2 (no pg_upgrade, we > just dropped new binaries in place) but it wouldn't start up. I found this > in the logs: > > waiting for server to start2015-05-27 13:13:00 PDT [27341]: [1-1] LOG: > database system was shut do