On 23 April 2017 at 01:19, Andres Freund <and...@anarazel.de> wrote:
> On 2017-04-22 19:55:18 -0400, Tom Lane wrote:
>> Now that we've got consistent failure reports about the 009_twophase.pl
>> recovery test, I set out to find out why it's failing. It looks to me
>> like the reason is that this (twophase.c:2145):
>>
>>     SubTransSetParent(xid, subxid, overwriteOK);
>>
>> ought to be this:
>>
>>     SubTransSetParent(subxid, xid, overwriteOK);
>>
>> because the definition of SubTransSetParent is
>>
>> void
>> SubTransSetParent(TransactionId xid, TransactionId parent, bool overwriteOK)
>>
>> not the other way 'round.
>>
>> While "git blame" blames this line on the recent commit 728bd991c,
>> that just moved the call from somewhere else. AFAICS this has actually
>> been wrong since StandbyRecoverPreparedTransactions was written,
>> in 361bd1662 of 2010-04-13.
>
>> Also, when I fix that, it gets further but still crashes at the same
>> Assert in SubTransSetParent. The proximate cause this time seems to be
>> that RecoverPreparedTransactions's calculation of overwriteOK is wrong:
>> it's computing that as "false", but in reality the subtrans link in
>> question has already been set.
>>
>
> Yikes. This is clearly way undertested. It's also pretty scary that
> the code has recently been whacked out quite heavily (both 9.6 and
> master), without hitting anything around this - doesn't seem to bode
> well for how in-depth the testing was.
Obviously if there is a bug it's because tests didn't find it and
therefore it is by definition undertested for that specific bug. I'm
not sure what other conclusion you wish to draw, if any?

>> It's not clear to me how much potential this has to create user data
>> corruption, but it doesn't look good at first glance. Discuss.
>
> Hm. I think it can cause wrong tqual.c results in some edge cases.
> During HS, lastOverflowedXid will be set in some cases, and then
> XidInMVCCSnapshot etc will not find the actual toplevel xid, which'll in
> turn cause lookups snapshot->subxip (where all HS xids reside)
> potentially return wrong results.
>
> I was about to say that I don't see how it could result in persistent
> corruption however - the subtrans lookups are only done for
> (snapshot->xmin, snapshot->xmax] and subtrans is truncated
> regularly. But these days CHECKPOINT_END_OF_RECOVERY isn't obligatory
> anymore, so that might be delayed. Hm.

I've not found any reason, yet, to believe we return wrong answers in
any case, even though the transient data structure pg_subtrans is
corrupted by the switched call Tom discovers.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
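
For anyone following the thread without the source open, the effect of the switched arguments can be sketched with a toy model. This is not the PostgreSQL code: the array, the `toy_` names and the 64-entry limit are all invented for illustration, and the `false` return only stands in for the Assert that Tom mentions. The one thing taken from the thread is the argument order, child xid first and parent second.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for pg_subtrans: parent_of[x] holds the parent
 * of xid x, and 0 means "no parent recorded". */
typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0u
#define TOY_MAX_XID 64u

static ToyXid parent_of[TOY_MAX_XID];

static void toy_reset(void)
{
    for (unsigned i = 0; i < TOY_MAX_XID; i++)
        parent_of[i] = TOY_INVALID_XID;
}

/* Same argument order as SubTransSetParent in the thread: the child
 * comes first, its parent second. Returns false where the slot is
 * already set and overwriting is not allowed (assumed to model the
 * condition the real Assert guards). */
static bool toy_set_parent(ToyXid xid, ToyXid parent, bool overwrite_ok)
{
    assert(xid < TOY_MAX_XID && parent < TOY_MAX_XID);
    if (parent_of[xid] != TOY_INVALID_XID && !overwrite_ok)
        return false;
    parent_of[xid] = parent;
    return true;
}

/* Follow parent links until an xid with no recorded parent is reached.
 * The step bound keeps a corrupted (cyclic) map from looping forever. */
static ToyXid toy_get_topmost(ToyXid xid)
{
    ToyXid cur = xid;
    unsigned steps = 0;

    while (parent_of[cur] != TOY_INVALID_XID && steps++ < TOY_MAX_XID)
        cur = parent_of[cur];
    return cur;
}
```

With top-level xid 10 and subxids 11 and 12, `toy_set_parent(11, 10, false)` records the link the right way round and `toy_get_topmost(11)` yields 10. With the arguments switched, `toy_set_parent(10, 11, false)` leaves 11 with no parent at all and makes 10 point at its own child. A second switched call for subxid 12 then finds the slot for 10 already occupied, which in this model is the same situation as an incorrect overwriteOK of "false".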