Hi,

On 20/01/18 00:52, Robert Haas wrote:
> On Fri, Jan 19, 2018 at 5:19 PM, Tomas Vondra
> <tomas.von...@2ndquadrant.com> wrote:
>> Regarding the HOT issue - I have to admit I don't quite see why A2
>> wouldn't be reachable through the index, but that's likely due to my
>> limited knowledge of the HOT internals.
>
> The index entries only point to the root tuple in the HOT chain. Any
> subsequent entries can only be reached by following the CTID pointers
> (that's why they are called "Heap Only Tuples"). After T1 aborts,
> we're still OK because the CTID link isn't immediately cleared. But
> after T2 updates the tuple, it makes A1's CTID link point to A3,
> leaving no remaining link to A2.
>
> Although in most respects PostgreSQL treats commits and aborts
> surprisingly symmetrically, CTID links are an exception. When T2
> comes to A1, it sees that A1's xmax is T1 and checks the status of T1.
> If T1 is still in progress, it waits. If T2 has committed, it must
> either abort with a serialization error or update A2 instead under
> EvalPlanQual semantics, depending on the isolation level. If T2 has
> aborted, it assumes that the CTID field of T1 is garbage nobody cares
> about, adds A3 to the page, and makes A1 point to A3 instead of A2.
> No record of the A1->A2 link is kept anywhere *precisely because* A2
> can no longer be visible to anyone.
>

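
To check my own understanding of the sequence above, here is a toy model of it in Python. This is only a sketch and not how heapam is implemented: the names (HeapTuple, ToyPage, hot_update, reachable) are made up for illustration, a "page" is just a dict, deletes are not modelled, and the waiting and EvalPlanQual cases are reduced to the bare minimum. The one thing it tries to show is that an updater which finds an aborted xmax simply overwrites the CTID link.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

IN_PROGRESS, COMMITTED, ABORTED = "in progress", "committed", "aborted"


@dataclass
class HeapTuple:
    name: str
    xmin: str                   # transaction that created this version
    xmax: Optional[str] = None  # transaction that updated it, if any
    ctid: Optional[str] = None  # next version in the chain, if any


class ToyPage:
    """Hypothetical stand-in for one heap page holding one HOT chain."""

    def __init__(self) -> None:
        self.tuples: Dict[str, HeapTuple] = {}
        self.xact: Dict[str, str] = {}         # xid -> status
        self.index_root: Optional[str] = None  # the only tuple the index knows

    def insert(self, xid: str, name: str) -> None:
        self.tuples[name] = HeapTuple(name, xmin=xid)
        self.index_root = name

    def hot_update(self, xid: str, new_name: str) -> None:
        cur = self.tuples[self.index_root]
        while cur.xmax is not None:
            status = self.xact[cur.xmax]
            if status == IN_PROGRESS:
                raise RuntimeError("would wait for " + cur.xmax)
            if status == ABORTED:
                break  # old xmax/ctid are treated as garbage and overwritten
            cur = self.tuples[cur.ctid]  # committed: move to the newer version
        self.tuples[new_name] = HeapTuple(new_name, xmin=xid)
        cur.xmax, cur.ctid = xid, new_name

    def reachable(self) -> List[str]:
        """Versions reachable from the index by following CTID links."""
        names, name = [], self.index_root
        while name is not None:
            names.append(name)
            name = self.tuples[name].ctid
        return names


page = ToyPage()
page.xact.update({"T0": COMMITTED, "T1": IN_PROGRESS, "T2": IN_PROGRESS})
page.insert("T0", "A1")
page.hot_update("T1", "A2")   # A1 -> A2
page.xact["T1"] = ABORTED
before = page.reachable()     # ['A1', 'A2']: link survives the abort
page.hot_update("T2", "A3")   # A1 -> A3, the A1 -> A2 link is gone
after = page.reachable()      # ['A1', 'A3']
print(before, after, "A2" in page.tuples)
```

In this model A2 is still physically on the page after T2's update, but nothing that starts from the index can get to it any more, which is exactly what hurts a catalog snapshot that still wants to see T1's changes.
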
I think this is the only real problem from your list for logical
decoding catalog snapshots. But it's indeed quite a bad one. Is there
something preventing us from removing the assumption that the CTID of
T1 is garbage nobody cares about? I guess turning off HOT for catalogs
is not an option :)

The general problem is that we have a couple of assumptions
(HeapTupleSatisfiesVacuum being one, what you wrote being another) that
tuples from aborted transactions are never read by anybody. But if we
want to add decoding of 2PC or transaction streaming, that's no longer
true, so I think we should try to remove that assumption (even if we do
it only for catalogs, since that's what we care about).

The other option would be to make sure 2PC decoding/tx streaming does
not read aborted transactions, but that would mean locking the
transaction every time we give control to the output plugin. Given that
the output plugin may do a network write, this would really mean
locking the transaction for an unbounded period of time. That does not
strike me as something we want to do; decoding should not interact with
frontend transaction management, definitely not this badly.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services