On Wed, Aug 21, 2019 at 4:44 AM Andres Freund <and...@anarazel.de> wrote:
> On 2019-08-20 21:02:18 +1200, Thomas Munro wrote:
> > 1. Anyone is allowed to try to read or write data at any UndoRecPtr
> > that has been allocated, through the buffer pool (though you'd usually
> > want to check it with UndoRecPtrIsDiscarded() first, and only rely on
> > the system I'm describing to deal with races).
> >
> > 2. ReadBuffer() might return InvalidBuffer. This can happen for a
> > cache miss, if the smgrread implementation wants to indicate that the
> > buffer has been discarded/truncated and that is expected (md.c won't
> > ever do that, but undofile.c can).
>
> Hm. This gives me a bit of a stomach ache. It somehow feels like a weird
> form of signalling. Can't quite put my finger on why it makes me feel
> queasy.
Well, if we're going to allow concurrent access and discarding, there
has to be some part of the system where you can discover that the
thing you wanted is gone. What's wrong with here?

Stepping back a bit... why do we have a new concept here? The reason
we don't have anything like this for relations currently is that we
don't have live references to blocks that are expected to be
concurrently truncated, due to heavyweight locking. But the whole
purpose of the undo system is to allow cheap truncation/discarding of
data that you *do* have live references to, and furthermore the
discarding is expected to be frequent. The present experiment is about
trying to do so without throwing our hands up and using a pessimistic
lock.

At one point Robert and I discussed some kind of scheme where you'd
register your interest in a range of the log before you begin reading
(some kind of range locks or hazard pointers), so that you would block
discarding in that range, but the system would still allow someone to
read in the middle of the log while the discard worker concurrently
discards non-overlapping data at the end. But I kept returning to the
idea that the buffer pool already has block-level range locking of
various kinds. You can register your interest in a range by pinning
the buffers. That's when you'll find out if the range is already gone.
We could add an extra layer of range locking around that, but it
wouldn't be any better, it'd just thrash your bus a bit more, and
require more complexity in the discard worker (it has to defer
discarding a bit, and either block or go away and come back later).
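From a reader's point of view the contract looks roughly like the
sketch below. This is only a sketch -- ReadUndoBuffer() and
UndoRecPtrToBlockNumber() are invented stand-ins, not the real
undoaccess.c interfaces -- the point is just the control flow of "try
to pin, and treat InvalidBuffer as discarded":

/*
 * Sketch only: ReadUndoBuffer() and UndoRecPtrToBlockNumber() are
 * stand-in names for whatever undoaccess.c/undofile.c really expose.
 */
static Buffer
pin_undo_block(UndoRecPtr urp)
{
	Buffer		buf;

	/* Cheap pre-check; a race with discarding can still happen after it. */
	if (UndoRecPtrIsDiscarded(urp))
		return InvalidBuffer;

	/*
	 * Goes through the shared buffer pool; undofile.c (unlike md.c) may
	 * report the block as discarded, in which case we get InvalidBuffer
	 * back instead of an error.
	 */
	buf = ReadUndoBuffer(UndoRecPtrToBlockNumber(urp));
	if (!BufferIsValid(buf))
		return InvalidBuffer;	/* discarded underneath us; caller decides */

	/*
	 * Pinned: the buffer can't be invalidated under us while we hold the
	 * pin, though it may later be marked BM_DISCARDED.
	 */
	return buf;
}

A writer stamping the previous transaction's header would have to do
the same dance, which is what makes the BM_DISCARDED case quoted next
interesting.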
> > 3. UndoLogDiscard() uses DiscardBuffer() to invalidate any currently
> > unpinned buffers, and marks as BM_DISCARDED any that happen to be
> > pinned right now, so they can't be immediately invalidated. Such
> > buffers are never written back and are eligible for reuse on the next
> > clock sweep, even if they're written into by a backend that managed to
> > do that when we were trying to discard.
>
> Hm. When is it legitimate for a backend to write into such a buffer? I
> guess that's about updating the previous transaction's next pointer? Or
> progress info?

Yes, previous transaction header's next pointer, and progress counter
during rollback. We're mostly interested in the next pointer here,
because the progress counter would normally not be updated at a time
when the page might be concurrently discarded. The exception to that
is a superuser running CALL pg_force_discard_undo() (a data-eating
operation designed to escape a system that can't successfully roll
back and gets stuck, blowing away not-yet-rolled-back undo records).

Here are some other ideas about how to avoid conflicts between
discarding and transaction header updates:

1. Lossy self-update-only: You could say that transactions are only
allowed to write to their own transaction header, and then have them
periodically update their own length in their own transaction header,
and then teach the discard worker that the length information is only
a starting point for a linear search for the next transaction based on
page header information. That removes complexity for writers, but adds
complexity and IO and CPU to the discard worker. Bleugh.

2. Strict self-update-only: We could update it as part of transaction
cleanup. That is, when you commit or abort, probably at some time when
your transaction is still advertised as running, you go and update
your own transaction header with the size. If you never reach that
stage, I think you can fix it during crash recovery, during the
startup scan that feeds the rollback request queues. That is, if you
encounter a transaction header with length == 0, it must be the final
one and its length is therefore known and can be updated, before you
allow new transactions to begin. There are some complications like
backends that exit without crashing, which I haven't thought about. As
Amit just pointed out to me, that means that the update is not folded
into the same buffer access as the next transaction, but perhaps you
can mitigate that by not updating your header if the next header will
be on the same page -- the next transaction can do it safely then
(this page with the insert pointer on it can't be discarded). As Dilip
just pointed out to me, it means that you always do an update that you
might never need to do if the transaction is discarded, to which I
have no answer. Bleugh.

3. Perhaps there is a useful middle ground between ideas 1 and 2: if
the stored length is 0, the discard worker will perform a scan of page
headers to compute the correct value, but if it's non-zero it will
consider it to be correct and trust that value. The extra work would
only happen after crash recovery or things like elog(FATAL, ...).

4. You could keep one extra transaction around all the time. That is,
because we know we only ever want to stamp the transaction header of
the previous transaction, don't let a transaction that hasn't been
stamped with a length yet be discarded. But now we waste more space!

5. You could figure out a scheme to advertise the block number of the
start of the previous transaction. You could have an LWLock that you
have to take to stamp the transaction header of the previous
transaction, and UndoLogDiscard() only has to take the lock if it
wants to discard a range that overlaps with that block. This avoids
contention for some workloads, but not others, so it seems like a half
measure, and again you still have to deal with InvalidBuffer when
reading. It's basically range locking; the buffer pool is already a
kind of range locking scheme!

These schemes are all about avoiding conflicts between discarding and
writing, but you'd still have to tolerate InvalidBuffer for reads (ie
reading zheap records) with this scheme, so I suppose you might as
well just treat updates the same and not worry about any of the above.

> > 5. Separating begin from discard allows the WAL logging for
> > UndoLogDiscard() to do filesystem actions before logging, and other
> > effects after logging, which have several nice properties if you work
> > through the various crash scenarios.
>
> Hm. ISTM we always need to log before doing some filesystem operation
> (see also my recent complaint Robert and I are discussing at the bottom
> of [1]). It's just that we can have a separate stage afterwards?
>
> [1]
> https://www.postgresql.org/message-id/CA%2BTgmoZc5JVYORsGYs8YnkSxUC%3DcLQF1Z%2BfcpH2TTKvqkS7MFg%40mail.gmail.com

I talked about this a bit with Robert and he pointed out that it's
probably not actually necessary to WAL-log these operations at all,
now that 'begin' and 'end' (= physical storage range) have been
separated from 'discard' and 'insert' (active undo data range).
Instead you could do it like this:

1. Maintain begin and end pointers in shared memory only, no WAL, no
checkpoint.

2. Compute their initial values by scanning the filesystem at startup
time.

3. Treat (logical) discard and insert pointers as today: WAL-logged
before the shared memory update, and checkpointed.

4. begin must be <= discard, and end must be >= insert, or else PANIC.

I'm looking into that.
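To make 2 and 4 a bit more concrete, the startup scan could look
something like the sketch below. Again, just a sketch: the segment
naming helper, UndoLogSegmentSize and the UndoLogControl fields used
here are simplified stand-ins, not what undolog.c actually has.

/*
 * Sketch only: parse_undo_segment_name(), UndoLogSegmentSize and the
 * UndoLogControl fields are invented for illustration.
 */
static void
recover_undo_physical_range(UndoLogControl *log, const char *dir)
{
	DIR		   *d;
	struct dirent *de;
	uint64		begin = 0;
	uint64		end = 0;
	bool		found = false;

	/* 2. Scan the undo log's directory for surviving segment files. */
	d = AllocateDir(dir);
	while ((de = ReadDir(d, dir)) != NULL)
	{
		uint64		segno;

		if (!parse_undo_segment_name(de->d_name, &segno))
			continue;			/* not a segment file */
		if (!found)
		{
			begin = segno * UndoLogSegmentSize;
			end = (segno + 1) * UndoLogSegmentSize;
			found = true;
		}
		else
		{
			begin = Min(begin, segno * UndoLogSegmentSize);
			end = Max(end, (segno + 1) * UndoLogSegmentSize);
		}
	}
	FreeDir(d);

	/* 1. Shared memory only: never WAL-logged, never checkpointed. */
	log->begin = found ? begin : log->discard;
	log->end = found ? end : log->insert;

	/* 4. The logical range must lie inside the physical range. */
	if (log->begin > log->discard || log->end < log->insert)
		elog(PANIC, "undo log physical range does not cover discard..insert");
}

The filesystem becomes the source of truth for the physical range,
and the WAL-logged discard/insert pointers are only cross-checked
against it rather than driving it.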
> > So now I'd like to get feedback on the sanity of this scheme. I'm not
> > saying it doesn't have bugs right now -- I've been trying to figure
> > out good ways to test it and I'm not quite there yet -- but the
> > concept. One observation I have is that there were already code paths
> > in undoaccess.c that can tolerate InvalidBuffer in recovery, due to
> > the potentially different discard timing for DO vs REDO. I think
> > that's a point in favour of this scheme, but I can see that it's
> > inconvenient to have to deal with InvalidBuffer whenever you read.
>
> FWIW, I'm far from convinced that those are currently quite right. See
> discussion pointed to above.

Yeah. It seems highly desirable to make it so that all decisions about
whether a write to an undo block is required or should be skipped are
made on the primary, so that WAL replay just does what it's told. I am
working on that.

--
Thomas Munro
https://enterprisedb.com