On Thu, Aug 25, 2011 at 1:55 AM, Markus Wanner <mar...@bluegap.ch> wrote:
>> One difference with snapshots is that only the latest snapshot is of
>> any interest.
>
> Theoretically, yes. But as far as I understood, you proposed the
> backends copy that snapshot to local memory. And copying takes some
> amount of time, possibly being interrupted by other backends which add
> newer snapshots... Or do you envision the copying to restart whenever a
> new snapshot arrives?
My hope (and it might turn out that I'm an optimist) is that even with a reasonably small buffer it will be very rare for a backend to experience a wraparound condition. For example, consider a buffer with ~6500 entries, approximately 64 * MaxBackends, the approximate size of the current subxip arrays taken in aggregate. I hypothesize that a typical snapshot on a running system is going to be very small - a handful of XIDs at most - because, on the average, transactions are going to commit in *approximately* increasing XID order and, if you take the regression tests as representative of a real workload, only a small fraction of transactions will have more than one XID. So it seems believable to think that the typical snapshot on a machine with max_connections=100 might only be ~10 XIDs, even if none of the backends are read-only. So the backend taking a snapshot only needs to be able to copy < ~64 bytes of information from the ring buffer before other backends write ~27k of data into that buffer, which would likely require hundreds of other commits. That seems vanishingly unlikely; memcpy() is very fast. If it does happen, you can recover by retrying, but it should be a once-in-a-blue-moon kind of thing. I hope.

Now, as the size of the snapshot gets bigger, things will eventually become less good. For example, if you had a snapshot with 6000 XIDs in it, then every commit would need to write over the previous snapshot and things would quickly deteriorate. But you can cope with that situation using the same mechanism we already use to handle big snapshots: toss out all the subtransaction IDs, mark the snapshot as overflowed, and just keep the toplevel XIDs. Now you've got at most ~100 XIDs to worry about, so you're back in the safety zone.
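To make the fallback concrete, here's a rough sketch of the "toss the subxids and mark overflowed" step. The struct below is a loose, illustrative stand-in for a snapshot (field names, array sizes, and the compact_snapshot() helper are my own assumptions, not the real SnapshotData definition or any actual PostgreSQL code):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/*
 * Illustrative stand-in, loosely modeled on a snapshot structure;
 * sizes assume max_connections=100 and 64 cached subxids per backend.
 */
typedef struct
{
    int           xcnt;           /* # of toplevel XIDs in xip[] */
    int           subxcnt;        /* # of subtransaction XIDs in subxip[] */
    bool          suboverflowed;  /* subxip[] discarded? */
    TransactionId xip[100];       /* toplevel XIDs, at most one per backend */
    TransactionId subxip[6400];   /* subtransaction XIDs */
} SnapshotSketch;

/*
 * If the snapshot is too large to push through the ring buffer safely,
 * discard the subtransaction XIDs and keep only the toplevel XIDs.
 * Visibility checks for subxids must then fall back to pg_subtrans.
 */
static void
compact_snapshot(SnapshotSketch *snap, int threshold)
{
    if (snap->xcnt + snap->subxcnt <= threshold)
        return;                   /* small enough; leave it alone */
    snap->subxcnt = 0;
    snap->suboverflowed = true;   /* readers must consult pg_subtrans */
}
```

A 6100-XID snapshot (100 toplevel + 6000 subxids) compacts down to the ~100 toplevel XIDs, at the cost of the extra pg_subtrans lookups discussed below.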
That's not ideal in the sense that you will cause more pg_subtrans lookups, but that's the price you pay for having a gazillion subtransactions floating around, and any system is going to have to fall back on some sort of mitigation strategy at some point. There's no useful limit on the number of subxids a transaction can have, so unless you're prepared to throw an unbounded amount of memory at the problem you're going to eventually have to punt.

It seems to me that the problem case is when you are just on the edge. Say you have 1400 XIDs in the snapshot. If you compact the snapshot down to toplevel XIDs, most of those will go away and you won't have to worry about wraparound - but you will pay a performance penalty in pg_subtrans lookups. On the other hand, if you don't compact the snapshot, it's not that hard to imagine a wraparound occurring - four snapshot rewrites could wrap the buffer. You would still hope that memcpy() could finish in time, but if you're rewriting 1400 XIDs with any regularity, it might not take that many commits to throw a spanner into the works. If the system is badly overloaded and the backend trying to take a snapshot gets descheduled for a long time at just the wrong moment, it doesn't seem hard to imagine a wraparound happening.

Now, it's not hard to recover from a wraparound. In fact, we can pretty easily guarantee that any given attempt to take a snapshot will suffer a wraparound at most once. The writers (who are committing) have to be serialized anyway, so anyone who suffers a wraparound can just grab the same lock in shared mode and retry its snapshot. Now concurrency decreases significantly, because no one else is allowed to commit until that guy has got his snapshot, but right now that's true *every time* someone wants to take a snapshot, so falling back to that strategy occasionally doesn't seem prohibitively bad.
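The detect-and-retry idea might look roughly like this. Everything here is an illustrative assumption on my part - the ring layout, the monotonic write counter used to detect being lapped, and all the names are invented for the sketch, and real code would additionally need memory barriers and the shared-mode lock acquisition on the retry path:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define RING_SIZE 6500            /* ~64 * MaxBackends XID slots */

typedef struct
{
    uint64_t      written;        /* total XIDs ever written (monotonic) */
    TransactionId xids[RING_SIZE];
} XidRingBuffer;

/*
 * Committing backend: append a snapshot's XIDs.  Writers are already
 * serialized by the commit path, so plain stores suffice in this sketch;
 * real code would publish 'written' with a memory barrier.
 */
static void
ring_write(XidRingBuffer *rb, const TransactionId *xids, int n)
{
    for (int i = 0; i < n; i++)
        rb->xids[(rb->written + i) % RING_SIZE] = xids[i];
    rb->written += n;
}

/*
 * Snapshot-taking backend: copy n XIDs starting at logical position
 * 'start'.  Returns false if writers lapped the buffer during the copy,
 * in which case the caller retries - at worst once more while holding
 * the commit lock in shared mode, which keeps further writers out.
 */
static bool
ring_read(const XidRingBuffer *rb, uint64_t start,
          TransactionId *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = rb->xids[(start + i) % RING_SIZE];
    /* Wrapped iff more than a full buffer was written past 'start'. */
    return (rb->written - start) <= RING_SIZE;
}
```

The key property is that detection is cheap: one counter comparison after the memcpy-style copy, with no locking at all on the fast path.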
However, you don't want it to happen very often, because even leaving aside the concurrency hit, it's double work: you have to try to take a snapshot, realize you've had a wraparound, and then retry. It seems pretty clear that with a big enough ring buffer the wraparound problem will become so infrequent as to be not worth worrying about. I'm theorizing that even with a quite small ring buffer the problem will still be infrequent enough not to worry about, but that might be optimistic. I think I'm going to need some kind of test case that generates very large, frequently changing snapshots.

Of course, even if wraparound turns out not to be a problem, there are other things that could scuttle this whole approach, but I think the idea has enough potential to be worth testing. If the whole thing crashes and burns I hope I'll at least learn enough along the way to design something better...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers