On 1 April 2016 at 14:46, Andres Freund <and...@anarazel.de> wrote:
>> However: It's safe for the slot state on the replica to be somewhat behind
>> the local replay from the master, so the slot state on the replica is older
>> than what it would've been at an equivalent time on the master... so long
>> as it's not so far behind that the replica has replayed vacuum activity
>> that removes rows still required by the slot's declared catalog_xmin. Even
>> then it should actually be OK in practice because the failing-over client
>> will always have replayed past that point on the master (otherwise
>> catalog_xmin couldn't have advanced on the master), so it'll always ask to
>> start replay at a point at or after the lsn where the catalog_xmin was
>> advanced to its most recent value on the old master before failover. It's
>> only safe for walsender based decoding though, since there's no way to
>> specify the startpoint for sql-level decoding.
>
> This whole logic fails entirely flat on its face by the fact that even
> if you specify a startpoint, we read older WAL and process the
> records. The startpoint filters every transaction that commits earlier
> than the "startpoint". But with a long-running transaction you still
> will have old changes being processed, which need the corresponding
> catalog state.

Right. Having read over those draft docs I see what you mean.

TL;DR: the only way I can see to do this now is full-on failover slots, or
something very similar that uses pluggable WAL + a slot save hook + moving
CheckPointReplicationSlots to the start of a checkpoint while WAL is still
writeable.

To rephrase, per my understanding: the client only specifies the point from
which it wants to start seeing decoded commits. Decoding itself starts from
the slot's restart_lsn: that's where accumulation of reorder buffer contents
begins, where snapshot building begins, and where relcache invalidation
information starts being collected. At restart_lsn no xact that is to be
emitted to the client may yet be in progress.

Decoding, whether or not the xacts will be fed to the output plugin
callbacks, requires access to the system catalogs. Therefore the
catalog_xmin reported by the slot must be >= the real effective catalog_xmin
of the heap, and must be valid at the restart_lsn, not just at the confirmed
flush point or the point the client specifies to resume fetching changes
from.

On the original copy of the slot on the pre-failover master the restart_lsn
would've been further ahead, as would the catalog_xmin, so catalog rows
needed for decoding from the replica's stale restart_lsn may already have
been purged and that vacuuming replayed on the replica. When we decode from
the old restart_lsn on the replica after failover we're not just doing
excess work: we'd be accumulating probably-incorrect invalidation
information for the xacts that we're decoding (but not sending to the
client), or just failing to find entries we expect to find in the caches and
dying.

If the restart_lsn was advanced on the master there must be another safe
decoding restart point after the one the old copy of the slot points to.
But we don't know where it is, and we have no way to know that we *are* in
such a post-failover situation, so we can't just scan forward looking for
it without looking up invalidation info for xacts we know we'll be throwing
away.

So it's necessary to ensure that the slot's restart_lsn and catalog_xmin are
advanced in a timely, consistent manner on the replica's copy of the slot,
at a point where no vacuum changes to the catalog that could remove needed
tuples have been replayed.
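For reference, the slot state we're trying to keep consistent on the replica
is roughly the following. This is a paraphrase of the relevant fields of
ReplicationSlotPersistentData, not a verbatim excerpt, so treat the details
as approximate:

    typedef struct
    {
        TransactionId catalog_xmin;     /* oldest catalog tuples the slot may
                                         * still need; must be valid at
                                         * restart_lsn, not just at
                                         * confirmed_flush */
        XLogRecPtr    restart_lsn;      /* where decoding restarts: snapshot
                                         * building, reorder buffering and
                                         * invalidation tracking all begin
                                         * here */
        XLogRecPtr    confirmed_flush;  /* last commit the client confirmed;
                                         * only commits after this get emitted
                                         * to the output plugin callbacks */
    } SlotStateSketch;

The point being that the first two have to stay safe on the replica even
though the client only ever talks about the third.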
The only way I can think of to do that really reliably right now, without
full failover slots, is to use the newly committed pluggable WAL mechanism
and add a hook to SaveSlotToPath() so slot info can be captured, injected
into WAL, and replayed on the replica (rough sketch below). It'd also be
necessary to move CheckPointReplicationSlots() out of CheckPointGuts() to
the start of a checkpoint/restartpoint, when WAL writing is still permitted,
like the failover slots patch does. Basically, failover slots as a plugin
using a hook, without the additions to the base backup commands and the
backup label.

I'm not convinced that's at all better than full failover slots, and I'm not
sure I can make a solid argument for the general-purpose utility of the hook
it'd need, but at least it'd avoid committing core to that approach.

I'd really hate for 9.6 to go out with - still - no way to use logical
decoding in a basic, bog-standard HA/failover environment. That severely
limits its utility and is becoming a major drag on practical use of the
feature. It's a difficulty, given that the failover slots patch isn't
especially trivial and you've shown that lazy sync of slot state is not
sufficient.

>> It's also OK for the slot state to be *ahead* of the replica's replay
>> position, i.e. from the future. restart_lsn and catalog_xmin will be higher
>> than they were on the master at the same point in time, but that doesn't
>> matter at all, since the client will always be requesting to start from
>> that point or later, having replayed up to there on the old master before
>> failover.
>
> I don't think that's true. See my point above about startpoint.

My explanation is certainly wrong, but I suspect the result of doing it is
still actually OK (though pretty useless, per below).

The restart_lsn from the newer copy of the slot is, as you said, a point at
which we know we can reconstruct visibility info, and also a point where no
xact that we want to decode was already in progress, since we have to
accumulate those in the reorder buffer. The client will be requesting (via
its local replication origin progress tracking or whatever) that logical
decoding start emitting xact contents at a point that would've been
reasonable with the newer restart_lsn from the copied slot, since otherwise
it wouldn't have advanced on the master. There's no problem with missing
catalog entries; we'll have extras we don't need until vacuum catches up,
but that's all.

The client's requested start point should always be reasonable w.r.t. the
latest copy of the master's slot, so long as the replica has replayed up to
or past the point where the client wants to start receiving decoded commits.
Nor should any resources it needs have been removed from the replica. So
long as the confirmed lsn (and whatever point the client requests replay
resume from) is less than the timeline switchpoint to the current timeline
it should be OK.

If I'm right about that, then syncing slot state over such that the slot
data is always the same as or newer than it was on the master at the same
point in history should be OK. We could grab the master's xlog insert
pointer, snapshot its slots, sync 'em over, and advance the recovery target
lsn on the replica to allow it to replay up to that point in the master's
history, repeating periodically.

However, it'd be hideous and it'd be completely impossible to use for sync
rep, so it's pretty useless even if it would work.
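To be concrete about the hook idea above, what I have in mind is roughly the
following. This is purely illustrative: slot_save_hook and its typedef don't
exist anywhere today, and a real patch would need more than this.

    /* hypothetical addition to slot.h */
    typedef void (*slot_save_hook_type) (ReplicationSlot *slot);
    extern slot_save_hook_type slot_save_hook;

    /* hypothetical addition at the end of SaveSlotToPath() in slot.c, after
     * the slot state has been durably written out: give a plugin the chance
     * to capture that state and inject it into WAL so the replica can apply
     * it to its own copy of the slot on replay. */
    if (slot_save_hook != NULL)
        slot_save_hook(slot);

The hook itself is trivial; the hard part is guaranteeing it runs while WAL
can still be written, which is why CheckPointReplicationSlots() would have
to move as described above.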
>> I think the comment would be less necessary, and could be moved up to the
>> CreateDecodingContext call, if we passed the slot's confirmed_flush
>> explicitly to CreateDecodingContext instead of passing InvalidXLogRecPtr
>> and having the CreateDecodingContext call infer the start point. That way
>> what's going on would be a lot clearer when reading the code.
>
> How would that solve anything? We still need to process the old records,
> hence the xlogreader needs to be instructed to read them. Hence
> logicalfuncs.c needs to know about it.

Huh? All that'd do is make what already happens more visible. Instead of
passing InvalidXLogRecPtr as start_lsn and having it transparently replaced
with the confirmed flush LSN inside CreateDecodingContext, we'd pass
slot->data.confirmed_flush directly. IMO it'd just make it a little easier
to follow what's going on.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services