On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:
> Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
> correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining.  (Whether I should instead be
imagining something else is the important question.)

> I'm quite happy to find a better one. But I cannot spend a lot of time
> writing something to have it completely knocked back because the scope just
> got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that.  I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are).  Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally.  I'd like to push back a bit on the underlying assumption,
though: I don't think that there was ever an agreed-upon design on
this list for failover slots before the first patch showed up.  Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

> A "failover slot" is identified by a  field in the slot struct and exposed
> in pg_replication_slots. It can be null (not a failover slots). It can
> indicate that the slot was created locally and is "owned" by this node; all
> downstreams should mirror it. It can also indicate that it is a mirror of an
> upstream, in which case clients may not replay from it until it's promoted
> to an owned slot and ceases to be mirrored. Attempts to replay from a
> mirrored slot just ERROR and will do so even once decoding on standby is
> supported.

+1
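
Just to confirm I'm reading that correctly: I picture the new field as a
small per-slot state, roughly like the sketch below (all names invented
here, not taken from any patch):

/* Hypothetical sketch of the per-slot failover state described above. */
typedef enum FailoverSlotState
{
    FS_NONE,    /* ordinary slot, not part of failover at all */
    FS_OWNED,   /* created locally; downstreams should mirror it */
    FS_MIRROR   /* copy of an upstream slot; replay refused until promoted */
} FailoverSlotState;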

> This promotion happens automatically if a standby is promoted to a master,
> and can also be done manually via sql function call or walsender command to
> allow for an internal promotion within a cascading replica chain.

+1.
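
And the promotion path, as I understand it, is basically just flipping that
state, with replay refused beforehand; a toy version of the two checks
(again, invented names, reusing the made-up FailoverSlotState from above):

/* Illustrative only; not real PostgreSQL code. */
static bool
replay_allowed(FailoverSlotState state)
{
    /* a mirror slot cannot be replayed from until it is promoted */
    return state != FS_MIRROR;
}

static void
promote_failover_slot(FailoverSlotState *state)
{
    /* at standby promotion, or via SQL function / walsender command */
    if (*state == FS_MIRROR)
        *state = FS_OWNED;
}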

> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known to be deleted, and
> the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?
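
To make that concrete, the sync step I'm imagining is something like the
following (structures and names invented purely for illustration); the
unlisted-but-in-use case is the branch I'm asking about:

#include <stdbool.h>
#include <string.h>

typedef struct MirrorSlot
{
    char    name[64];
    bool    in_use;         /* someone is attached to the slot right now */
    bool    drop_pending;
} MirrorSlot;

/*
 * Connect-time sync: any local mirror slot the upstream did not list is
 * assumed dropped upstream and marked for removal here.
 */
static void
sync_mirror_slots(MirrorSlot *local, int nlocal,
                  const char *const upstream[], int nupstream)
{
    for (int i = 0; i < nlocal; i++)
    {
        bool    listed = false;

        for (int j = 0; j < nupstream; j++)
            if (strcmp(local[i].name, upstream[j]) == 0)
                listed = true;

        if (!listed)
            local[i].drop_pending = true;   /* even if in_use is set? */
    }
}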

> The upstream walsender then sends periodic slot state updates while
> connected, so replicas can advance their mirror slots, and in turn send
> hot_standby_feedback that gets applied to the physical replication slot used
> by the standby, freeing resources held for the slots on the master.

+1.
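
For what it's worth, the per-slot payload I imagine in those periodic
updates is along these lines (the struct is invented; the fields just
mirror what pg_replication_slots already exposes):

#include <stdint.h>

typedef struct FailoverSlotUpdate
{
    char        slot_name[64];
    uint64_t    restart_lsn;            /* XLogRecPtr in the real thing */
    uint64_t    confirmed_flush_lsn;    /* ditto */
    uint32_t    catalog_xmin;           /* TransactionId in the real thing */
} FailoverSlotUpdate;

The downstream would apply each of these to its matching mirror slot and
then let the usual hot_standby_feedback machinery report the resulting
xmin/catalog_xmin back up through its physical slot.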

> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can; it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.

I'm not sure that this actually prevents the problem you describe.  It
also seems really complicated.  Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).
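
For the record, the part I think I do follow is the gating rule on the
creating node; if that reading is right, it reduces to something like this
(sketch only, invented names):

#include <stdbool.h>

typedef struct OwnedSlotSketch
{
    /* set once a slot-state update from the master lists this slot,
     * i.e. once the master is known to hold the matching up-mirror */
    bool    confirmed_by_master;
} OwnedSlotSketch;

static bool
slot_safe_to_use(const OwnedSlotSketch *slot)
{
    /* until then the slot's catalog_xmin is only tentative */
    return slot->confirmed_by_master;
}

What I'm less sure of is whether that confirmation really closes the window
you describe while the catalog_xmin is still propagating up the chain.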

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

