Re: [HACKERS] WIP: Failover Slots

2017-09-06 Thread Craig Ringer
On 14 August 2017 at 11:56, Craig Ringer  wrote:

>
> I don't want to block failover slots on decoding on standby just because
> decoding on standby would be nice to have.
>

However, during discussion with Thomas Munro a point has come up that does
block failover slots as currently envisioned - silent timeline divergence.
It's a solid reason why the current design and implementation is
insufficient to solve the problem. This issue exists both with the original
failover slots and with the model Robert and I were discussing.

Say a decoding client has replayed from master up to commit of xid 42 at
1/1000 and confirmed flush, then a failover slots standby of the master is
promoted. The standby has only received WAL from the failed master up to
1/500 with most recent xid 20. Now the standby does some other new xacts,
pushing xid up to 30 at 1/1000 then continuing to insert until xid 50 at
lsn 1/2000.

Then the logical client reconnects. The logical client will connect to the
failover slot fine, and start replay. But it'll ask for replay to start at
1/1000. The standby will happily fast-forward the slot (as it should), and
start replay after 1/1000.

But now we have silent divergence in timelines. The logical replica has
received and committed xacts 20...42 at lsn 1/500 through 1/1000, but these
are not present on the promoted master. And the replica has skipped over
the new-master's xids 20...30 with lsns 1/500 through 1/1000, so they're
present on the new master but not the replica.

IMO, this shows that not including the timeline in replication origins was
a bit of a mistake, since we'd trivially detect this if they were included
- but it's a bit late now.  And anyway, detection would just mean logical
rep would break, which doesn't help much.
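
To make that concrete, here's a rough standalone sketch of the check that
timeline-aware origins would permit. Everything here is illustrative, not
PostgreSQL's actual types or API; LSNs use the usual X/Y hex notation packed
into 64 bits:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TimeLineID;
typedef uint64_t XLogRecPtr;            /* e.g. 1/1000 == 0x100001000 */

typedef struct TimelineHistoryEntry
{
    TimeLineID  tli;
    XLogRecPtr  switchpoint;            /* first LSN NOT on this timeline */
} TimelineHistoryEntry;

/*
 * Does the client's confirmed position (origin_tli, origin_lsn) lie on
 * this server's timeline history? If not, the client replayed WAL this
 * server never had, and we should error rather than silently
 * fast-forwarding the slot.
 */
static bool
origin_on_our_history(TimeLineID origin_tli, XLogRecPtr origin_lsn,
                      const TimelineHistoryEntry *history, int nentries)
{
    for (int i = 0; i < nentries; i++)
    {
        if (history[i].tli == origin_tli)
            return origin_lsn <= history[i].switchpoint;
    }
    return false;
}

int
main(void)
{
    /* The promoted standby above forked from timeline 1 at 1/500 */
    TimelineHistoryEntry history[] = {
        { 1, 0x100000500 },             /* tli 1, valid up to 1/500  */
        { 2, UINT64_MAX }               /* tli 2, current, no switch */
    };

    /* Client confirmed tli 1 at 1/1000: past the fork point */
    printf("diverged: %s\n",
           origin_on_our_history(1, 0x100001000, history, 2) ? "no" : "yes");
    return 0;
}

The point is only that a (timeline, lsn) pair can be validated against a
timeline history, while a bare lsn cannot.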

The simplest fix, but rather limited, is to require that failover
candidates be in synchronous_standby_names, and delay ReorderBufferCommit
sending the actual commit message until all peers in s_s_n confirm flush of
the commit lsn. But that's not much good if you want sync rep for your
logical connections too, and is generally a hack.

A more general solution requires that masters be told which peers are
failover candidates, so they can ensure ordering between logical decoding
and physical failover candidates. Which effectively adds another kind of
sync rep, where we do "wait for physical failover candidates to flush, and
only then allow logical decoding". This actually seems pretty practical
with the design Robert and I discussed, but it's definitely an expansion in
scope.
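
As a sketch of that ordering rule in standalone form (real code would live
around ReorderBufferCommit() and share state with the walsenders; all names
here are illustrative):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t XLogRecPtr;

typedef struct PhysicalCandidate
{
    const char *name;           /* member of the failover-candidate set */
    XLogRecPtr  flush_lsn;      /* last LSN it confirmed flushing */
} PhysicalCandidate;

/* True once every failover candidate has durably flushed commit_lsn, at
 * which point it is safe to let a logical client see (and confirm) it. */
static bool
failover_candidates_flushed(const PhysicalCandidate *cands, int n,
                            XLogRecPtr commit_lsn)
{
    for (int i = 0; i < n; i++)
    {
        if (cands[i].flush_lsn < commit_lsn)
            return false;
    }
    return true;
}

int
main(void)
{
    PhysicalCandidate cands[] = { { "standby1", 0x100000500 } };

    /* Decoded commit at 1/1000, candidate only flushed to 1/500: wait. */
    printf("may send commit: %s\n",
           failover_candidates_flushed(cands, 1, 0x100001000) ? "yes" : "no");
    return 0;
}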

Alternately, we could require the decoding clients to keep an eye on the
flush/replay positions of all failover candidates and delay commit+confirm
of decoded xacts until the upstream's failover candidates have received and
flushed up to that lsn. That starts to look a lot like a decoding-on-standby
based model for logical failover, where the downstream maintains
slots on each failover candidate upstream.
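
The client-side variant could be done today with plain SQL, along these
lines (a minimal libpq sketch; it assumes PostgreSQL 10 names such as
pg_stat_replication.flush_lsn and pg_wal_lsn_diff(), and the connstring,
application_name and LSN are placeholders):

#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

/* Has failover candidate 'app_name' flushed up to commit_lsn, e.g. "1/1000"? */
static int
candidate_flushed(PGconn *master, const char *app_name, const char *commit_lsn)
{
    const char *params[2] = { commit_lsn, app_name };
    PGresult   *res;
    int         ok;

    res = PQexecParams(master,
                       "SELECT pg_wal_lsn_diff(flush_lsn, $1) >= 0"
                       "  FROM pg_stat_replication"
                       " WHERE application_name = $2",
                       2, NULL, params, NULL, NULL, 0);
    ok = PQresultStatus(res) == PGRES_TUPLES_OK &&
         PQntuples(res) == 1 &&
         strcmp(PQgetvalue(res, 0, 0), "t") == 0;
    PQclear(res);
    return ok;
}

int
main(void)
{
    PGconn *master = PQconnectdb("dbname=postgres");    /* placeholder */

    if (PQstatus(master) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(master));
        return 1;
    }

    /* Delay confirming the decoded commit at 1/1000 until the known
     * failover candidate reports flushing past it. */
    while (!candidate_flushed(master, "standby1", "1/1000"))
        ;                       /* a real client would sleep and retry */

    PQfinish(master);
    return 0;
}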

So yeah. More work needed here. Even if we suddenly decided the original
failover slots model was OK, it's not sufficient to fully solve the problem.

(It's something I'd thought about for BDR failover, but never applied to failover
slots: the problem of detecting or preventing divergence when the logical
client is ahead of physical receive at the time the physical standby is
promoted.)

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2017-08-13 Thread Craig Ringer
On 12 August 2017 at 08:03, Andres Freund  wrote:

> On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> > I actually think failover slots are quite desirable, especially now
> > that we've got logical replication in core.  In a review of this
> > thread I don't see anyone saying otherwise.  The debate has really
> > been about the right way of implementing that.
>
> Given that I presumably was one of the people pushing back more
> strongly: I agree with that.  Besides disagreeing with the proposed
> implementation our disagreements solely seem to have been about
> prioritization.
>
> I still think we should have a halfway agreed upon *design* for logical
> failover, before we introduce a concept that's quite possibly going to
> be incompatible with that, however. But that doesn't mean it has to be
> submitted/merged to core.
>

How could it be incompatible? The idea here is to make physical failover
transparent to logical decoding clients. That's not meant to sound
confrontational; I mean that I can't personally see any way it would be
incompatible, and could use your ideas.

I understand that it might be *different* and you'd like to see more
closely aligned approaches that work more similarly, for which we first
need to know more clearly how logical failover will look. But it's hard not
to also see this as delaying and blocking until your preferred approach via
pure logical rep and logical failover gets in, and physical failover can be
dismissed with "we don't need that anymore". I'm sure that's not your
intent; I just struggle not to see it that way when there's always another
reason not to proceed on this problem because of a loosely related
development effort on another one.

> I think there's a couple design goals we need to agree upon, before
> going into the weeds of how exactly we want this to work. Some of the
> axes I can think of are:
>
> - How do we want to deal with cascaded setups, do slots have to be
>   available everywhere, or not?
>

Personally, I don't care either way.


> - What kind of PITR integration do we want? Note that simple WAL based
>   slots do *NOT* provide proper PITR support, there's not enough
>   interlock easily available (you'd have to save slots at the end, then
>   increment minRecoveryLSN to a point later than the slot saving)
>

Interesting. I haven't fully understood this, but I think I see what you're
getting at.

As outlined in the prior mail, I'd like to have working PITR with logical
slots but think it's pretty niche as it can't work usefully without plenty
of co-operation from the rest of the logical replication software in use.
You can't just restore and resume normal operations. So I don't think it's
worth making it a priority.

It's possible to make PITR safe with slots by blocking further advance of
catalog_xmin on the running master for the life of the PITR base backup
using a slot for retention. There's plenty of room for operator error
until/unless we add something like catalog_xmin advance xlog'ing, but it
can be done now with external tools if you're careful. Details in the prior
mail.
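
For the record, the guard-slot workaround needs nothing more than the
existing slot SQL functions; e.g. (a libpq sketch; the slot name and
connstring are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *master = PQconnectdb("dbname=postgres");  /* placeholder */
    PGresult *res;

    /* Before taking the base backup / SAN snapshot: create a slot whose
     * only job is to pin restart_lsn and catalog_xmin on the master. */
    res = PQexec(master,
                 "SELECT pg_create_logical_replication_slot("
                 "'pitr_guard', 'test_decoding')");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        fprintf(stderr, "create failed: %s", PQerrorMessage(master));
    PQclear(res);

    /* ... the snapshot lives here: a PITR restore of it replays WAL in
     * which no still-needed catalog tuples were vacuumed away ... */

    /* Only once the snapshot is discarded: */
    res = PQexec(master, "SELECT pg_drop_replication_slot('pitr_guard')");
    PQclear(res);
    PQfinish(master);
    return 0;
}

The room for operator error mentioned above is exactly this manual pairing
of create and drop with the snapshot's lifetime.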

I don't think PITR for logical slots is important given there's a
workaround and it's not simple to actually do anything with it if you have
it.


> - How much divergence are we going to accept between logical decoding on
>   standbys, and failover slots. I'm probably a lot closer to zero than
>   Craig is.
>

They're different things to me, but I think you're asking "to what extent
should failover slots functionality be implemented strictly on top of
decoding on standby?"

"Failover slots" provides a mechanism by which a logical decoding client
can expect a slot it creates on a master (or physical streaming replica
doing decoding on standby) to continue to exist. The client can ignore
physical HA and promotions of the master, which can continue to be managed
using normal postgres tools. It's the same as, say, an XA transaction
manager expecting that if your master dies and you fail over to a standby,
the TM shouldn't have to have been doing special housekeeping on the
promotion candidate before promotion in order for 2PC to continue to work.
It Just Works.

Logical decoding on standby is useful with, or without, failover slots, as
you can use it to extract data from a replica; and now that decoding
timeline following is in, a decoding connection on a replica will survive
promotion to master.

But in addition to its main purpose of allowing logical decoding from a
standby server to offload work, it can be used to implement client-managed
support for failover to physical replicas. For this, the client must have
an inventory of promotion-candidates of the master and their connstrings so
it can maintain slots on them too. The client must be able to connect to
all promotion-candidates and advance their slots via decoding along with
the master slots it's actually replaying from. If a client isn't "told"
about a promotion candidate, decoding will break when 

Re: [HACKERS] WIP: Failover Slots

2017-08-11 Thread Andres Freund
On 2017-08-02 16:35:17 -0400, Robert Haas wrote:
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.

Given that I presumably was one of the people pushing back more
strongly: I agree with that.  Besides disagreeing with the proposed
implementation our disagreements solely seem to have been about
prioritization.

I still think we should have a halfway agreed upon *design* for logical
failover, before we introduce a concept that's quite possibly going to
be incompatible with that, however. But that doesn't mean it has to be
submitted/merged to core.


> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.

> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above.  It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.


I think there's a couple design goals we need to agree upon, before
going into the weeds of how exactly we want this to work. Some of the
axes I can think of are:

- How do we want to deal with cascaded setups, do slots have to be
  available everywhere, or not?
- What kind of PITR integration do we want? Note that simple WAL based
  slots do *NOT* provide proper PITR support, there's not enough
  interlock easily available (you'd have to save slots at the end, then
  increment minRecoveryLSN to a point later than the slot saving)
- How much divergence are we going to accept between logical decoding on
  standbys, and failover slots. I'm probably a lot closer to zero than
  Craig is.
- How much divergence are we going to accept between infrastructure for
  logical failover, and logical failover via failover slots (or however
  we're naming this)? Again, I'm probably a lot closer to zero than
Craig is.


Greetings,

Andres Freund




Re: [HACKERS] WIP: Failover Slots

2017-08-10 Thread Craig Ringer
On 11 August 2017 at 01:02, Robert Haas  wrote:


> Well,
> anybody's welcome to write code without discussion and drop it to the
> list, but if people don't like it, that's the risk you took by not
> discussing it first.
>

Agreed, patches materializing doesn't mean they should be committed, and
there wasn't prior design discussion on this.

It can be hard to elicit that discussion without a patch, though clearly
not always; we're doing a good job of it here.


> > When a replica connects to an upstream it asks via a new walsender msg "send
> > me the state of all your failover slots". Any local mirror slots are
> > updated. If they are not listed by the upstream they are known deleted, and
> > the mirror slots are deleted on the downstream.
>
> What about slots not listed by the upstream that are currently in use?
>

Yes, it'll also need to send a list of its local owned and up-mirrored
failover slots to the upstream so the upstream can create them or update
their state.

> > There's one big hole left here. When we create a slot on a cascading leaf or
> > inner node, it takes time for hot_standby_feedback to propagate the needed
> > catalog_xmin "up" the chain. Until the master has set the needed
> > catalog_xmin on the physical slot for the closest branch, the inner node's
> > slot's catalog_xmin can only be tentative pending confirmation. That's what
> > a whole bunch of gruesomeness in the decoding on standby patch was about.
> >
> > One possible solution to this is to also mirror slots "up", as you alluded
> > to: when you create an "owned" slot on a replica, it tells the master at
> > connect time / slot creation time "I have this slot X, please copy it up the
> > tree". The slot gets copied "up" to the master via cascading layers with a
> > different failover slot type indicating it's an up-mirror. Decoding clients
> > aren't allowed to replay from an up-mirror slot and it cannot be promoted
> > like a down-mirror slot can, it's only there for resource retention. A node
> > knows its owned slot is safe to actually use, and is fully created, when it
> > sees the walsender report it in the list of failover slots from the master
> > during a slot state update.
>
> I'm not sure that this actually prevents the problem you describe.  It
> also seems really complicated.  Maybe you can explain further; perhaps
> there is a simpler solution (or perhaps this isn't as complicated as I
> currently think it is).



It probably sounds more complex than it is. A slot is created tentatively
and marked not ready to actually use yet when created on a standby. It
flows "up" to the master where it's created as permanent/ready. The
permanent/ready state flows back down to the creator.

When we see a temp slot become permanent we copy the
restart_lsn/catalog_xmin/confirmed_flush_lsn from the upstream slot in case
the master had to advance them from our tentative values when it created
the slot. After that, slot state updates only flow "out" from the owner: up
the tree for up-mirror slots, down the tree for down-mirror slots.
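
In rough data-structure terms (purely illustrative, not an existing
PostgreSQL struct):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

typedef enum FailoverSlotKind
{
    FS_NONE,            /* ordinary slot, not failover-enabled         */
    FS_OWNED,           /* created here; mirrored up and down the tree */
    FS_UP_MIRROR,       /* retention-only copy on an ancestor node     */
    FS_DOWN_MIRROR      /* copy on a descendant; promotable, no replay */
} FailoverSlotKind;

typedef struct FailoverSlot
{
    char             name[64];
    FailoverSlotKind kind;
    bool             permanent;     /* false while creation is tentative */
    XLogRecPtr       restart_lsn;
    XLogRecPtr       confirmed_flush_lsn;
    TransactionId    catalog_xmin;
} FailoverSlot;

/*
 * Called when the master's slot-state list first reports our owned slot
 * as permanent: adopt the master's values, since it may have had to
 * advance them past our tentative ones while the up-mirror chain was
 * being created. From here on, updates flow outward from the owner only.
 */
void
owned_slot_confirmed(FailoverSlot *local, const FailoverSlot *from_master)
{
    local->permanent = true;
    local->restart_lsn = from_master->restart_lsn;
    local->catalog_xmin = from_master->catalog_xmin;
    local->confirmed_flush_lsn = from_master->confirmed_flush_lsn;
}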

Diagram may help. I focused only on the logical slot created on standby
case, since I think we're happy with the rest already and I don't want to
complicate it.

GMail will probably HTMLize this, sorry:


            Phys rep         Phys rep
            using phys       using phys
            slot "B"         slot "C"
          +---+      +---+      +---+
   T      | A | <----| B | <----| C |
   I      +---+      +---+      +---+
   M        |          |          |
   E        |          |          | CREATEs logical slot "X"
   |        |          |          | ("owned") as temp slot
   v        |          | <--------+
            |          | Creates up-mirror slot "X"
            |          | linked to phys slot "C",
            |          | marked temp
            | <--------+          |
            | Creates up-mirror   |   CLIENT attempts to
            | slot "X" linked     |   decode from "X"
            | to phys slot "B",   | <------------------
            | marked permanent    |
            |          |          |   ERROR: slot X
            +--------> |          |   still being ...
            |          | Sees up-mirror

Re: [HACKERS] WIP: Failover Slots

2017-08-10 Thread Robert Haas
On Thu, Aug 10, 2017 at 2:38 AM, Craig Ringer  wrote:
> Yep, so again, you're pushing slots "up" the tree, by name, with a 1:1
> correspondence, and using globally unique slot names to manage state.

Yes, that's what I'm imagining.  (Whether I should instead be
imagining something else is the important question.)

> I'm quite happy to find a better one. But I cannot spend a lot of time
> writing something to have it completely knocked back because the scope just
> got increased again and now it has to do more, so it needs another rewrite.

Well, I can't guarantee anything about that.  I don't tend to argue
against designs to which I myself previously agreed, but other people
may, and there's not a lot I can do about that (although sometimes I
try to persuade them that they're wrong, if I think they are).  Of
course, sometimes you implement something and it doesn't look as good
as you thought it would; that's a risk of software development
generally.  I'd like to push back a bit on the underlying assumption,
though: I don't think that there was ever an agreed-upon design on
this list for failover slots before the first patch showed up.  Well,
anybody's welcome to write code without discussion and drop it to the
list, but if people don't like it, that's the risk you took by not
discussing it first.

> A "failover slot" is identified by a field in the slot struct and exposed
> in pg_replication_slots. It can be null (not a failover slots). It can
> indicate that the slot was created locally and is "owned" by this node; all
> downstreams should mirror it. It can also indicate that it is a mirror of an
> upstream, in which case clients may not replay from it until it's promoted
> to an owned slot and ceases to be mirrored. Attempts to replay from a
> mirrored slot just ERROR and will do so even once decoding on standby is
> supported.

+1

> This promotion happens automatically if a standby is promoted to a master,
> and can also be done manually via sql function call or walsender command to
> allow for an internal promotion within a cascading replica chain.

+1.

> When a replica connects to an upstream it asks via a new walsender msg "send
> me the state of all your failover slots". Any local mirror slots are
> updated. If they are not listed by the upstream they are known deleted, and
> the mirror slots are deleted on the downstream.

What about slots not listed by the upstream that are currently in use?

> The upstream walsender then sends periodic slot state updates while
> connected, so replicas can advance their mirror slots, and in turn send
> hot_standby_feedback that gets applied to the physical replication slot used
> by the standby, freeing resources held for the slots on the master.

+1.

> There's one big hole left here. When we create a slot on a cascading leaf or
> inner node, it takes time for hot_standby_feedback to propagate the needed
> catalog_xmin "up" the chain. Until the master has set the needed
> catalog_xmin on the physical slot for the closest branch, the inner node's
> slot's catalog_xmin can only be tentative pending confirmation. That's what
> a whole bunch of gruesomeness in the decoding on standby patch was about.
>
> One possible solution to this is to also mirror slots "up", as you alluded
> to: when you create an "owned" slot on a replica, it tells the master at
> connect time / slot creation time "I have this slot X, please copy it up the
> tree". The slot gets copied "up" to the master via cascading layers with a
> different failover slot type indicating it's an up-mirror. Decoding clients
> aren't allowed to replay from an up-mirror slot and it cannot be promoted
> like a down-mirror slot can, it's only there for resource retention. A node
> knows its owned slot is safe to actually use, and is fully created, when it
> sees the walsender report it in the list of failover slots from the master
> during a slot state update.

I'm not sure that this actually prevents the problem you describe.  It
also seems really complicated.  Maybe you can explain further; perhaps
there is a simpler solution (or perhaps this isn't as complicated as I
currently think it is).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2017-08-10 Thread Craig Ringer
On 9 August 2017 at 23:42, Robert Haas  wrote:

> On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer 
> wrote:
> >> - When a standby connects to a master, it can optionally supply a list
> >> of slot names that it cares about.
> >
> > Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
> > people right now who want the ability to promote a PITR-recovered snapshot
> > into place of a logical replication master and have downstream peers replay
> > from it. It's more complex than that, as there's a resync process required
> > to recover changes the failed node had sent to other peers but isn't
> > available in the WAL archive, but that's the gist.
> >
> > If you have a 5TB database do you want to run an extra replica or two
> > because PostgreSQL can't preserve slots without a running, live replica?
> > Your SAN snapshots + WAL archiving have been fine for everything else so
> > far.
>
> OK, so what you're basically saying here is that you want to encode
> the failover information in the write-ahead log rather than passing it
> at the protocol level, so that if you replay the write-ahead log on a
> time delay you get the same final state that you would have gotten if
> you had replayed it immediately.  I hadn't thought about that
> potential advantage, and I can see that it might be an advantage for
> some reason, but I don't yet understand what the reason is.  How would
> you imagine using any version of this feature in a PITR scenario?  If
> you PITR the master back to an earlier point in time, I don't see how
> you're going to manage without resyncing the replicas, at which point
> you may as well just drop the old slot and create a new one anyway.
>


I've realised that it's possible to work around it in app-space anyway. You
create a new slot on a node before you snapshot it, and you don't drop this
slot until you discard the snapshot. The existence of this slot ensures
that any WAL generated by the node (and replayed by PITR after restore)
cannot clobber needed catalog_xmin. If we xlog catalog_xmin advances or
have some other safeguard in place, which we need for logical decoding on
standby to be safe anyway, then we can fail gracefully if the user does
something dumb.

So no need to care about this.

(What I wrote previously on this was):

You definitely can't just PITR restore and pick up where you left off.

You need a higher level protocol between replicas to recover. For example,
in a multi-master configuration, this can be something like (simplified):

* Use the timeline history file to find the lsn at which we diverged from
our "future self", the failed node
* Connect to the peer and do logical decoding, with a replication origin
filter for "originating from me", for xacts from the divergence lsn up to
the peer's current end-of-wal.
* Reset peer's replication origin for us to our new end-of-wal, and resume
replication

To enable that to be possible, since we can't rewind slots once confirmed
advanced, maintain a backup slot on the peer corresponding to the
point-in-time at which a snapshot was taken.

For most other situations there is little benefit vs just re-creating the
slot before you permit user-initiated write xacts to begin on the restored
node.

I can accept an argument that "we" as pgsql-hackers do not consider this
something worth caring about, should that be the case. It's niche enough
that you could argue it doesn't have to be supportable in stock postgres.


> Maybe you're thinking of a scenario where we PITR the master and also
> use PITR to rewind the replica to a slightly earlier point?


That can work, but must be done in lock-step. You have to pause apply on
both ends for long enough to snapshot both, otherwise the replication
origins on one end get out of sync with the slots on the other.

Interesting, but I really hope nobody's going to need to do it.


> But I
> can't quite follow what you're thinking about.  Can you explain
> further?
>

Gladly.

I've been up to my eyeballs in this for years now, and sometimes it becomes
quite hard to see the outside perspective, so thanks for your patience.


>
> > Requiring live replication connections could also be an issue for service
> > interruptions, surely?  Unless you persist needed knowledge in the physical
> > replication slot used by the standby to master connection, so the master can
> > tell the difference between "downstream went away for a while but will come
> > back" and "downstream is gone forever, toss out its resources."
>
> I don't think the master needs to retain any resources on behalf of
> the failover slot.  If the slot has been updated by feedback from the
> associated standby, then the master can toss those resources
> immediately.  When the standby comes back on line, it will find out
> via a protocol message that it can fast-forward the slot to whatever
> the new LSN is, and any WAL files before that point are irrelevant on
> both the master and the standby.

Re: [HACKERS] WIP: Failover Slots

2017-08-09 Thread Robert Haas
On Tue, Aug 8, 2017 at 4:00 AM, Craig Ringer  wrote:
>> - When a standby connects to a master, it can optionally supply a list
>> of slot names that it cares about.
>
> Wouldn't that immediately exclude use for PITR and snapshot recovery? I have
> people right now who want the ability to promote a PITR-recovered snapshot
> into place of a logical replication master and have downstream peers replay
> from it. It's more complex than that, as there's a resync process required
> to recover changes the failed node had sent to other peers but isn't
> available in the WAL archive, but that's the gist.
>
> If you have a 5TB database do you want to run an extra replica or two
> because PostgreSQL can't preserve slots without a running, live replica?
> Your SAN snapshots + WAL archiving have been fine for everything else so
> far.

OK, so what you're basically saying here is that you want to encode
the failover information in the write-ahead log rather than passing it
at the protocol level, so that if you replay the write-ahead log on a
time delay you get the same final state that you would have gotten if
you had replayed it immediately.  I hadn't thought about that
potential advantage, and I can see that it might be an advantage for
some reason, but I don't yet understand what the reason is.  How would
you imagine using any version of this feature in a PITR scenario?  If
you PITR the master back to an earlier point in time, I don't see how
you're going to manage without resyncing the replicas, at which point
you may as well just drop the old slot and create a new one anyway.
Maybe you're thinking of a scenario where we PITR the master and also
use PITR to rewind the replica to a slightly earlier point?  But I
can't quite follow what you're thinking about.  Can you explain
further?

> Requiring live replication connections could also be an issue for service
> interruptions, surely?  Unless you persist needed knowledge in the physical
> replication slot used by the standby to master connection, so the master can
> tell the difference between "downstream went away for a while but will come
> back" and "downstream is gone forever, toss out its resources."

I don't think the master needs to retain any resources on behalf of
the failover slot.  If the slot has been updated by feedback from the
associated standby, then the master can toss those resources
immediately.  When the standby comes back on line, it will find out
via a protocol message that it can fast-forward the slot to whatever
the new LSN is, and any WAL files before that point are irrelevant on
both the master and the standby.

> Also, what about cascading? Lots of "pull" model designs I've looked at tend
> to fall down in cascaded environments. For that matter so do failover slots,
> but only for the narrower restriction of not being able to actually decode
> from a failover-enabled slot on a standby, they still work fine in terms of
> cascading down to leaf nodes.

I don't see the problem.  The cascaded standby tells the standby "I'm
interested in the slot called 'craig'" and the standby says "sure,
I'll tell you whenever 'craig' gets updated" but it turns out that
'craig' is actually a failover slot on that standby, so that standby
has said to the master "I'm interested in the slot called 'craig'" and
the master is therefore sending updates to that standby.  Every time
the slot is updated, the master tells the standby and the standby
tells the cascaded standby and, well, that all seems fine.

Also, as Andres pointed out upthread, if the state is passed through
the protocol, you can have a slot on a standby that cascades to a
cascaded standby; if the state is passed through the WAL, all slots
have to cascade from the master.  Generally, with protocol-mediated
failover slots, you can have a different set of slots on every replica
in the cluster and create, drop, and reconfigure them any time you
like.  With WAL-mediated slots, all failover slots must come from the
master and cascade to every standby you've got, which is less
flexible.
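
For what it's worth, the rough shape of the protocol-mediated state being
discussed might be something like this (hypothetical; no such walsender
message exists):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;

/* One slot's state, as a new walsender sub-protocol message. */
typedef struct SlotStateMessage
{
    char          slot_name[64];
    bool          dropped;              /* upstream deleted the slot */
    XLogRecPtr    restart_lsn;
    XLogRecPtr    confirmed_flush_lsn;
    TransactionId xmin;
    TransactionId catalog_xmin;
} SlotStateMessage;

/*
 * Flow for the cascading case above:
 *   cascaded standby -> standby:  "I care about slot 'craig'"
 *   standby -> master:            "I care about slot 'craig'"
 *   master -> standby:            SlotStateMessage for 'craig'
 *   standby -> cascaded standby:  SlotStateMessage for 'craig'
 * Each hop applies the update to its local copy, which is why per-replica
 * slot sets can differ, unlike WAL-mediated slots, which every standby
 * necessarily replays.
 */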

I don't want to come on too strong here.  I'm very willing to admit
that you may know a lot more about this than me and I am really
extremely happy to benefit from that accumulated knowledge.  If you're
saying that WAL-mediated slots are a lot better than protocol-mediated
slots, you may well be right, but I don't yet understand the reasons,
and I want to understand the reasons.  I think this stuff is too
important to just have one person saying "here's a patch that does it
this way" and everybody else just says "uh, ok".  Once we adopt some
proposal here we're going to have to continue supporting it forever,
so it seems like we'd better do our best to get it right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Failover Slots

2017-08-08 Thread Craig Ringer
On 3 August 2017 at 04:35, Robert Haas  wrote:

> On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer 
> wrote:
> > No. The whole approach seems to have been bounced from core. I don't agree
> > and continue to think this functionality is desirable but I don't get to
> > make that call.
>
> I actually think failover slots are quite desirable, especially now
> that we've got logical replication in core.  In a review of this
> thread I don't see anyone saying otherwise.  The debate has really
> been about the right way of implementing that.  Suppose we did
> something like this:
>
> - When a standby connects to a master, it can optionally supply a list
> of slot names that it cares about.
>

Wouldn't that immediately exclude use for PITR and snapshot recovery? I
have people right now who want the ability to promote a PITR-recovered
snapshot into place of a logical replication master and have downstream
peers replay from it. It's more complex than that, as there's a resync
process required to recover changes the failed node had sent to other peers
but isn't available in the WAL archive, but that's the gist.

If you have a 5TB database do you want to run an extra replica or two
because PostgreSQL can't preserve slots without a running, live replica?
Your SAN snapshots + WAL archiving have been fine for everything else so
far.

Requiring live replication connections could also be an issue for service
interruptions, surely?  Unless you persist needed knowledge in the physical
replication slot used by the standby to master connection, so the master
can tell the difference between "downstream went away for a while but will
come back" and "downstream is gone forever, toss out its resources."

That's exactly what the catalog_xmin hot_standby_feedback patches in Pg10
do, but they can only tell the master about the oldest resources needed by
any existing slot on the replica. Not which slots. And they have the same
issues with needing a live, running replica.

Also, what about cascading? Lots of "pull" model designs I've looked at
tend to fall down in cascaded environments. For that matter so do failover
slots, but only for the narrower restriction of not being able to actually
decode from a failover-enabled slot on a standby, they still work fine in
terms of cascading down to leaf nodes.

> - The master responds by periodically notifying the standby of changes
> to the slot contents using some new replication sub-protocol message.
> - The standby applies those updates to its local copies of the slots.
>

That's pretty much what I expect to have to do for clients to work on
unpatched Pg10, probably using a separate bgworker and normal libpq
connections to the upstream since we don't have hooks to extend the
walsender/walreceiver.

It can work now that the catalog_xmin hot_standby_feedback patches are in,
but it'd require some low-level slot state setting that I know Andres is
not a fan of. So I expect to carry on relying on an out-of-tree failover
slots patch for Pg 10.



> So, you could create a slot on a standby with an "uplink this" flag of
> some kind, and it would then try to keep it up to date using the
> method described above.  It's not quite clear to me how to handle the
> case where the corresponding slot doesn't exist on the master, or
> initially does but then it's later dropped, or it initially doesn't
> but it's later created.
>
> Thoughts?


Right. So the standby must be running and in active communication. It needs
some way to know the master has confirmed slot creation and it can rely on
the slot's resources really being reserved by the master. That turns out to
be quite hard, per the decoding on standby patches. There needs to be some
way to tell the master a standby has gone away forever and to drop its
dependent slots, so you're not stuck wondering "is slot xxyz from standby
abc that we lost in that crash?". Standbys need to cope with having created
a slot, only to find out there's a name collision with master.

For all those reasons, I just extended hot_standby_feedback to report
catalog_xmin separately to upstreams instead, so the existing physical slot
serves all these needs. And it's part of the picture, but there's no way to
get slot position change info from the master back down onto the replicas
so the replicas can advance any of their own slots and, via feedback, free
up master resources. That's where the bgworker hack to query
pg_replication_slots comes in. Seems complex, full of restrictions, and
fragile to me compared to just expecting the master to do it.
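
(The polling side of that hack is at least simple; the bgworker would
repeatedly run something like the query below against the master and copy
the values into local slots. pg_replication_slots and its columns are real;
the connstring and all the worker plumbing are omitted.)

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn   *master = PQconnectdb("dbname=postgres");  /* placeholder */
    PGresult *res = PQexec(master,
        "SELECT slot_name, restart_lsn, confirmed_flush_lsn, catalog_xmin"
        "  FROM pg_replication_slots WHERE slot_type = 'logical'");

    /* On error PQntuples() is 0, so the loop is simply skipped. */
    for (int i = 0; i < PQntuples(res); i++)
        printf("%s restart=%s confirmed=%s catalog_xmin=%s\n",
               PQgetvalue(res, i, 0), PQgetvalue(res, i, 1),
               PQgetvalue(res, i, 2), PQgetvalue(res, i, 3));

    PQclear(res);
    PQfinish(master);
    return 0;
}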

The only objection I personally understood and accepted re failover slots
was that it'd be impossible to create a failover slot on a standby and have
that standby "sub-tree" support failover to leaf nodes. Which is true, but
instead we have noting and no viable looking roadmap toward anything users
can benefit from. So I don't think that's the worst restriction in the
world.

I do not understand 

Re: [HACKERS] WIP: Failover Slots

2017-08-02 Thread Robert Haas
On Tue, Jul 25, 2017 at 8:44 PM, Craig Ringer  wrote:
> No. The whole approach seems to have been bounced from core. I don't agree
> and continue to think this functionality is desirable but I don't get to
> make that call.

I actually think failover slots are quite desirable, especially now
that we've got logical replication in core.  In a review of this
thread I don't see anyone saying otherwise.  The debate has really
been about the right way of implementing that.  Suppose we did
something like this:

- When a standby connects to a master, it can optionally supply a list
of slot names that it cares about.
- The master responds by periodically notifying the standby of changes
to the slot contents using some new replication sub-protocol message.
- The standby applies those updates to its local copies of the slots.

So, you could create a slot on a standby with an "uplink this" flag of
some kind, and it would then try to keep it up to date using the
method described above.  It's not quite clear to me how to handle the
case where the corresponding slot doesn't exist on the master, or
initially does but then it's later dropped, or it initially doesn't
but it's later created.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2017-07-25 Thread Craig Ringer
On 26 July 2017 at 00:16, Thom Brown  wrote:

> On 8 April 2016 at 07:13, Craig Ringer  wrote:
> > On 6 April 2016 at 22:17, Andres Freund  wrote:
> >
> >>
> >> Quickly skimming 0001 in [4] there appear to be a number of issues:
> >> * LWLockHeldByMe() is only for debugging, not functional differences
> >> * ReplicationSlotPersistentData is now in an xlog related header
> >> * The code and behaviour around name conflicts of slots seems pretty
> >>   raw, and not discussed
> >> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
> >>   idea
> >> * I doubt that the archive based switches around StartupReplicationSlots
> >>   do what they intend. Afaics that'll not work correctly for basebackups
> >>   taken with -X, without recovery.conf
> >>
> >
> > Thanks for looking at it. Most of those are my errors. I think this is
> > pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> > learning about a couple of those mistakes.
> >
> > Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> > as a promotion and wouldn't increment the timeline, copied non-failover
> > slots wouldn't get removed. I've never liked that logic at all anyway, I
> > just couldn't think of anything better...
> >
> > LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> > support only." So that's just a dumb mistake on my part, and I should've
> > added "alreadyLocked" parameters. (Ugly, but works).
> >
> > But why would it be a bad idea to conditionally take a code path that
> > acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> > RecoveryInProgress() more than once and doing the acquire and release based
> > on separate tests, which would be a problem. I don't really get the problem
> > with:
> >
> > if (!RecoveryInProgress())
> > {
> >     /* first check whether there's something to write out */
> >     SpinLockAcquire(&slot->mutex);
> >     was_dirty = slot->dirty;
> >     slot->just_dirtied = false;
> >     SpinLockRelease(&slot->mutex);
> >
> >     /* and don't do anything if there's nothing to write */
> >     if (!was_dirty)
> >         return;
> > }
> >
> > ... though I think what I really should've done there is just always dirty
> > the slot in the redo functions.
>
> Are there any plans to submit a new design/version for v11?


No. The whole approach seems to have been bounced from core. I don't agree
and continue to think this functionality is desirable but I don't get to
make that call.

If time permits I will attempt to update the logical decoding on standby
patchset instead, and if possible add support for fast-forward logical
decoding that does the minimum required to correctly maintain a slot's
catalog_xmin and restart_lsn when advanced. But this won't be usable
directly for failover like failover slots are, it'll require each
application to keep track of standbys and maintain slots on them too. Or
we'll need some kind of extension/helper to sync slot state.

In the mean time, 2ndQuadrant maintains an on-disk-compatible version of
failover slots that's available for support customers.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2017-07-25 Thread Thom Brown
On 8 April 2016 at 07:13, Craig Ringer  wrote:
> On 6 April 2016 at 22:17, Andres Freund  wrote:
>
>>
>> Quickly skimming 0001 in [4] there appear to be a number of issues:
>> * LWLockHeldByMe() is only for debugging, not functional differences
>> * ReplicationSlotPersistentData is now in an xlog related header
>> * The code and behaviour around name conflicts of slots seems pretty
>>   raw, and not discussed
>> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>>   idea
>> * I doubt that the archive based switches around StartupReplicationSlots
>>   do what they intend. Afaics that'll not work correctly for basebackups
>>   taken with -X, without recovery.conf
>>
>
> Thanks for looking at it. Most of those are my errors. I think this is
> pretty dead at least for 9.6, so I'm mostly following up in the hopes of
> learning about a couple of those mistakes.
>
> Good catch with -X without a recovery.conf. Since it wouldn't be recognised
> as a promotion and wouldn't increment the timeline, copied non-failover
> slots wouldn't get removed. I've never liked that logic at all anyway, I
> just couldn't think of anything better...
>
> LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
> support only." So that's just a dumb mistake on my part, and I should've
> added "alreadyLocked" parameters. (Ugly, but works).
>
> But why would it be a bad idea to conditionally take a code path that
> acquires a spinlock based on whether RecoveryInProgress()? It's not testing
> RecoveryInProgress() more than once and doing the acquire and release based
> on separate tests, which would be a problem. I don't really get the problem
> with:
>
> if (!RecoveryInProgress())
> {
>     /* first check whether there's something to write out */
>     SpinLockAcquire(&slot->mutex);
>     was_dirty = slot->dirty;
>     slot->just_dirtied = false;
>     SpinLockRelease(&slot->mutex);
>
>     /* and don't do anything if there's nothing to write */
>     if (!was_dirty)
>         return;
> }
>
> ... though I think what I really should've done there is just always dirty
> the slot in the redo functions.

Are there any plans to submit a new design/version for v11?

Thanks

Thom




Re: [HACKERS] WIP: Failover Slots

2016-04-08 Thread Craig Ringer
On 6 April 2016 at 22:17, Andres Freund  wrote:


> Quickly skimming 0001 in [4] there appear to be a number of issues:
> * LWLockHeldByMe() is only for debugging, not functional differences
> * ReplicationSlotPersistentData is now in an xlog related header
> * The code and behaviour around name conflicts of slots seems pretty
>   raw, and not discussed
> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>   idea
> * I doubt that the archive based switches around StartupReplicationSlots
>   do what they intend. Afaics that'll not work correctly for basebackups
>   taken with -X, without recovery.conf
>
>
Thanks for looking at it. Most of those are my errors. I think this is
pretty dead at least for 9.6, so I'm mostly following up in the hopes of
learning about a couple of those mistakes.

Good catch with -X without a recovery.conf. Since it wouldn't be recognised
as a promotion and wouldn't increment the timeline, copied non-failover
slots wouldn't get removed. I've never liked that logic at all anyway, I
just couldn't think of anything better...

LWLockHeldByMe() has a comment to the effect of: "This is meant as debug
support only." So that's just a dumb mistake on my part, and I should've
added "alreadyLocked" parameters. (Ugly, but works).

But why would it be a bad idea to conditionally take a code path that
acquires a spinlock based on whether RecoveryInProgress()? It's not testing
RecoveryInProgress() more than once and doing the acquire and release based
on separate tests, which would be a problem. I don't really get the problem
with:

if (!RecoveryInProgress())
{
    /* first check whether there's something to write out */
    SpinLockAcquire(&slot->mutex);
    was_dirty = slot->dirty;
    slot->just_dirtied = false;
    SpinLockRelease(&slot->mutex);

    /* and don't do anything if there's nothing to write */
    if (!was_dirty)
        return;
}

... though I think what I really should've done there is just always
dirty the slot in the redo functions.
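
Something like the following, as a standalone sketch: the dirty and
just_dirtied fields mirror the real ReplicationSlot, but the redo record
type is hypothetical (from the failover-slots patch idea, not core):

#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

typedef struct ReplicationSlot          /* trimmed-down stand-in */
{
    bool        dirty;
    bool        just_dirtied;
    XLogRecPtr  restart_lsn;
} ReplicationSlot;

typedef struct xl_replslot_update       /* hypothetical redo record */
{
    XLogRecPtr  restart_lsn;
} xl_replslot_update;

void
slot_redo_update(ReplicationSlot *slot, const xl_replslot_update *xlrec)
{
    /* SpinLockAcquire(&slot->mutex) in the real thing */
    slot->restart_lsn = xlrec->restart_lsn;
    slot->just_dirtied = true;  /* unconditionally dirty, so the save  */
    slot->dirty = true;         /* path needs no RecoveryInProgress()  */
                                /* special case at all                 */
    /* SpinLockRelease(&slot->mutex) */
}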





-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 15:17, Andres Freund  wrote:

> On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> > On 6 April 2016 at 14:15, Craig Ringer  wrote:
> >  ...
> >
> > Nice summary
> >
> > Failover slots are optional. And they work on master.
> >
> > While the other approach could also work, it will work later and still
> > require a slot on the master.
> >
> >
> > => I don't see why having Failover Slots in 9.6 would prevent us from
> > having something else later, if someone else writes it.
> >
> >
> > We don't need to add this to core. Each plugin can independently write its
> > own failover code. Works, but doesn't seem like the right approach for open
> > source.
> >
> >
> > => I think we should add Failover Slots to 9.6.
>
> Simon, please don't take this personally; because of the other ongoing
> thread.
>

Thanks for the review. Rational technical comments are exactly why we are
here and they are always welcome.


> For one I think this is architecturally the wrong choice.

> But even leaving that fact aside, and
> considering this a temporary solution (we can't easily remove),


As I observed above, the alternate solution doesn't sound particularly good
either, but the main point is that we wouldn't need to remove this one; the
two can coexist happily. I would add that I did think of the alternate
solution previously as well; this one seemed simpler, which is always key
for me in code aimed at robustness.


> there appears to have been very little code level review


That is potentially fixable. At this point I don't claim it is committable;
I only say it is important and the alternate solution is not significantly
better, so if the patch can be beaten into shape we should commit it.

I will spend some time on this and see if we have something viable. Which
will be posted here for discussion, as would have happened even before our
other discussions.

Thanks for the points below

> (one early from Petr
> in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
> whole patch was submitted late to the 9.6 cycle.
>
> Quickly skimming 0001 in [4] there appear to be a number of issues:
> * LWLockHeldByMe() is only for debugging, not functional differences
> * ReplicationSlotPersistentData is now in an xlog related header
> * The code and behaviour around name conflicts of slots seems pretty
>   raw, and not discussed
> * Taking spinlocks dependant on InRecovery() seems like a seriously bad
>   idea
> * I doubt that the archive based switches around StartupReplicationSlots
>   do what they intend. Afaics that'll not work correctly for basebackups
>   taken with -X, without recovery.conf
>
> That's from a ~5 minute skim, of one patch in the series.
>
>
> [1]
> http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
> [2]
> http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
> [3]
> http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
> [4]
> http://archives.postgresql.org/message-id/camsr+ye6lny2e0tbuaqb+ntvb6w-dhjaflq0-zbal7g7hjh...@mail.gmail.com
>

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Andres Freund
On 2016-04-06 14:30:21 +0100, Simon Riggs wrote:
> On 6 April 2016 at 14:15, Craig Ringer  wrote:
>  ...
> 
> Nice summary
> 
> Failover slots are optional. And they work on master.
> 
> While the other approach could also work, it will work later and still
> require a slot on the master.
> 
> 
> => I don't see why having Failover Slots in 9.6 would prevent us from
> having something else later, if someone else writes it.
> 
> 
> We don't need to add this to core. Each plugin can independently write its
> own failover code. Works, but doesn't seem like the right approach for open
> source.
> 
> 
> => I think we should add Failover Slots to 9.6.

Simon, please don't take this personally; because of the other ongoing
thread.

I don't think this is commit-ready. For one I think this is
architecturally the wrong choice. But even leaving that fact aside, and
considering this a temporary solution (we can't easily remove), there
appears to have been very little code level review (one early from Petr
in [1], two by Oleksii mostly focusing on error messages [2] [3]). The
whole patch was submitted late to the 9.6 cycle.

Quickly skimming 0001 in [4] there appear to be a number of issues:
* LWLockHeldByMe() is only for debugging, not functional differences
* ReplicationSlotPersistentData is now in an xlog related header
* The code and behaviour around name conflicts of slots seems pretty
  raw, and not discussed
* Taking spinlocks dependant on InRecovery() seems like a seriously bad
  idea
* I doubt that the archive based switches around StartupReplicationSlots
  do what they intend. Afaics that'll not work correctly for basebackups
  taken with -X, without recovery.conf

That's from a ~5 minute skim, of one patch in the series.


[1] 
http://archives.postgresql.org/message-id/CALLjQTSCHvcsF6y7%3DZhmdMjJUMGLqt1-6Pz2rtb7PfFLxFfBOw%40mail.gmail.com
[2] 
http://archives.postgresql.org/message-id/FA68178E-F0D1-47F6-9791-8A3E2136C119%40hintbits.com
[3] 
http://archives.postgresql.org/message-id/9B503EB5-676A-4258-9F78-27FC583713FE%40hintbits.com
[4] 
http://archives.postgresql.org/message-id/camsr+ye6lny2e0tbuaqb+ntvb6w-dhjaflq0-zbal7g7hjh...@mail.gmail.com

Greetings,

Andres Freund




Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Simon Riggs
On 6 April 2016 at 14:15, Craig Ringer  wrote:
 ...

Nice summary

Failover slots are optional. And they work on master.

While the other approach could also work, it will work later and still
require a slot on the master.


=> I don't see why having Failover Slots in 9.6 would prevent us from
having something else later, if someone else writes it.


We don't need to add this to core. Each plugin can independently write its
own failover code. Works, but doesn't seem like the right approach for open
source.


=> I think we should add Failover Slots to 9.6.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Craig Ringer
A few thoughts on failover slots vs the alternative of pushing catalog_xmin
up to the master via a replica's slot and creating independent slots on
replicas.


Failover slots:
---

+ Failover slots are very easy for applications. They "just work" and are
transparent for failover. This is great especially for things that aren't
complex replication schemes, that just want to use logical decoding.

+ Applications don't have to know what replicas exist or be able to reach
them; transparent failover is easier.

- Failover slots can't be used from a cascading standby (where we can fail
down to the standby's own replicas) because they have to write WAL to
advance the slot position. They'd have to send the slot position update
"up" to the master then wait to replay it. Not a disaster, though they'd do
extra work on reconnect until a restart_lsn update replayed. Would require
a whole new feedback-like message on the rep protocol, and couldn't work at
all with archive replication. Ugly as hell.

+ Failover slots exist now, and could be added to 9.6.

- The UI for failover slots can't be re-used for the catalog_xmin push-up
approach to allow replay from failover slots on cascading standbys in 9.7+.
There'd be no way to propagate the creation of failover slots "down" the
replication hierarchy that way, especially to archive standbys, the way
failover slots do. So it'd be semantically different and couldn't re-use
the failover slots UI. We'd be stuck with failover slots even if we also
did the other way later.

+ Will work for recovery of a master PITR-restored up to the latest
recovery point



Independent slots on replicas + catalog_xmin push-up
---

With this approach we allow creation of replication slots on a replica
independently of the master. The replica is required to connect to the
master via a slot. We send feedback to the master to advance the replica's
slot on the master to the confirmed_lsn of the most-behind slot on the
replica, therefore pinning master's catalog_xmin where needed. Or we just
send a new feedback message type that directly sets a catalog_xmin on the
replica's physical slot in the master. Slots are _not_ cloned from master
to replica automatically.


- More complicated for applications to use. They have to create a slot on
each replica that might be failed over to as well as the master and have to
advance all those slots to stop the master from suffering severe catalog
bloat.  (But see note below).

- Applications must be able to connect to failover-candidate standbys and
know where they are, it's not automagically handled via WAL.  (But see note
below).

- Applications need reconfiguration whenever a standby is rebuilt, moved,
etc. (But see note below).

- Cannot work at all for archive-based replication, requires a slot from
replica to master.

+ Works with replay from cascading standbys

+ Actually solves one of the problems making logical slots on standbys
unsupported at the moment by giving us a way to pin the master's
catalog_xmin to that needed by a replica.

- Won't work for a standby PITR-restored up to latest.

- Vapourware with zero hope for 9.6


Note: I think the application complexity issues can be solved - to a degree
- by having the replicas run a bgworker based helper that connects to the
master and clones the master's slots then advances them automatically.


Do nothing
---

Drop the idea of being able to follow physical failover on logical slots.

I've already expressed why I think this is a terrible idea. It's hostile to
application developers who'd like to use logical decoding. It makes
integration of logical replication with existing HA systems much harder. It
means we need really solid, performant, well-tested and mature logical rep
based HA before we can take logical rep seriously, which is a long way out
given that we can't do decoding of in-progress xacts, ddl, sequences, 
etc etc.

Some kind of physical HA for logical slots is needed and will be needed for
some time. Logical rep will be great for selective replication, replication
over WAN, filtered/transformed replication etc. Physical rep is great for
knowing you'll get exactly the same thing on the replica that you have on
the master and it'll Just Work.

In any case, "Do nothing" is the same for 9.6 as pursuing the catalog_xmin
push-up idea; in both cases we don't commit anything in 9.6.


Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Craig Ringer
On 6 April 2016 at 17:43, Simon Riggs  wrote:

> On 25 January 2016 at 14:25, Craig Ringer  wrote:
>
>
>> I'd like to get failover slots in place for 9.6 since they're fairly
>> self-contained and meet an immediate need: allowing replication using slots
>> (physical or logical) to follow a failover event.
>>
>
> I'm a bit confused about this now.
>
> We seem to have timeline following, yet no failover slot. How do we now
> follow a failover event?
>


> There are many and varied users of logical decoding now and a fix is
> critically important for 9.6.
>

I agree with you, but I haven't been able to convince enough people of that.


>  Do all decoding plugins need to write their own support code??
>

We'll be able to write a bgworker based extension that handles it by
running in the standby. So no, I don't think so.


> Please explain how we cope without this, so if a problem remains we can
> fix by the freeze.
>

The TL;DR: Create a slot on the master to hold catalog_xmin where the
replica needs it. Advance it using client or bgworker on replica based on
the catalog_xmin of the oldest slot on the replica. Copy slot state from
the master using an extension that keeps the slots on the replica
reasonably up to date.

All of this is ugly workaround for not having true slot failover support.
I'm not going to pretend it's nice, or anything that should go anywhere
near core. Petr outlined the approach we want to take for core in 9.7 on
the logical timeline following thread.


Details:

Logical decoding on a slot can follow timeline switches now - or rather,
the xlogreader knows how to follow timeline switches, and the read page
callback used by logical decoding uses that functionality now.

This doesn't help by itself because slots aren't synced to replicas so
they're lost on failover promotion.

Nor can a client just create a backup slot for itself on the replica to
be ready for failover:

- it has no way to create a new slot at a consistent point on the replica
since logical decoding isn't supported on replicas yet;
- it can't advance a logical slot on the replica once created since
decoding isn't permitted on a replica, so it can't just decode from the
replica in lockstep with the master;
- it has no way to stop the master from removing catalog tuples still
needed by the slot's catalog_xmin since catalog_xmin isn't propagated from
standby to master.

So we have to help the client out. To do so, we have a
function/worker/whatever on the replica that grabs the slot state from the
master and copies it to the replica, and we have to hold the master's
catalog_xmin down to the catalog_xmin required by the slots on the replica.

Holding the catalog_xmin down is the easier bit. We create a dummy logical
slot on the master, maintained by a function/bgworker/whatever on the
replica. It gets advanced so that its restart_lsn and catalog_xmin are
those of the oldest slot on the replica. We can do that by requesting
replay on it up to the lowest confirmed_lsn of any slot on the
replica. Ugly, but workable. Or we can abuse the infrastructure more deeply
by simply setting the catalog_xmin and restart_lsn on the slot directly,
but I'd rather not.
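Roughly, and just as a sketch - the slot name, LSN, and connection string
here are all invented - the helper's master-side call over plain libpq could
look like this. The decoded output is discarded; confirming replay up to
upto_lsn is what drags the dummy slot's restart_lsn and catalog_xmin forward:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Advance the master-side proxy slot by consuming and discarding
 * changes up to the given LSN. "failover_proxy" and the LSN are
 * hypothetical; the LSN would be the lowest confirmed_lsn among the
 * replica's slots.
 */
static int
advance_proxy_slot(PGconn *conn, const char *slot, const char *upto_lsn)
{
	const char *params[2] = {slot, upto_lsn};
	PGresult   *res;

	res = PQexecParams(conn,
					   "SELECT count(*) FROM pg_logical_slot_get_changes($1, $2::pg_lsn, NULL)",
					   2, NULL, params, NULL, NULL, 0);
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "advance failed: %s", PQerrorMessage(conn));
		PQclear(res);
		return 0;
	}
	PQclear(res);
	return 1;
}

int
main(void)
{
	PGconn	   *conn = PQconnectdb("host=master dbname=postgres");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}
	advance_proxy_slot(conn, "failover_proxy", "0/3000000");
	PQfinish(conn);
	return 0;
}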

Just copying slot state is pretty simple too, as at the C level you can
create a physical or logical slot with whatever state you want.

However, that lets you copy/create any number of bogus ones, many of which
will appear to work fine but will be subtly broken. Since the replica is an
identical copy of the master we know that a slot state that was valid on
the master at a given xlog insert lsn is also valid on the replica at the
same replay lsn, but we've got no reliable way to ensure that when the
master updates a slot at LSN A/B the replica also updates the slot at
replay of LSN A/B. That's what failover slots did. Without that we need to
use some external channel - but there's no way to capture knowledge of "at
exactly LSN A/B, master saved a new copy of slot X" since we can't hook
ReplicationSlotSave(). At least we *can* now inject slot state updates as
generic WAL messages though, so we can ensure they happen at exactly the
desired point in replay.
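Emitting the update at a known point is the easy half. A minimal sketch,
assuming the new-in-9.6 logical messages facility is what's meant above (the
"slot_sync" prefix and payload format are invented here); how the standby
side actually consumes these records during replay is the unsolved part:

#include "postgres.h"

#include "replication/message.h"

/*
 * Sketch only: serialize a slot's position and write it to WAL as a
 * non-transactional logical message, anchoring the update at a known
 * LSN in the stream.
 */
static XLogRecPtr
emit_slot_update(const char *slotname, XLogRecPtr restart_lsn,
				 TransactionId catalog_xmin)
{
	char		buf[NAMEDATALEN + 64];
	int			len;

	len = snprintf(buf, sizeof(buf), "%s %X/%X %u",
				   slotname,
				   (uint32) (restart_lsn >> 32), (uint32) restart_lsn,
				   catalog_xmin);

	/* false => emitted immediately, not at commit of our xact */
	return LogLogicalMessage("slot_sync", buf, len + 1, false);
}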

As Andres explained on the timeline following thread it's not safe for the
slot on the replica to be behind the state the slot on the master was at
the same LSN. At least unless we can protect catalog_xmin via some other
mechanism so we can make sure no catalogs still needed by the slots on the
replica are vacuumed away. It's vital that the catalog_xmin of any slots on
the replica be >= the catalog_xmin the master had for the lowest
catalog_xmin of any of its slots at the same LSN.

So what I figure we'll do is poll slot shmem on the master. When we notice
that a slot has changed we'll dump it into xlog via the generic xlog
mechanism to be applied on the replica, much like failover slots. The slot
update might arrive a bit late on the replica, but that's OK because we're

Re: [HACKERS] WIP: Failover Slots

2016-04-06 Thread Simon Riggs
On 25 January 2016 at 14:25, Craig Ringer  wrote:


> I'd like to get failover slots in place for 9.6 since they're fairly
> self-contained and meet an immediate need: allowing replication using slots
> (physical or logical) to follow a failover event.
>

I'm a bit confused about this now.

We seem to have timeline following, yet no failover slot. How do we now
follow a failover event?

There are many and varied users of logical decoding now and a fix is
critically important for 9.6.

Do all decoding plugins need to write their own support code??

Please explain how we cope without this, so if a problem remains we can fix
by the freeze.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-04-05 Thread Craig Ringer
On 5 April 2016 at 04:19, Oleksii Kliukin  wrote:


> Thank you for the update. I've got some rejects when applying the
> 0001-Allow-replication-slots-to-follow-failover.patch after the "Dirty
> replication slots when confirm_lsn is changed" changes. I think it should
> be rebased against master (might be a consequence of the "logical
> slots follow the timeline" patch being committed).
>

I'll rebase it on top of the new master now that timeline following for
logical slots has been committed, and follow up shortly.

That said, I've marked this patch 'returned with feedback' in the CF. It
should possibly actually be 'rejected' given the discussion on the logical
decoding timeline following thread, which points heavily at a different
approach to solving this problem in 9.7.

That doesn't mean nobody can pick it up if they think it's valuable and
want to run with it, but we're very close to feature freeze now.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-04-04 Thread Oleksii Kliukin
Hi,

> On 17 Mar 2016, at 09:34, Craig Ringer  wrote:
> 
> OK, here's the latest failover slots patch, rebased on top of today's master 
> plus, in order:
> 
> - Dirty replication slots when confirm_lsn is changed
>   (http://www.postgresql.org/message-id/camsr+yhj0oycug2zbyqprhxmcjnkt9d57msxdzgwbkcvx3+...@mail.gmail.com)
> 
> - logical decoding timeline following
>   (http://www.postgresql.org/message-id/CAMsr+YH-C1-X_+s=2nzapnr0wwqja-rumvhsyyzansn93mu...@mail.gmail.com)
> 
> The full tree is at
> https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots
> if you want to avoid the fiddling around required to apply the patch series.
> 
> 
> <0001-Allow-replication-slots-to-follow-failover.patch><0002-Update-decoding_failover-tests-for-failover-slots.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-Introduce-TAP-recovery-tests-for-failover-slots.patch>



Thank you for the update. I've got some rejects when applying the
0001-Allow-replication-slots-to-follow-failover.patch after the "Dirty
replication slots when confirm_lsn is changed" changes. I think it should be
rebased against master (might be a consequence of the "logical slots
follow the timeline" patch being committed).

patch -p1 
<~/git/pg/patches/failover-slots/v6/0001-Allow-replication-slots-to-follow-failover.patch
patching file src/backend/access/rmgrdesc/Makefile
Hunk #1 FAILED at 10.
1 out of 1 hunk FAILED -- saving rejects to file 
src/backend/access/rmgrdesc/Makefile.rej
patching file src/backend/access/rmgrdesc/replslotdesc.c
patching file src/backend/access/transam/rmgr.c
Hunk #1 succeeded at 25 (offset 1 line).
patching file src/backend/access/transam/xlog.c
Hunk #1 succeeded at 6351 (offset 3 lines).
Hunk #2 succeeded at 8199 (offset 14 lines).
Hunk #3 succeeded at 8645 (offset 14 lines).
Hunk #4 succeeded at 8718 (offset 14 lines).
patching file src/backend/commands/dbcommands.c
patching file src/backend/replication/basebackup.c
patching file src/backend/replication/logical/decode.c
Hunk #1 FAILED at 143.
1 out of 1 hunk FAILED -- saving rejects to file 
src/backend/replication/logical/decode.c.rej
patching file src/backend/replication/logical/logical.c
patching file src/backend/replication/slot.c
patching file src/backend/replication/slotfuncs.c
patching file src/backend/replication/walsender.c
patching file src/bin/pg_xlogdump/replslotdesc.c
patching file src/bin/pg_xlogdump/rmgrdesc.c
Hunk #1 succeeded at 27 (offset 1 line).
patching file src/include/access/rmgrlist.h
Hunk #1 FAILED at 45.
1 out of 1 hunk FAILED -- saving rejects to file 
src/include/access/rmgrlist.h.rej
patching file src/include/replication/slot.h
patching file src/include/replication/slot_xlog.h
can't find file to patch at input line 1469
Perhaps you used the wrong -p or --strip option?



--
Oleksii



Re: [HACKERS] WIP: Failover Slots

2016-03-16 Thread Craig Ringer
On 15 March 2016 at 21:40, Craig Ringer  wrote:

> Here's a new failover slots rev, addressing the issues Oleksii Kliukin
> raised and adding a bunch of TAP tests.
>

Ahem, just found an issue here. I'll need to send another revision.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-03-15 Thread Craig Ringer
Here's a new failover slots rev, addressing the issues Oleksii Kliukin
raised and adding a bunch of TAP tests.

In particular, for the checkpoint issue I ended up moving
CheckPointReplicationSlots to occur at the start of a checkpoint, before
writing WAL is prohibited. As the comments note it's just a convenient
place and time to do it anyway. That means it has to be called separately
at a restartpoint, but I don't think that's a biggie.
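Schematically the ordering change is just this; the stub names are invented,
it's not the actual patch code:

/*
 * Schematic only. checkpoint_replication_slots() stands in for
 * CheckPointReplicationSlots(); write_checkpoint_record() for the
 * tail end of CreateCheckPoint().
 */
static void
checkpoint_replication_slots(void)
{
	/* flush dirty slots; failover slots write their update WAL here */
}

static void
write_checkpoint_record(void)
{
	/*
	 * At a shutdown checkpoint this must be the last WAL record;
	 * anything written after the redo pointer is computed trips the
	 * "concurrent transaction log activity" PANIC seen downthread.
	 */
}

static void
create_checkpoint(void)
{
	checkpoint_replication_slots(); /* while writing WAL is still allowed */
	write_checkpoint_record();
}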

The tests for this took me quite a while, much (much) longer than the code
changes.

I split the patch up a bit more too so individual changes are more
logically grouped and clearer. I expect it'd be mostly or entirely squashed
for commit.
From 256d43f4c8195c893efeb0319d7642853d15f3a9 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Tue, 23 Feb 2016 15:59:37 +0800
Subject: [PATCH 1/7] Allow replication slots to follow failover

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

A failover slot may only be created on a master server, as it must be
able to write WAL. This limitation may be lifted later.

pg_basebackup is also modified to copy the contents of pg_replslot.
Non-failover slots will now be removed during backend startup instead
of being omitted from the copy.

This patch does not add any user interface for failover slots. There's
no way to create them from SQL or from the walsender. That and the
documentation for failover slots are in the next patch in the series
so that this patch is entirely focused on the implementation.

Craig Ringer, based on a prototype by Simon Riggs
---
 src/backend/access/rmgrdesc/Makefile   |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c |  65 +++
 src/backend/access/transam/rmgr.c  |   1 +
 src/backend/access/transam/xlog.c  |   5 +-
 src/backend/commands/dbcommands.c  |   3 +
 src/backend/replication/basebackup.c   |  12 -
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/logical.c  |  25 +-
 src/backend/replication/slot.c | 586 +++--
 src/backend/replication/slotfuncs.c|   4 +-
 src/backend/replication/walsender.c|   8 +-
 src/bin/pg_xlogdump/replslotdesc.c |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c |   1 +
 src/include/access/rmgrlist.h  |   1 +
 src/include/replication/slot.h |  69 +--
 src/include/replication/slot_xlog.h| 100 
 .../modules/decoding_failover/decoding_failover.c  |   6 +-
 17 files changed, 758 insertions(+), 132 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 000..5829e8d
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,65 @@
+/*-
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & 

Re: [HACKERS] WIP: Failover Slots

2016-03-04 Thread Craig Ringer
On 24 February 2016 at 18:02, Craig Ringer  wrote:


> I really want to focus on the first patch, timeline following for logical
> slots. That part is much less invasive and is useful stand-alone. I'll move
> it to a separate CF entry and post it to a separate thread as I think it
> needs consideration independently of failover slots.
>

Just an update on the failover slots status: I've moved timeline following
for logical slots into its own patch set and CF entry and added a bunch of
tests.

https://commitfest.postgresql.org/9/488/

Some perl TAP test framework enhancements were needed for that; they're
mostly committed now with a few pending.

https://commitfest.postgresql.org/9/569/

Once some final changes are made to the tests for timeline following I'll
address the checkpoint issue in failover slots by doing the checkpoint of
slots at the start of a checkpoint/restartpoint, while we can still write
WAL. Per the comments in CheckPointReplicationSlots it's mostly done in a
checkpoint currently for convenience.

Then I'll write some TAP tests for failover slots and submit an updated
patch for them, by which time hopefully timeline following for logical
slots will be committed.

In other words this patch isn't dead, the foundations are just being
rebased out from under it.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-02-24 Thread Craig Ringer
On 24 February 2016 at 03:53, Oleksii Kliukin  wrote:


>
> I found the following issue when shutting down a master with a connected
> replica that uses a physical failover slot:
>
> 2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07
> CET,,0,DEBUG,0,"performing replication slot checkpoint",""
> 2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07
> CET,,0,DEBUG,0,"archived transaction log file
> ""00010003""",""
> 2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07
> CET,,0,PANIC,XX000,"concurrent transaction log activity while database
> system is shutting down",""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07
> CET,,0,LOG,0,"checkpointer process (PID 54998) was terminated by signal
> 6: Abort trap",""
> 2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07
> CET,,0,LOG,0,"terminating any other active server processes",
>
>
Odd that I didn't see that in my testing. Thanks very much for this. I
concur with your explanation.

Basically, the issue is that CreateCheckPoint calls
> CheckpointReplicationSlots, which currently produces WAL, and this violates
> the assumption at line xlog.c:8492
>
> if (shutdown && checkPoint.redo != ProcLastRecPtr)
> ereport(PANIC,
> (errmsg("concurrent transaction log activity while database system is
> shutting down")));
>

Interesting problem.

It might be reasonably harmless to omit writing WAL for failover slots
during a shutdown checkpoint. We're using WAL to move data to the replicas
but we don't really need it for local redo and correctness on the master.
The trouble is that we do of course redo failover slot updates on the
master, and we don't really want a slot to go backwards vs its on-disk state
before a crash. That's not too harmful - but it could lead to us losing a
catalog_xmin increase, so the slot thinks catalog tuples are still readable
when they've actually been vacuumed away.

CheckpointReplicationSlots notes that:

 * This needn't actually be part of a checkpoint, but it's a convenient
 * location.

... and I suspect the answer there is simply to move the slot checkpoint to
occur prior to the WAL checkpoint rather than during it. I'll investigate.


I really want to focus on the first patch, timeline following for logical
slots. That part is much less invasive and is useful stand-alone. I'll move
it to a separate CF entry and post it to a separate thread as I think it
needs consideration independently of failover slots.


(BTW, the slot docs promise that slots will replay a change exactly once,
but this is not correct and the client must keep track of replay position.
I'll post a patch to correct it separately).


> There are a couple of incorrect comments
>

Thanks, will amend.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-02-23 Thread Oleksii Kliukin

> On 23 Feb 2016, at 11:30, Craig Ringer  wrote:
> 
>  
> Updated patch 
> 
> ... attached
> 
> I've split it up a bit more too, so it's easier to tell what change is for 
> what and fixed the issues mentioned by Oleksii. I've also removed some 
> unrelated documentation changes.
> 
> Patch 0001, timeline switches for logical decoding, is unchanged since the 
> last post. 

Thank you, I read the user-interface part now, looks good to me.

I found the following issue when shutting down a master with a connected 
replica that uses a physical failover slot:

2016-02-23 20:33:42.546 CET,,,54998,,56ccb3f3.d6d6,3,,2016-02-23 20:33:07 
CET,,0,DEBUG,0,"performing replication slot checkpoint",""
2016-02-23 20:33:42.594 CET,,,55002,,56ccb3f3.d6da,4,,2016-02-23 20:33:07 
CET,,0,DEBUG,0,"archived transaction log file 
""00010003""",""
2016-02-23 20:33:42.601 CET,,,54998,,56ccb3f3.d6d6,4,,2016-02-23 20:33:07 
CET,,0,PANIC,XX000,"concurrent transaction log activity while database system 
is shutting down",""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,5,,2016-02-23 20:33:07 
CET,,0,LOG,0,"checkpointer process (PID 54998) was terminated by signal 6: 
Abort trap",""
2016-02-23 20:33:43.537 CET,,,54995,,56ccb3f3.d6d3,6,,2016-02-23 20:33:07 
CET,,0,LOG,0,"terminating any other active server processes",

Basically, the issue is that CreateCheckPoint calls CheckpointReplicationSlots, 
which currently produces WAL, and this violates the assumption at line 
xlog.c:8492

if (shutdown && checkPoint.redo != ProcLastRecPtr)
ereport(PANIC,
(errmsg("concurrent transaction log activity 
while database system is shutting down")));


There are a couple of incorrect comments

logical.c: 90
There's some things missing to allow this: I think it should be “There are some 
things missing to allow this:”

logical.c:93
"we need we would need”

slot.c:889
"and there's no latch to set, so poll” - clearly there is a latch used in the 
code below.

Also, slot.c:301 emits an error message for an attempt to create a failover 
slot on the replica after acquiring and releasing the locks and getting the 
shared memory slot, even though all the data to check for this condition is 
available right at the beginning of the function. Shouldn’t it avoid the extra 
work if it’s not needed? 

> 
> 
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/ 
> 
>  PostgreSQL Development, 24x7 Support, Training & Services
> <0001-Allow-logical-slots-to-follow-timeline-switches.patch><0002-Allow-replication-slots-to-follow-failover.patch><0003-Retain-extra-WAL-for-failover-slots-in-base-backups.patch><0004-Add-the-UI-and-for-failover-slots.patch><0005-Document-failover-slots.patch><0006-Add-failover-to-pg_replication_slots.patch><0007-not-for-inclusion-Test-script-for-failover-slots.patch>


Kind regards,
--
Oleksii



Re: [HACKERS] WIP: Failover Slots

2016-02-23 Thread Craig Ringer
On 22 February 2016 at 23:39, Oleksii Kliukin  wrote:


> What it’s doing is calling pg_basebackup first to initialize the replica,
> and that actually failed with:
>
> pg_basebackup: unexpected termination of replication stream: ERROR:
>  requested WAL segment 0001 has already been removed
>
> The segment name definitely looks bogus to me.
>
> The actual command causing the failure was an attempt to clone the replica
> using pg_basebackup, turning on xlog streaming:
>
> pg_basebackup --pgdata data/postgres1 --xlog-method=stream
> --dbname="host=localhost port=5432  user=replicator”
>
> I checked the same command against the  git master without the patches
> applied and could not reproduce this problem there.
>

That's a bug. In testing whether we need to return a lower LSN for minimum
WAL for BASE_BACKUP, it failed to properly test for InvalidXLogRecPtr. Good
catch.
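The shape of the fix is roughly this (variable names hypothetical; the real
change is in the BASE_BACKUP path). With the test missing, an unset
InvalidXLogRecPtr (0) compares as lower than any real LSN, which is
presumably how the bogus segment name arose:

/*
 * Only let the slots' minimum restart LSN pull the backup's WAL
 * starting point back when it is actually set.
 */
if (!XLogRecPtrIsInvalid(slots_min_lsn) && slots_min_lsn < startptr)
	startptr = slots_min_lsn;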


> On the code level, I have no comments on 0001, it’s well documented and I
> have no questions about the approach, although I might not be knowledgeable
> enough to judge the specifics of the implementation.
>

The first patch is the most important IMO, and the one I think needs the
most thought since it's ... well, timelines aren't simple.


> slots.c:294
> elog(LOG, "persistency is %i", (int)slot->data.persistency);
>
> Should be changed to DEBUG?
>

That's an escapee log statement I thought I'd already rebased out. Well
spotted, fixed.


>
> slot.c:468
> Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?
>

That's an editing error on my part that I'll reverse. Since the prototype
declares (void) it doesn't matter, but it's a pointless change. Fixed.


> walsender.c: 1509 at PhysicalConfirmReceivedLocation
>
> I’ve noticed a comment stating that we don’t need to call
> ReplicationSlotSave(), but that pre-dated the WAL-logging of replication
> slot changes. Don’t we need to call it now, the same way it’s done for
> the logical slots in logical.c:at LogicalConfirmReceivedLocation?
>

No, it's safe here. All we must ensure is that a slot is advanced on the
replica when it's advanced on the master. For physical slots even that's a
weak requirement, we just have to stop them from falling *too* far behind
and causing too much xlog retention. For logical slots we should ensure we
advance the slot on the replica before any vacuum activity that might
remove catalog tuples still needed by that slot gets replayed. Basically
the difference is that logical slots keep track of the catalog xmin too, so
they have (slightly) stricter requirements.

This patch doesn't touch either of those functions except for
renaming ReplicationSlotsComputeRequiredLSN
to ReplicationSlotsUpdateRequiredLSN. Which, by the way, I really don't
like doing, but I couldn't figure out a name to give the function that
computes-and-returns the required LSN that wouldn't be even more confusing
in the face of having a ReplicationSlotsComputeRequiredLSN function as
well. Ideas welcome.

Updated patch

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-02-22 Thread Oleksii Kliukin
Hi,

> On 16 Feb 2016, at 09:11, Craig Ringer  wrote:
> 
> 
> 
> Revision attached. There was a file missing from the patch too.
> 

All attached patches apply normally. I only took a look at the first 2, but
also tried to run Patroni with the modified version to check whether basic
replication works.

What it’s doing is calling pg_basebackup first to initialize the replica, and 
that actually failed with:

pg_basebackup: unexpected termination of replication stream: ERROR:  requested 
WAL segment 0001 has already been removed

The segment name definitely looks bogus to me.

The actual command causing the failure was an attempt to clone the replica 
using pg_basebackup, turning on xlog streaming:

pg_basebackup --pgdata data/postgres1 --xlog-method=stream 
--dbname="host=localhost port=5432  user=replicator”

I checked the same command against the git master without the patches applied 
and could not reproduce this problem there.

On the code level, I have no comments on 0001, it's well documented and I have
no questions about the approach, although I might not be knowledgeable enough
to judge the specifics of the implementation.

On the 0002, there are a few rough edges:


slots.c:294
elog(LOG, "persistency is %i", (int)slot->data.persistency);

Should be changed to DEBUG?

slot.c:468
Why did you drop “void" as a parameter type of ReplicationSlotDropAcquired?

walsender.c: 1509 at PhysicalConfirmReceivedLocation

I’ve noticed a comment stating that we don’t need to call 
ReplicationSlotSave(), but that pre-dated the WAL-logging of replication slot 
changes. Don’t we need to call it now, the same way it’s done for the logical 
slots in logical.c:at LogicalConfirmReceivedLocation?

Kind regards,
--
Oleksii



Re: [HACKERS] WIP: Failover Slots

2016-02-15 Thread Petr Jelinek
Hi,

here is my code level review:

0001:
This one looks ok except for broken indentation in the new notes in
xlogreader.c and .h. It's maybe slightly overdocumented but given the
complicated way the timeline reading works it's probably warranted.

0002:
+/*
+ * No way to wait for the process since it's not a child
+ * of ours and there's no latch to set, so poll.
+ *
+ * We're checking this without any locks held, but
+ * we'll recheck when we attempt to drop the slot.
+ */
+while (slot->in_use && slot->active_pid == active_pid
+&& max_sleep_micros > 0)
+{
+usleep(micros_per_sleep);
+max_sleep_micros -= micros_per_sleep;
+}

Not sure I buy this - what about postmaster crashes, fast shutdown
requests, etc.? Also, you usleep for 10s, which is quite long. I'd
prefer the classic wait-for-latch with timeout and PG crash check
here. And even if we go with usleep, it should be 1s not 10s, and
pg_usleep instead of usleep.
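Something like this, perhaps - a sketch against the patch's existing fields,
untested:

#include "postgres.h"

#include "miscadmin.h"
#include "replication/slot.h"
#include "storage/ipc.h"
#include "storage/latch.h"

/* Sketch of the latch-based wait with a 1s timeout. */
static void
wait_for_slot_release(ReplicationSlot *slot, pid_t active_pid)
{
	while (slot->in_use && slot->active_pid == active_pid)
	{
		int			rc;

		rc = WaitLatch(MyLatch,
					   WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
					   1000L);
		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);		/* the PG crash check */
		ResetLatch(MyLatch);
		CHECK_FOR_INTERRUPTS(); /* honours fast shutdown requests */
	}
}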

0003:
There is a lot of documentation improvements here that are not related
to failover slots or timeline following, it might be good idea to
split those into separate patch as they are separately useful IMHO.

Other than that it looks good to me.

About the other things discussed in this thread: yes, it makes sense in
certain situations to handle this outside of WAL, and that does require
notions of nodes, etc. That being said, the timeline following is
needed even if this is handled outside of WAL. And once timeline
following is in, the slots can be handled by the replication solution
itself which is good. But I think the failover slots are still a good
thing to have - it provides HA solution for anything that uses slots,
and that includes physical replication as well. If the specific
logical replication solution wants to handle it for some reason itself
outside of WAL, it can create non-failover slot so in my opinion we
ideally need both types of slots (and that's exactly what this patch
gives us).

-- 
  Petr Jelinek  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] WIP: Failover Slots

2016-02-03 Thread Robert Haas
On Fri, Jan 22, 2016 at 11:51 AM, Andres Freund  wrote:
> I think it's technically quite possible to maintain the required
> resources on multiple nodes. The question is how would you configure on
> which nodes the resources need to be maintained? I can't come up with a
> satisfying scheme...

For this to work, I feel like the nodes need names, and a directory
that tells them how to reach each other.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2016-01-27 Thread Craig Ringer
Hi all

Here's v3 of failover slots.

It doesn't add the UI yet, but it's now functionally complete except for
timeline following for logical slots, and I have a plan for that.
From 533a9327b54ba744b0a1fb0048e8cfe7d3d45ea1 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 20 Jan 2016 17:16:29 +0800
Subject: [PATCH 1/2] Implement failover slots

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to
anyway.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

Simon Riggs and Craig Ringer

WIP. Open items:

* Testing
* Implement !failover slots and UI for marking slots as failover slots
* Fix WAL retention for slots created before a basebackup
---
 src/backend/access/rmgrdesc/Makefile   |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c |  63 
 src/backend/access/transam/rmgr.c  |   1 +
 src/backend/access/transam/xlogutils.c |   6 +-
 src/backend/commands/dbcommands.c  |   3 +
 src/backend/replication/basebackup.c   |  12 -
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/logical.c  |  19 +-
 src/backend/replication/logical/logicalfuncs.c |   3 +
 src/backend/replication/slot.c | 439 -
 src/backend/replication/slotfuncs.c|   1 +
 src/bin/pg_xlogdump/replslotdesc.c |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c |   1 +
 src/include/access/rmgrlist.h  |   1 +
 src/include/replication/slot.h |  61 +---
 src/include/replication/slot_xlog.h| 103 ++
 16 files changed, 624 insertions(+), 93 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 000..b882846
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,63 @@
+/*-
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+ReplicationSlotInWAL xlrec;
+
+xlrec = (ReplicationSlotInWAL) rec;
+
+appendStringInfo(buf, "slot %s to xmin=%u, catmin=%u, restart_lsn="UINT64_FORMAT"@%u",
+		NameStr(xlrec->name), xlrec->xmin, xlrec->catalog_xmin,
+		xlrec->restart_lsn, xlrec->restart_tli);
+
+break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+xl_replslot_drop *xlrec;
+
+xlrec = (xl_replslot_drop *) rec;
+
+appendStringInfo(buf, "slot %s", NameStr(xlrec->name));
+
+break;
+			}
+	}
+}
+
+const char *
+replslot_identify(uint8 info)
+{
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			return "CREATE_OR_UPDATE";
+		case XLOG_REPLSLOT_DROP:
+			return "DROP";
+		default:
+			return NULL;
+	}
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 7c4d773..0bd5796 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -24,6 +24,7 @@
 #include "commands/sequence.h"
 #include "commands/tablespace.h"
 #include 

Re: [HACKERS] WIP: Failover Slots

2016-01-26 Thread Craig Ringer
Hi all

Here's v2 of the failover slots patch. It replicates a logical slot to a
physical streaming replica downstream, keeping the slots in sync. After the
downstream is promoted a client can replay from the logical slot.

UI to allow creation of non-failover slots is pending.

There's more testing to do to cover all the corners: drop slots, drop and
re-create, name conflicts between downstream !failover slots and upstream
failover slots, etc.

There's also a known bug where WAL isn't correctly retained for a slot
where that slot was created before a basebackup which I'll fix in a
revision shortly.

I'm interested in ideas on how to better test this.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/

 PostgreSQL Development, 24x7 Support, Training & Services
From c2535eb27c6efc516a6aa7142fd0f59e85d3 Mon Sep 17 00:00:00 2001
From: Craig Ringer 
Date: Wed, 20 Jan 2016 17:16:29 +0800
Subject: [PATCH 1/2] Implement failover slots

Originally replication slots were unique to a single node and weren't
recorded in WAL or replicated. A logical decoding client couldn't follow
a physical standby failover and promotion because the promoted replica
didn't have the original master's slots. The replica may not have
retained all required WAL and there was no way to create a new logical
slot and rewind it back to the point the logical client had replayed to
anyway.

Failover slots lift this limitation by replicating slots consistently to
physical standbys, keeping them up to date and using them in WAL
retention calculations. This allows a logical decoding client to follow
a physical failover and promotion without losing its place in the change
stream.

Simon Riggs and Craig Ringer

WIP. Open items:

* Testing
* Implement !failover slots and UI for marking slots as failover slots
* Fix WAL retention for slots created before a basebackup
---
 src/backend/access/rmgrdesc/Makefile   |   2 +-
 src/backend/access/rmgrdesc/replslotdesc.c |  63 +
 src/backend/access/transam/rmgr.c  |   1 +
 src/backend/commands/dbcommands.c  |   3 +
 src/backend/replication/basebackup.c   |  12 -
 src/backend/replication/logical/decode.c   |   1 +
 src/backend/replication/logical/logical.c  |  19 +-
 src/backend/replication/slot.c | 433 -
 src/backend/replication/slotfuncs.c|   1 +
 src/bin/pg_xlogdump/replslotdesc.c |   1 +
 src/bin/pg_xlogdump/rmgrdesc.c |   1 +
 src/include/access/rmgrlist.h  |   1 +
 src/include/replication/slot.h |  61 +---
 src/include/replication/slot_xlog.h| 103 +++
 14 files changed, 610 insertions(+), 92 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/replslotdesc.c
 create mode 120000 src/bin/pg_xlogdump/replslotdesc.c
 create mode 100644 src/include/replication/slot_xlog.h

diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index c72a1f2..600b544 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -10,7 +10,7 @@ include $(top_builddir)/src/Makefile.global
 
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o gindesc.o gistdesc.o \
 	   hashdesc.o heapdesc.o mxactdesc.o nbtdesc.o relmapdesc.o \
-	   replorigindesc.o seqdesc.o smgrdesc.o spgdesc.o \
+	   replorigindesc.o replslotdesc.o seqdesc.o smgrdesc.o spgdesc.o \
 	   standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/replslotdesc.c b/src/backend/access/rmgrdesc/replslotdesc.c
new file mode 100644
index 000..b882846
--- /dev/null
+++ b/src/backend/access/rmgrdesc/replslotdesc.c
@@ -0,0 +1,63 @@
+/*-
+ *
+ * replslotdesc.c
+ *	  rmgr descriptor routines for replication/slot.c
+ *
+ * Portions Copyright (c) 2015, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/replslotdesc.c
+ *
+ *-
+ */
+#include "postgres.h"
+
+#include "replication/slot_xlog.h"
+
+void
+replslot_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_REPLSLOT_UPDATE:
+			{
+ReplicationSlotInWAL xlrec;
+
+xlrec = (ReplicationSlotInWAL) rec;
+
+appendStringInfo(buf, "slot %s to xmin=%u, catmin=%u, restart_lsn="UINT64_FORMAT"@%u",
+		NameStr(xlrec->name), xlrec->xmin, xlrec->catalog_xmin,
+		xlrec->restart_lsn, xlrec->restart_tli);
+
+break;
+			}
+		case XLOG_REPLSLOT_DROP:
+			{
+xl_replslot_drop *xlrec;
+
+xlrec = (xl_replslot_drop *) rec;
+
+appendStringInfo(buf, "slot %s", NameStr(xlrec->name));
+
+break;
+			}
+	}
+}
+
+const char *

Re: [HACKERS] WIP: Failover Slots

2016-01-25 Thread Craig Ringer
On 23 January 2016 at 00:40, Robert Haas  wrote:


> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want.  Then again, I'm also not sure it isn't what we want.
>

I think it's the most sensible default if there's only going to be one
choice to start with. It's consistent with what we do elsewhere with
replicas so there won't be any surprises.   Failover slots are a fairly
simple feature that IMO just makes slots behave more like you might expect
them to do in the first place.

I'm pretty hesitant to start making cascaded replicas different to each
other just for slots. There are lots of other things where variation
between replicas would be lovely - the most obvious of which is omitting
some databases from some replicas. Right now we have a single data stream,
WAL, that goes to every replica. If we're going to change that I'd really
like to address it in a way that'll meet future needs like selective
physical replication too. I also think we'd want to deal with the problem
of identifying and labeling nodes to do a good job of selective replication
of slots.

I'd like to get failover slots in place for 9.6 since they're fairly
self-contained and meet an immediate need: allowing replication using slots
(physical or logical) to follow a failover event.

After that I want to add logical decoding support for slot
creation/drop/advance. So a logical decoding plugin can mirror logical slot
state on another node. It wouldn't be useful for physical slots, of course,
but it'd allow failover between logical replicas where we can do cool
things like replicate just a subset of DBs/tables/etc. (A mapping of
upstream to downstream LSNs would be required, but hopefully not too hard).
That's post-9.6 work and separate from failover slots, though dependent on
them for the ability to decode slot changes from WAL.

Down the track I'd very much like to be less dependent on forcing
everything though WAL with a single timeline. I agree with Andres that
being able to use failover slots on replicas would be good, and that it's
not possible when we use WAL to track the info. I just think it's a massive
increase in complexity and scope and I'd really like to be able to follow
failover the simple way first, then enhance it.

Nothing stops us changing failover slots in 9.7+ from using WAL to some
other mechanism that we design carefully at that time. We can extend the
walsender command set for physical rep at a major version bump with no
major issues, including adding to the streaming rep protocol. There's lots
to figure out though, including how to maintain slot changes in a strict
ordering with WAL, how to store and read the info, etc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-01-25 Thread Craig Ringer
On 23 January 2016 at 00:51, Andres Freund  wrote:


> Not propagating them through the WAL also has the rather large advantage
> of not barring the way to using such slots on standbys.
>

Yeah. So you could have a read-replica that has a slot and it has child
nodes you can fail over to, but you don't have to have the slot on the
master.

I don't personally find that to be a particularly compelling thing that
says "we must have this" ... but maybe I'm not seeing the full
significance/advantages.


> I think it's technically quite possible to maintain the required
> resources on multiple nodes. The question is how would you configure on
> which nodes the resources need to be maintained? I can't come up with a
> satisfying scheme...
>

That's part of it. There's also the mechanism by which we actually replicate
them - additions to the walsender protocol, how to reliably send
something that doesn't have an LSN, etc. It might be fairly simple, I
haven't thought about it deeply, but I'd rather not go there until the
basics are in place.

BTW, I'm keeping a working tree at
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots. Subject
to rebasing, history not clean. It has a test script in it that'll go away
before patch posting.

Current state needs work to ensure that on-disk and in-memory
representations are kept in sync, but is getting there.



-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-01-22 Thread Robert Haas
On Fri, Jan 22, 2016 at 2:46 AM, Craig Ringer  wrote:
> It's also going to be necessary to handle what happens when a new failover
> slot (physical or logical) is created on the master where it conflicts with
> the name of a non-failover physical slot that was created on the replica. In
> this case I am inclined to terminate any walsender backend for the replica's
> slot with a conflict with recovery, remove its slot and replace it with a
> failover slot. The failover slot does not permit replay while in recovery so
> if the booted-off client reconnects it'll get an ERROR saying it can't
> replay from a failover slot. It should be pretty clear to the admin what's
> happening between the conflict with recovery and the failover slot error.
> There could still be an issue if the client persistently keeps retrying and
> successfully reconnects after replica promotion but I don't see that much
> can be done about that. The documentation will need to address the need to
> try to avoid name conflicts between slots created on replicas and failover
> slots on the master.

That's not going to win any design-of-the-year awards, but maybe it's
acceptable.  It occurred to me to wonder if it might be better to
propagate logical slots partially or entirely outside the WAL stream,
because with this design you will end up with the logical slots on
every replica, including cascaded replicas, and I'm not sure that's
what we want.  Then again, I'm also not sure it isn't what we want.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2016-01-22 Thread Andres Freund
On 2016-01-22 11:40:24 -0500, Robert Haas wrote:
> It occurred to me to wonder if it might be better to
> propagate logical slots partially or entirely outside the WAL stream,
> because with this design you will end up with the logical slots on
> every replica, including cascaded replicas, and I'm not sure that's
> what we want.  Then again, I'm also not sure it isn't what we want.

Not propagating them through the WAL also has the rather large advantage
of not barring the way to using such slots on standbys.

I think it's technically quite possible to maintain the required
resources on multiple nodes. The question is how would you configure on
which nodes the resources need to be maintained? I can't come up with a
satisfying scheme...




Re: [HACKERS] WIP: Failover Slots

2016-01-21 Thread Robert Haas
On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs  wrote:
> Failover Slots
> In the current patch, any slot defined on a master will generate WAL,
> leading to a pending-slot being present on all standby nodes. When a standby
> is promoted, the slot becomes usable and will have the properties as of the
> last fsync on the master.

No objection to the concept, but I think the new behavior needs to be
optional.  I am guessing that you are thinking along similar lines,
but it's not explicitly called out here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] WIP: Failover Slots

2016-01-21 Thread Simon Riggs
On 21 January 2016 at 16:31, Robert Haas  wrote:

> On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs  wrote:
> > Failover Slots
> > In the current patch, any slot defined on a master will generate WAL,
> > leading to a pending-slot being present on all standby nodes. When a
> standby
> > is promoted, the slot becomes usable and will have the properties as of
> the
> > last fsync on the master.
>
> No objection to the concept, but I think the new behavior needs to be
> optional.  I am guessing that you are thinking along similar lines,
> but it's not explicitly called out here.
>

I was unsure myself, but making them optional seems reasonable.

-- 
Simon Riggs    http://www.2ndQuadrant.com/

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-01-21 Thread Craig Ringer
On 22 January 2016 at 00:31, Robert Haas  wrote:

> On Fri, Jan 1, 2016 at 7:50 PM, Simon Riggs  wrote:
> > Failover Slots
> > In the current patch, any slot defined on a master will generate WAL,
> > leading to a pending-slot being present on all standby nodes. When a
> standby
> > is promoted, the slot becomes usable and will have the properties as of
> the
> > last fsync on the master.
>
> No objection to the concept, but I think the new behavior needs to be
> optional.  I am guessing that you are thinking along similar lines,
> but it's not explicitly called out here.


Yeah, I think that's the idea too. For one thing we'll want to allow
non-failover slots to continue to be usable from a streaming replica, but
we must ERROR if anyone attempts to attach to and replay from a failover
slot via a replica since we can't write WAL there. Both kinds are needed.

There's a 'failover' bool member in the slot persistent data for that
reason. It's not (yet) exposed via the UI.

I presume we'll want to:

* add a new default-false argument is_failover_slot or similar to
pg_create_logical_replication_slot and pg_create_physical_replication_slot

* Add a new optional flag argument FAILOVER to CREATE_REPLICATION_SLOT in
both its LOGICAL and PHYSICAL forms.

... and will be adding that to this patch, barring syntax objections etc.
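For example - hypothetical, since none of this syntax exists yet - the
walsender form might be exercised like so from libpq:

#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* replication=database gives a walsender that accepts replication commands */
	PGconn	   *conn = PQconnectdb("host=master dbname=postgres replication=database");
	PGresult   *res;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		return 1;
	}

	/* Proposed syntax only; today's grammar has no FAILOVER flag. */
	res = PQexec(conn, "CREATE_REPLICATION_SLOT myslot LOGICAL test_decoding FAILOVER");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		fprintf(stderr, "create failed: %s", PQerrorMessage(conn));
	PQclear(res);
	PQfinish(conn);
	return 0;
}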


It's also going to be necessary to handle what happens when a new failover
slot (physical or logical) is created on the master where it conflicts with
the name of a non-failover physical slot that was created on the replica.
In this case I am inclined to terminate any walsender backend for the
replica's slot with a conflict with recovery, remove its slot and replace
it with a failover slot. The failover slot does not permit replay while in
recovery so if the booted-off client reconnects it'll get an ERROR saying
it can't replay from a failover slot. It should be pretty clear to the
admin what's happening between the conflict with recovery and the failover
slot error. There could still be an issue if the client persistently keeps
retrying and successfully reconnects after replica promotion but I don't
see that much can be done about that. The documentation will need to
address the need to try to avoid name conflicts between slots created on
replicas and failover slots on the master.

I'll be working on those docs, on the name conflict handling, and on the
syntax during my coming flight. Toddler permitting.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-01-20 Thread Craig Ringer
On 20 January 2016 at 21:02, Craig Ringer  wrote:


> TL;DR: this doesn't work yet, working on it.
>

Nothing is logged on slot creation because ReplicationSlot->data.failover
is never true. Once that's fixed by - for now - making all slots failover
slots, there's a crash in XLogInsert because of the use of reserved bits in
the XLogInsert info argument. Fix pushed.

I also noticed that slot drops were being logged whether or not the
slot is a failover slot. Pushed a fix for that.

The WAL writing is now working. I've made improvements to the rmgr xlogdump
support to make it clearer what's written.

Slots are still not visible on the replica so there's work to do tracing
redo, promotion, slot handling after starting from a basebackup, etc. The
patch is still very much WIP.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] WIP: Failover Slots

2016-01-20 Thread Craig Ringer
On 2 January 2016 at 08:50, Simon Riggs  wrote:


> Patch is WIP, posted for comment, so you can see where I'm going.
>


I've applied this on a branch of master and posted it, with some comment
editorialization, as
https://github.com/2ndQuadrant/postgres/tree/dev/failover-slots. The tree
will be subject to rebasing.

At present the patch does not appear to work. No slots are visible in the
replica's pg_replication_slots before or after promotion and no slot
information is written to the xlog according to pg_xlogdump:

$ ~/pg/96/bin/pg_xlogdump -r ReplicationSlot 00010001
00010003
pg_xlogdump: FATAL:  error in WAL record at 0/301DDE0: invalid record
length at 0/301DE10

so it's very much a WIP. I've read through it and think the idea makes
sense, it's just still missing some pieces...



So. Initial review comments.



This looks pretty incomplete:

+ if (found_duplicate)
+ {
+ LWLockRelease(ReplicationSlotAllocationLock);
+
+ /*
+ * Do something nasty to the sinful duplicants, but
+ * take with locking.
+ */
+
+ LWLockAcquire(ReplicationSlotAllocationLock, LW_SHARED);
+ }

... and I'm not sure I understand how the possibility of a duplicate slot
can arise in the first place, since you cannot create a slot on a read
replica. This seems unnecessary.



I'm not sure I understand why, in ReplicationSlotRedoCreate, it's
especially desirable to prevent blocking iteration over
pg_replication_slots or acquiring a slot. When redoing a slot create isn't
that exactly what we should do? This looks like it's been copied & pasted
verbatim from CheckPointReplicationSlots. There it makes sense, since the
slots may be in active use. During redo it seems reasonable to just
acquire ReplicationSlotControlLock.



I'm not a fan of the ReplicationSlotInWAL typedef
for ReplicationSlotPersistentData. Especially as it's only used in the redo
functions but *not* when xlog is written out. I'd like to just replace it.


Purely for convenient testing there's a shell script in the tree:
https://github.com/2ndQuadrant/postgres/blob/dev/failover-slots/failover-slot-test.sh
Assuming a patched 9.6 in $HOME/pg/96 it'll do a run-through of the patch.
I'll attempt to convert this to use the new test infrastructure, but needed
something ASAP for development. Posted in case it's useful to others.

Now it's time to look into why this doesn't seem to be generating any xlog
when by rights it seems like it should, and into exactly when we should
purge existing slots on startup / promotion of a read replica.


TL;DR: this doesn't work yet, working on it.


-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services