On 05/29/2015 10:45 AM, Stephen Frost wrote:
> Andres,
>
> * Andres Freund (and...@anarazel.de) wrote:
>> On 2015-05-29 10:15:56 -0700, Josh Berkus wrote:
>>> pg_drop_replication_slot() can be a time-critical function when the
>>> master is running out of disk space because the replica is falling
>>> behind.
>>
>> I don't buy this argument. The same is true for DROP TABLE, TRUNCATE,
>> DROP DATABASE etc.
>
> I disagree about that being the same.
>
>> I mean, I agree it'd be convenient, but I can't see it as "critical".

So, here's the scenario:

1. You're almost out of disk space due to a replica falling behind, like
   down to 16MB left. Or maybe you are out of disk space.

2. You need to drop the laggy replication slots in a hurry to get your
   master working again.

3. Now you have to do this timing-sensitive two-stage drop to make it
   work.

When our users are having production emergencies, I don't think that it's
helpful for us to make the process of getting out of those situations
more complicated than it absolutely has to be.

> Just a random thought- do we check the LOGIN attribute for replication
> connections? If so, you could tweak that, but that may be an issue if
> you have multiple replicas using the same role.
>
> I'm not sure that it's *critical*, but I could see an argument for
> adding this post-feature-freeze, which I'm guessing is what Josh was
> getting at.

Well, I'll let others decide that. If we could come up with a script
which would reliably do the terminate-then-drop, it would be fine for
9.5. I'm not sure that's possible though, because I don't see any way
to infallibly relate the pg_stat_replication entry with the
pg_replication_slots entry. Imagine having 3 slots and 6 replicas, and
only one slot is behind; how do you figure out what to terminate?

>
>>> While I'm just doing this during testing, it could be a critical fail in
>>> production. I think the simplest way to resolve this would be to add a
>>> boolean flag to pg_drop_replication_slot(), which would terminate the
>>> replication connection and delete the slot as a single operation.
>>
>> There's no "single operation" for terminating a backend *and* doing
>> something...
>
> That's a good point, we'd need to figure out how to make this actually
> work reliably in the face of a very fast reconnecting process, if we're
> going to do it.

Yeah, which means that this is probably something for 9.6. Although if
we can at least come up with something for the documentation for 9.5, it
would be really helpful.
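
For the documentation angle, here is a rough sketch of what a terminate-then-drop helper might look like. It is only an illustration under stated assumptions, not a tested procedure. It assumes the server exposes the PID of the backend currently holding each slot (shown here as an `active_pid` field, which this thread does not confirm is available), and that the caller has already worked out how much WAL each slot is retaining (the `retained_bytes` field). The names `Slot`, `quote_literal` and `plan_emergency_drop` are made up for the example; only `pg_terminate_backend()` and `pg_drop_replication_slot()` are existing server functions. The sketch only builds the statements to run, and it does not solve the fast-reconnect race described above.

```python
from typing import Iterable, List, NamedTuple, Optional


class Slot(NamedTuple):
    """One row of slot information gathered by the caller (hypothetical shape)."""
    slot_name: str
    # Assumption: PID of the backend currently using the slot, or None
    # if no connection holds it.
    active_pid: Optional[int]
    # Assumption: amount of WAL the slot forces the master to keep,
    # computed by the caller.
    retained_bytes: int


def quote_literal(value: str) -> str:
    """Quote a string as an SQL literal by doubling single quotes."""
    return "'" + value.replace("'", "''") + "'"


def plan_emergency_drop(slots: Iterable[Slot], max_retained_bytes: int) -> List[str]:
    """Return SQL statements, in order, for every slot retaining too much WAL.

    For a slot that is in use, the terminate statement comes first and the
    drop statement second. A replica that reconnects between the two can
    still make the drop fail, so the caller has to be ready to retry.
    """
    statements: List[str] = []
    for slot in slots:
        if slot.retained_bytes <= max_retained_bytes:
            continue  # this slot is keeping up; leave it alone
        if slot.active_pid is not None:
            statements.append(
                f"SELECT pg_terminate_backend({int(slot.active_pid)});"
            )
        statements.append(
            f"SELECT pg_drop_replication_slot({quote_literal(slot.slot_name)});"
        )
    return statements


if __name__ == "__main__":
    # Three slots, only one of which is behind.
    example = [
        Slot("slot_a", 101, 16),
        Slot("slot_b", 202, 5000),
        Slot("slot_c", 303, 32),
    ]
    for statement in plan_emergency_drop(example, max_retained_bytes=1000):
        print(statement)
```

Emitting the statements instead of executing them keeps a human in the loop during an emergency, and makes it obvious which backend is about to be terminated for which slot.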

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com