On Mon, Nov 7, 2022 at 11:17 PM Bharath Rupireddy < bharath.rupireddyforpostg...@gmail.com> wrote:
> On Tue, Nov 8, 2022 at 12:08 PM sirisha chamarthi > <sirichamarth...@gmail.com> wrote: > > > > On Fri, Nov 4, 2022 at 11:02 PM Amit Kapila <amit.kapil...@gmail.com> > wrote: > >> > >> On Fri, Nov 4, 2022 at 1:40 PM sirisha chamarthi > >> <sirichamarth...@gmail.com> wrote: > >> > > >> > A replication slot can be lost when a subscriber is not able to catch > up with the load on the primary and the WAL to catch up exceeds > max_slot_wal_keep_size. When this happens, target has to be reseeded > (pg_dump) from the scratch and this can take longer. I am investigating the > options to revive a lost slot. > >> > > >> > >> Why in the first place one has to set max_slot_wal_keep_size if they > >> care for WAL more than that? > > > > Disk full is a typical use where we can't wait until the logical slots > to catch up before truncating the log. > > If the max_slot_wal_keep_size is set appropriately and the replication > lag is monitored properly along with some automatic actions such as > replacing/rebuilding the standbys or subscribers (which may not be > easy and cheap though), the chances of hitting the "lost replication" > problem becomes less, but not zero always. > pg_dump and pg_restore can take several hours to days on a large database. Keeping the WAL in the pg_wal folder (faster, smaller and costly disks?) is not always an option. > > >> If you have a case where you want to > >> handle this case for some particular slot (where you are okay with the > >> invalidation of other slots exceeding max_slot_wal_keep_size) then the > >> other possibility could be to have a similar variable at the slot > >> level but not sure if that is a good idea because you haven't > >> presented any such case. > > > > IIUC, ability to fetch WAL from the archive as a fall back mechanism > should automatically take care of all the lost slots. Do you see a need to > take care of a specific slot? If the idea is not to download the wal files > in the pg_wal directory, they can be placed in a slot specific folder > (data/pg_replslot/<slot>/) until they are needed while decoding and can be > removed. > > Is the idea here the core copying back the WAL files from the archive? > If yes, I think it is not something the core needs to do. This very > well fits the job of an extension or an external module that revives > the lost replication slots by copying WAL files from archive location. > The current code is throwing an error that the slot is lost because the restart_lsn is set to invalid LSN when the WAL is truncated by checkpointer. In order to build an external service that can revive a lost slot, at the minimum we needed the patch attached. > > Having said above, what's the best way to revive a lost replication > slot today? Any automated way exists today? It seems like > pg_replication_slot_advance() doesn't do anything for the > invalidated/lost slots. > If the WAL is available in the pg_wal directory, the replication stream resumes normally when the client connects with the patch I posted. > > If it's a streaming replication slot, the standby will anyway jump to > archive mode ignoring the replication slot and the slot will never be > usable again unless somebody creates a new replication slot and > provides it to the standby for reuse. > If it's a logical replication slot, the subscriber will start to > diverge from the publisher and the slot will have to be revived > manually i.e. created again. > Physical slots can be revived with standby downloading the WAL from the archive directly. This patch is helpful for the logical slots. > > -- > Bharath Rupireddy > PostgreSQL Contributors Team > RDS Open Source Databases > Amazon Web Services: https://aws.amazon.com >