[Proposal] pg_rewind integration into core

2022-03-23 Thread RKN Sai Krishna
Hi,

It's possible to have a good number of standbys (in the context of async
streaming replication) as part of a client's architecture. Rather than
asking the client to deal with the intricacies of comparing the LSN of each
standby with that of the primary and running pg_rewind, isn't it a good
idea to integrate pg_rewind into the startup logic and perform a rewind on
an as-needed basis?

Consider the scenario where the primary is ahead of the sync standbys: upon
promotion of a standby, pg_rewind is needed on the old primary if it is to
come back up as a standby. Similarly, in scenarios where async standbys (in
the physical replication context) get ahead of the sync standbys, upon
promotion of a sync standby, pg_rewind needs to be performed on the async
standbys that are ahead of the standby being promoted.

With these scenarios under consideration, integrating pg_rewind into the
postgres core might be a better option IMO. We could optionally perform a
pg_rewind dry run during standby startup and, based on the result, perform
the rewind and bring the standby in sync with the primary.
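Roughly, the startup-time check I have in mind could look like the following
Python sketch. This is only an illustration: the data directory path and
connection string are placeholders, and keying off pg_rewind's "no rewind
required" message is an assumption about its output, not a settled design.

```python
import subprocess

def build_rewind_cmd(target_pgdata, source_conninfo, dry_run=True):
    # pg_rewind's --dry-run flag performs all the checks but stops short
    # of modifying the target data directory.
    cmd = ["pg_rewind", "--target-pgdata", target_pgdata,
           "--source-server", source_conninfo]
    if dry_run:
        cmd.append("--dry-run")
    return cmd

def rewind_needed(target_pgdata, source_conninfo):
    # Assumption: when the clusters have not diverged, pg_rewind reports
    # "no rewind required"; we key off that message here.
    probe = subprocess.run(build_rewind_cmd(target_pgdata, source_conninfo),
                           capture_output=True, text=True)
    return "no rewind required" not in (probe.stdout + probe.stderr)
```

Startup would then run the real rewind (dry_run=False) only when the probe
says it is needed, before entering recovery as a standby.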

Would like to invite more thoughts from the hackers.

Regards,
RKN


Re: pg_walinspect - a new extension to get raw WAL data and WAL stats

2022-03-25 Thread RKN Sai Krishna
Hi Bharath,

First look at the patch; bear with me if any of the following comments have
already been made.
1. With pg_get_wal_record(lsn), say a WAL record's start/end LSN range
contains the specified LSN; wouldn't it be more meaningful to show that
WAL record?
For example, upon providing '0/17335E7' as input, I get the WAL record
('0/1733618', '0/173409F') as output, and not the one with start and end
LSN ('0/17335E0', '0/1733617').

For comparison, with pg_walfile_name(lsn) we can find the WAL segment file
name that contains the specified LSN.
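To make the containment argument concrete, here is a small Python sketch of
the LSN comparison (the pg_lsn text format 'hi/lo' in hex is standard; the
helper names are mine):

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn-style string like '0/17335E7' to an integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def contains(start_lsn: str, end_lsn: str, target: str) -> bool:
    # A record "contains" the target if it falls in [start_lsn, end_lsn].
    return lsn_to_int(start_lsn) <= lsn_to_int(target) <= lsn_to_int(end_lsn)

# The record starting after the input LSN does not contain it ...
print(contains('0/1733618', '0/173409F', '0/17335E7'))  # False
# ... while the preceding record does, and is arguably the one to return.
print(contains('0/17335E0', '0/1733617', '0/17335E7'))  # True
```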

2. I see the following output for pg_get_wal_record; the column widths and
spacing need a look, I suppose.
rkn=# select * from pg_get_wal_record('0/4041728');
 start_lsn |  end_lsn  | prev_lsn  | record_length | record
-----------+-----------+-----------+---------------+--------
[output trimmed: the header row and its dashed separator wrap across dozens
of mostly blank lines because the "record" column is extremely wide]

pg_rewind enhancements

2022-03-04 Thread RKN Sai Krishna
Hi,

While using pg_rewind, I found it a bit difficult to use because it copies
even the configuration files, and it also removes files created on the old
primary that may not be present on the new primary. Similarly, it copies
files from the new primary's data directory that may not be needed, or that
could even be junk files.

I would propose adding a couple of new command line arguments to
pg_rewind. First, a comma-separated list of files to be preserved on the
old primary, in other words files that shouldn't be overwritten from the
new primary. Second, a comma-separated list of files to be excluded while
copying files from the new primary onto the old primary.

Would like to invite more thoughts from the hackers.

Regards,
RKN


Query regarding replication slots

2022-01-11 Thread RKN Sai Krishna
Hi All,

I have a very basic question about replication slots. Why should the
master/primary server maintain the replication slot info, such as the LSN
corresponding to each standby server? Instead, why can't each standby
server send the LSN that it needs, with the master/primary maintaining the
minimum LSN across all of the standby servers, so that that information can
be used for cleanup/removal of WAL segments?

The minimum LSN could also be streamed to all of the standby servers along
with the WAL records, so that cleanup on the standbys likewise happens
according to the minimum LSN. Then, even if the primary crashes, any
standby becoming the master knows the minimum LSN, and the WAL records
required by all of the remaining standbys remain intact.
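A minimal sketch of the bookkeeping this idea implies, with made-up standby
feedback LSNs (the helper names and sample values are mine, purely for
illustration):

```python
def lsn_to_int(lsn):
    # Parse the textual pg_lsn form "hi/lo" (both parts hexadecimal).
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(n):
    return f"{n >> 32:X}/{n & 0xFFFFFFFF:X}"

# Hypothetical feedback: the oldest LSN each standby says it still needs.
standby_needs = ["16/4C000010", "16/4BFFF268", "15/FF000000"]

# The primary would only need to retain WAL from the minimum of these;
# segments entirely before the cutoff are safe to recycle.
cutoff = int_to_lsn(min(lsn_to_int(l) for l in standby_needs))
print(cutoff)  # 15/FF000000
```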

Thanks,
RKN




Isolation levels on primary and standby

2022-01-13 Thread RKN Sai Krishna
Hello All,

It looks like we could have different isolation levels on the primary and
standby servers in the context of replication. If the primary crashes and a
standby is made the primary, query results could change because of the
differing isolation levels. Is that expected?

Thanks,
RKN




pg_page_repair: a tool/extension to repair corrupted pages in postgres with streaming/physical replication

2022-06-21 Thread RKN Sai Krishna
Hi,

Problem: Today, when a data page is corrupted on the primary in a postgres
setup with physical replication (sync or async standbys), there seems to be
no way to repair it easily: we rely on PITR to recreate the postgres server,
or we drop the corrupted table (of course not an option for important
customer tables, though it may be okay for some maintenance or temporary
tables). PITR is often costly in a production environment, as it involves
creating a full-blown postgres instance from the base backup and causes
downtime for the customers.

Solution: Make use of the uncorrupted copy of the page present on a sync or
async standby. The proposed tool/extension pg_page_repair (as we call it)
can fetch the uncorrupted page from a standby and overwrite the corrupted
one on the primary. Yes, there is a challenge in making sure that the WAL
has been replayed completely and the standby is up to date, so that stale
pages are not copied across. A simpler idea is that pg_page_repair can wait
until the standby catches up with the primary's flush LSN before fetching
the page. A downside of this approach is that pg_page_repair may wait for a
long time, or indefinitely, if the replication lag is huge. However, since
replication lag is something any good postgres deployment monitors and
keeps low, pg_page_repair should in practice not wait for long. Another
idea is that pg_page_repair gets a base page from the standby and applies
all the WAL records pertaining to the corrupted page on top of it to
reconstruct the uncorrupted page. This requires pulling the replay logic
out of the core into pg_page_repair, which isn't easy. Hence we propose to
go with approach 1, but are open to discussing approach 2 as well. We
expect the solution proposed in this thread to hold good even for index
pages.
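Once the standby has caught up, the tool still needs to confirm that the
fetched page is new enough; the page header's pd_lsn gives exactly that. A
rough Python sketch, under stated assumptions (8kB default block size, and
pd_lsn stored as the first eight bytes of PageHeaderData as two uint32s in
little-endian order, as on common platforms); the fabricated page is purely
for illustration:

```python
import struct

BLCKSZ = 8192  # default postgres block size

def page_lsn(page: bytes) -> str:
    """Read pd_lsn from a raw page.  Assumption: the first eight bytes of
    PageHeaderData are pd_lsn as two uint32s (xlogid, xrecoff), stored
    little-endian on common platforms."""
    xlogid, xrecoff = struct.unpack_from("<II", page, 0)
    return f"{xlogid:X}/{xrecoff:X}"

# A fabricated 8kB page whose pd_lsn is 16/4BFFF268:
page = struct.pack("<II", 0x16, 0x4BFFF268) + b"\x00" * (BLCKSZ - 8)
print(page_lsn(page))  # 16/4BFFF268
```

pg_page_repair could compare this pd_lsn against the LSN of the WAL record
that last touched the page, to verify the standby's copy is not stale.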

Implementation Choices: pg_page_repair can either take the corrupted page
info (db id, rel id, block number, etc.), or take just a relation name and
automatically figure out the corrupted page (using pg_checksums, for
instance), or take just a database name and automatically figure out all
the corrupted pages. It can either repair the corrupted pages online (only
the corrupted table is inaccessible; the server continues to run) or take
downtime if there are many corrupted pages.

Future Scope: pg_page_repair could be integrated into the core so that
postgres repairs the pages automatically, without manual intervention.

Other Solutions: We did consider an approach where the tool obtains the FPI
from the WAL and replays up to the latest WAL record to repair the page,
but there are limitations, such as the FPI and related WAL no longer being
available in the primary or the archive location.

Thoughts?

Credits (cc-ed): thanks to SATYANARAYANA NARLAPURAM for initial thoughts
and thanks to Bharath Rupireddy, Chen Liang, mahendrakar s and Rohan Kumar
for internal discussions.

Thanks, RKN


pg_receivewal unable to connect to promoted standby

2022-06-24 Thread RKN Sai Krishna
Hi,

I'm trying to have a setup with a primary, a standby, and pg_receivewal
(which acts as a server that maintains the entire WAL). The quorum is any
one of the standby and pg_receivewal. When the primary crashes and I
promote the standby (timeline switch from 5 to 6) and restart
pg_receivewal to connect to the promoted standby, I get an error saying
"pg_receivewal: could not send replication command "START_REPLICATION":
ERROR:  requested starting point 16/4C00 on timeline 5 is not in this
server's history. This server's history forked from timeline 5 at
16/4BFFF268".

pg_receivewal's latest LSN is 16/4BFFF268, with timeline ID 5.

Just wondering why pg_receivewal requests the new primary with starting
point 16/4C00, even though its latest LSN is 16/4BFFF268.

Is that because of the following code snippet in pg_receivewal by any
chance?

    /*
     * Move the starting pointer to the start of the next segment, if the
     * highest one we saw was completed. Otherwise start streaming from
     * the beginning of the .partial segment.
     */
    if (!high_ispartial)
        high_segno++;

If it is because of the above code, can we let pg_receivewal request WAL
from the new primary starting at the fork LSN (by asking the new primary
what the fork LSN and the corresponding timeline are)?
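That snippet does seem to explain the requested starting point: with the
default 16MB segment size, rounding the last flushed position up to the
next segment boundary lands past the fork point. A quick Python check of
the arithmetic (the helper names are mine):

```python
WAL_SEG_SIZE = 16 * 1024 * 1024  # default 16MB WAL segments

def lsn_to_int(lsn):
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def int_to_lsn(n):
    return f"{n >> 32:X}/{n & 0xFFFFFFFF:X}"

def next_segment_start(lsn):
    # When the highest segment on disk is complete (not .partial),
    # pg_receivewal bumps the segment number, so it asks the server to
    # start streaming from the next segment boundary.
    n = lsn_to_int(lsn)
    return int_to_lsn((n // WAL_SEG_SIZE + 1) * WAL_SEG_SIZE)

print(next_segment_start("16/4BFFF268"))  # 16/4C000000
```

Since the fork happened at 16/4BFFF268, inside that last complete segment,
the rounded-up start point lies beyond the fork on the old timeline, hence
the error.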

Thanks,
RKN