On 2022-09-29 17:18, Polina Bungina wrote:
I agree with your suggestions, so here is the updated version of
patch. Hope I haven't missed anything.

Regards,
Polina Bungina

Thanks for working on this!
It seems like we are also facing the same issue.

I tested the v3 patch under our conditions, and the old primary succeeded in becoming the new standby.


BTW, when I used pg_rewind-removes-wal-segments-reproduce.sh attached in [1], the old primary also failed to become a standby:

FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000000000007 has already been removed

However, I think this is not a problem: just adding a restore_command like the one below fixed the situation.

echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> oldprim/postgresql.conf
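
As a minimal sketch, the same line can be generated for any archive directory. The helper name make_restore_command is hypothetical and is not part of the attached script; it only prints the configuration line, and %f and %p are left for PostgreSQL to expand.

```shell
# Hypothetical helper, not part of the attached script: print a
# restore_command line that copies WAL segments out of the archive
# directory given as $1. %f and %p are expanded by PostgreSQL itself,
# so they are emitted literally here (%% in the printf format).
make_restore_command() {
    printf "restore_command = '/bin/cp %s/%%f %%p'\n" "$1"
}

# Example usage (assumes newarch/ and oldprim/ exist in the current directory):
# make_restore_command "$(pwd)/newarch" >> oldprim/postgresql.conf
```

The archive path is assumed not to contain single quotes, since the value is wrapped in single quotes in postgresql.conf.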

Attached modified reproduction script for reference.

[1]https://www.postgresql.org/message-id/CAFh8B%3DnNiFZOAPsv49gffxHBPzwmZ%3D6Msd4miMis87K%3Dd9rcRA%40mail.gmail.com


--
Regards,

--
Atsushi Torikoshi
NTT DATA CORPORATION
mkdir newarch oldarch
initdb -k -D oldprim
echo "archive_mode = 'on'">> oldprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/oldarch/%f'">> oldprim/postgresql.conf
pg_ctl -D oldprim -o '-p 5432' -l oldprim.log start
psql -p 5432 -c 'create table t(a int)'
pg_basebackup -D newprim -p 5432
echo "primary_conninfo='host=/tmp port=5432'">> newprim/postgresql.conf
echo "archive_command = 'echo "archive %f" >&2; cp %p `pwd`/newarch/%f'">> newprim/postgresql.conf
touch newprim/standby.signal
pg_ctl -D newprim -o '-p 5433' -l newprim.log start

# the last common checkpoint
psql -p 5432 -c 'checkpoint'

# old primary cannot archive any more
echo "archive_command = 'false'">> oldprim/postgresql.conf
pg_ctl -D oldprim reload
# advance WAL on the old primary; four WAL segments will never make it to the archive
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done

# record approx. diverging WAL segment
start_wal=`psql -p 5432 -Atc "select pg_walfile_name(pg_last_wal_replay_lsn() - (select setting from pg_settings where name = 'wal_segment_size')::int);"`
pg_ctl -D newprim promote

# old primary loses diverging WAL segment
for i in $(seq 1 4); do psql -p 5432 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5432 -c 'checkpoint;'
psql -p 5433 -c 'checkpoint;'

pg_ctl -D oldprim stop

# rewind the old primary, using its own archive
# pg_rewind -D oldprim --source-server='port=5433' # should fail
echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/oldarch/%f %p'">> oldprim/postgresql.conf
pg_rewind -D oldprim --source-server='port=5433' -c

# advance WAL on the new primary; new primary loses the launching WAL seg
for i in $(seq 1 4); do psql -p 5433 -c 'insert into t values(0); select pg_switch_wal();'; done
psql -p 5433 -c 'checkpoint'
echo "primary_conninfo='host=/tmp port=5433'">> oldprim/postgresql.conf
touch oldprim/standby.signal
echo "restore_command = '/bin/cp `pwd`/newarch/%f %p'" >> oldprim/postgresql.conf

postgres -D oldprim  # fails with "WAL file has been removed"

## The alternative of copying-in
## echo "restore_command = 'echo "restore %f" >&2; cp `pwd`/newarch/%f %p'">> oldprim/postgresql.conf
#
## copy-in WAL files from new primary's archive to old primary
#(cd newarch;
#for f in `ls`; do
#  if [[ "$f" > "$start_wal" ]]; then echo copy $f; cp $f ../oldprim/pg_wal; fi
#done)
#
#postgres -D oldprim  # also fails with "requested WAL segment XXX has already been removed"
