Standby gets invalid primary checkpoint after crashing right after promotion

hao harry Wed, 16 Mar 2022 00:16:42 -0700

Hi, pgsql-hackers,

I think I have found a case where the database becomes unrecoverable. Could
you please take a look?


Here is how it happens:

- Set up a primary and a standby.
- Run a large number of INSERTs on the primary.
- Create a checkpoint on the primary.
- Wait until the standby starts a restartpoint; syncing buffers takes about
  3 minutes to complete.
- Before the restartpoint updates ControlFile, promote the standby. Promotion
  changes ControlFile->state to DB_IN_PRODUCTION, so the restartpoint skips
  its ControlFile update, leaving ControlFile->checkPoint pointing to a file
  that has already been removed.
- Before the promoted standby requests the post-recovery checkpoint (fast
  promotion), one backend crashes. The crash kills the other server
  processes, so the post-recovery checkpoint is skipped.
- The database restarts, and the startup process reports: "could not locate a
  valid checkpoint record".
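
To make the sequence above concrete, here is a toy C model of it. It is only a
sketch under my own assumptions: ToyControlFile, seg_exists and every toy_*
function are hypothetical names invented for illustration, and the real
CreateRestartPoint is considerably more involved. The model only shows why
skipping the ControlFile update while still removing old files leaves nothing
valid to restart from.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names mirror the terms used in the report,
 * not the actual PostgreSQL structures. */
typedef enum { DB_IN_ARCHIVE_RECOVERY, DB_IN_PRODUCTION } DBState;

typedef struct
{
    DBState state;
    int     checkPoint;     /* number of the file holding the checkpoint record */
} ToyControlFile;

#define MAX_SEG 16
static bool seg_exists[MAX_SEG];    /* which files are still on disk */

/* Hypothetical simplification of a restartpoint racing with promotion. */
static void
toy_restartpoint(ToyControlFile *cf, int new_ckpt_seg, bool promoted_meanwhile)
{
    assert(new_ckpt_seg >= 0 && new_ckpt_seg < MAX_SEG);

    /* Promotion sneaks in while the restartpoint is still syncing buffers. */
    if (promoted_meanwhile)
        cf->state = DB_IN_PRODUCTION;

    /* The control file is only advanced while still in recovery... */
    if (cf->state == DB_IN_ARCHIVE_RECOVERY)
        cf->checkPoint = new_ckpt_seg;

    /* ...but files older than the new checkpoint are removed either way. */
    for (int i = 0; i < new_ckpt_seg; i++)
        seg_exists[i] = false;
}

/* After a crash, startup needs the file that checkPoint points to. */
static bool
toy_can_restart(const ToyControlFile *cf)
{
    return seg_exists[cf->checkPoint];
}

/* Run the whole scenario; the crash itself is implicit: no post-recovery
 * checkpoint ever repairs the control file before the restart. */
static bool
toy_scenario(bool promoted_during_restartpoint)
{
    ToyControlFile cf = { DB_IN_ARCHIVE_RECOVERY, 0 };

    for (int i = 0; i < MAX_SEG; i++)
        seg_exists[i] = true;

    toy_restartpoint(&cf, 5, promoted_during_restartpoint);
    return toy_can_restart(&cf);
}
```

In this model, toy_scenario(false) returns true because the control file
advances to the new checkpoint, while toy_scenario(true) returns false because
checkPoint still refers to a file that the restartpoint removed.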

I have attached a test that reproduces it. It does not fail every time; for
me it fails about once in every 10 runs.
To increase the chance that CreateRestartPoint skips the ControlFile update,
and to simulate a crash, patch 0001 is needed.

Best regards,

Harry Hao

0001-Patched-CreateRestartPoint-to-reproduce-invalid-chec.patch

# Copyright (c) 2021-2022, PostgreSQL Global Development Group

# This test reproduces an error caused by a crash right after promotion.
# The log says: "could not locate a valid checkpoint record".
use strict;
use warnings;
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Initialize a primary
my $alpha = PostgreSQL::Test::Cluster->new('alpha');
$alpha->init(allows_streaming => 1);
$alpha->start;

# To simulate a backend crash
my $regress_shlib = $ENV{REGRESS_SHLIB};
$alpha->safe_psql('postgres', <<EOSQL);
CREATE FUNCTION chaos_sigsegv()
   RETURNS void
   AS '$regress_shlib'
   LANGUAGE C STRICT;
EOSQL

# Initialize a standby
$alpha->backup('bkp');
my $bravo = PostgreSQL::Test::Cluster->new('bravo');
$bravo->init_from_backup($alpha, 'bkp', has_streaming => 1);
$bravo->append_conf('postgresql.conf', <<EOF);
log_checkpoints=true
log_min_messages=DEBUG
checkpoint_timeout=1h
max_wal_size=100GB
restart_after_crash=true
EOF
$bravo->start;

# Dummy table for the upcoming tests.
$alpha->safe_psql('postgres', 'create table test1 (a int)');
$alpha->safe_psql('postgres',
	'insert into test1 select generate_series(1, 1000000)');

# Take a checkpoint
$alpha->safe_psql('postgres', 'checkpoint');

my $in  = '';
my $out = '';
my $timer = IPC::Run::timeout(180);
my $h = $bravo->background_psql('postgres', \$in, \$out, $timer,
    on_error_stop => 0);
$in .= q{
checkpoint;
};

my $in2  = '';
my $out2 = '';
my $h2 = $bravo->background_psql('postgres', \$in2, \$out2, $timer,
     on_error_stop => 0);
$in2 .= q {
select pg_sleep(0.03);
select pg_promote();
};

# Force a restartpoint, patched to sleep a while before checking
# ControlFile->state.
$h->pump_nb;

# Promote the standby, to set ControlFile->state to DB_IN_PRODUCTION
$h2->pump_nb;

# Simulate a crash, to skip the post-recovery checkpoint.
$bravo->psql('postgres', 'select pg_sleep(0.001)');
$bravo->psql('postgres', 'select chaos_sigsegv()');


# Check the log to see whether the database recovered from the crash.
my $logfile = slurp_file($bravo->logfile());
ok( $logfile !~ /invalid primary checkpoint/,
    'should recover from last checkpoint');

done_testing();
