On Thu, 26 Jun 2025 at 06:22, Michael Paquier <mich...@paquier.xyz> wrote: > > On Wed, Jun 25, 2025 at 10:19:55PM +0530, vignesh C wrote: > > Currently, the logic attempts to read the complete WAL record based on > > the size obtained before the crash—even though only a partial record > > was written. It then checks the page header to determine whether the > > XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set only after reading the > > complete WAL record at XLogDecodeNextRecord function, but since that > > much WAL data was not available in the system we never get a chance to > > check the header after this.. To address this issue, a more robust > > approach would be to first read the page header, check if the > > XLP_FIRST_IS_OVERWRITE_CONTRECORD flag is set, and only then proceed > > to read the WAL record size if the record is not marked as a partial > > overwrite. This would prevent the system from waiting for WAL data > > that will never arrive. Attached partial_wal_record_fix.patch patch > > for this. > > So you are suggesting the addition of an extra ReadPageInternal() that > forces a read of only the read, perform the checks on the header, then > read the rest. After reading SizeOfXLogShortPHD worth of data, > shouldn't the checks on xlp_rem_len be done a bit earlier than what > you are proposing in this patch?
Modified > > I don't have a consistent test to reproduce this issue, currently this > > issue can be reproduced by running the 046_checkpoint_logical_slot > > test about 50 times. This test 046_checkpoint_logical_slot was > > reverted recently after it caused a few buildfarm failures discussed > > at [2]. The same test is also attached as > > reverted_test_046_checkpoint_logical_slot.patch. > > Seeing the noise that this originally created in the CFbot and the > buildfarm, even if the issue would be only triggered after a timeout, > I'd vote for relying on this test being good enough for the purpose of > this race condition. I agree Another reliable approach would be to make the > code wait before reading the record in the internal loop of > ReadPageInternal() with an injection point when we know that we have a > contrecord, but I'm not really excited about this prospect in > xlogreader.c which can be also used in the frontend. I could not find an easy way to simulate this scenario through injection point, I felt 046_checkpoint_logical_slot.pl should be enough. The attached v2 version patch has the changes for the same. Regards, Vignesh
From 9029482c8dca9762b016743ca693b8e50396f253 Mon Sep 17 00:00:00 2001 From: Vignesh C <vignes...@gmail.com> Date: Thu, 26 Jun 2025 16:36:40 +0530 Subject: [PATCH v2] Fix infinite wait when reading partially written WAL record after crash If a crash occurs while writing a WAL record that spans multiple pages, the recovery process marks the page with the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag. However, logical decoding currently attempts to read the full WAL record based on its expected size before checking this flag, which can lead to an infinite wait if the remaining data is never written (e.g., no activity after crash). This patch updates the logic to first read the page header and check for the XLP_FIRST_IS_OVERWRITE_CONTRECORD flag before attempting to reconstruct the full WAL record. If the flag is set, decoding correctly identifies the record as incomplete and avoids waiting for WAL data that will never arrive. --- src/backend/access/transam/xlogreader.c | 22 ++- src/test/recovery/meson.build | 1 + .../recovery/t/046_checkpoint_logical_slot.pl | 136 ++++++++++++++++++ 3 files changed, 154 insertions(+), 5 deletions(-) create mode 100644 src/test/recovery/t/046_checkpoint_logical_slot.pl diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c index 2790ade1f91..cba00f6149b 100644 --- a/src/backend/access/transam/xlogreader.c +++ b/src/backend/access/transam/xlogreader.c @@ -723,11 +723,14 @@ restart: /* Calculate pointer to beginning of next page */ targetPagePtr += XLOG_BLCKSZ; - /* Wait for the next page to become available */ - readOff = ReadPageInternal(state, targetPagePtr, - Min(total_len - gotlen + SizeOfXLogShortPHD, - XLOG_BLCKSZ)); - + /* + * Read the page header before processing the WAL record data. This + * is necessary to correctly handle cases where the previous record + * ended as a partial record. Attempting to read the full + * record without checking the header may result in waiting + * indefinitely for data that doesn't exist. + */ + readOff = ReadPageInternal(state, targetPagePtr, SizeOfXLogShortPHD); if (readOff == XLREAD_WOULDBLOCK) return XLREAD_WOULDBLOCK; else if (readOff < 0) @@ -775,6 +778,15 @@ restart: LSN_FORMAT_ARGS(RecPtr)); goto err; } + + /* Wait for the next page to become available */ + readOff = ReadPageInternal(state, targetPagePtr, + Min(total_len - gotlen + SizeOfXLogShortPHD, + XLOG_BLCKSZ)); + if (readOff == XLREAD_WOULDBLOCK) + return XLREAD_WOULDBLOCK; + else if (readOff < 0) + goto err; /* Append the continuation from this page to the buffer */ pageHeaderSize = XLogPageHeaderSize(pageHeader); diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build index 6e78ff1a030..52993c32dbb 100644 --- a/src/test/recovery/meson.build +++ b/src/test/recovery/meson.build @@ -54,6 +54,7 @@ tests += { 't/043_no_contrecord_switch.pl', 't/044_invalidate_inactive_slots.pl', 't/045_archive_restartpoint.pl', + 't/046_checkpoint_logical_slot.pl', 't/047_checkpoint_physical_slot.pl', 't/048_vacuum_horizon_floor.pl' ], diff --git a/src/test/recovery/t/046_checkpoint_logical_slot.pl b/src/test/recovery/t/046_checkpoint_logical_slot.pl new file mode 100644 index 00000000000..d67c5108d78 --- /dev/null +++ b/src/test/recovery/t/046_checkpoint_logical_slot.pl @@ -0,0 +1,136 @@ +# Copyright (c) 2025, PostgreSQL Global Development Group +# +# This test verifies the case when the logical slot is advanced during +# checkpoint. The test checks that the logical slot's restart_lsn still refers +# to an existed WAL segment after immediate restart. +# +use strict; +use warnings FATAL => 'all'; + +use PostgreSQL::Test::Cluster; +use PostgreSQL::Test::Utils; + +use Test::More; + +if ($ENV{enable_injection_points} ne 'yes') +{ + plan skip_all => 'Injection points not supported by this build'; +} + +my ($node, $result); + +$node = PostgreSQL::Test::Cluster->new('mike'); +$node->init; +$node->append_conf('postgresql.conf', "wal_level = 'logical'"); +$node->start; + +# Check if the extension injection_points is available, as it may be +# possible that this script is run with installcheck, where the module +# would not be installed by default. +if (!$node->check_extension('injection_points')) +{ + plan skip_all => 'Extension injection_points not installed'; +} + +$node->safe_psql('postgres', q(CREATE EXTENSION injection_points)); + +# Create the two slots we'll need. +$node->safe_psql('postgres', + q{select pg_create_logical_replication_slot('slot_logical', 'test_decoding')} +); +$node->safe_psql('postgres', + q{select pg_create_physical_replication_slot('slot_physical', true)}); + +# Advance both slots to the current position just to have everything "valid". +$node->safe_psql('postgres', + q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null)} +); +$node->safe_psql('postgres', + q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())} +); + +# Run checkpoint to flush current state to disk and set a baseline. +$node->safe_psql('postgres', q{checkpoint}); + +# Generate some transactions to get RUNNING_XACTS. +my $xacts = $node->background_psql('postgres'); +$xacts->query_until( + qr/run_xacts/, + q(\echo run_xacts +SELECT 1 \watch 0.1 +\q +)); + +$node->advance_wal(20); + +# Run another checkpoint to set a new restore LSN. +$node->safe_psql('postgres', q{checkpoint}); + +$node->advance_wal(20); + +# Run another checkpoint, this time in the background, and make it wait +# on the injection point) so that the checkpoint stops right before +# removing old WAL segments. +note('starting checkpoint'); + +my $checkpoint = $node->background_psql('postgres'); +$checkpoint->query_safe( + q(select injection_points_attach('checkpoint-before-old-wal-removal','wait')) +); +$checkpoint->query_until( + qr/starting_checkpoint/, + q(\echo starting_checkpoint +checkpoint; +\q +)); + +# Wait until the checkpoint stops right before removing WAL segments. +note('waiting for injection_point'); +$node->wait_for_event('checkpointer', 'checkpoint-before-old-wal-removal'); +note('injection_point is reached'); + +# Try to advance the logical slot, but make it stop when it moves to the next +# WAL segment (this has to happen in the background, too). +my $logical = $node->background_psql('postgres'); +$logical->query_safe( + q{select injection_points_attach('logical-replication-slot-advance-segment','wait');} +); +$logical->query_until( + qr/get_changes/, + q( +\echo get_changes +select count(*) from pg_logical_slot_get_changes('slot_logical', null, null) \watch 1 +\q +)); + +# Wait until the slot's restart_lsn points to the next WAL segment. +note('waiting for injection_point'); +$node->wait_for_event('client backend', + 'logical-replication-slot-advance-segment'); +note('injection_point is reached'); + +# OK, we're in the right situation: time to advance the physical slot, which +# recalculates the required LSN, and then unblock the checkpoint, which +# removes the WAL still needed by the logical slot. +$node->safe_psql('postgres', + q{select pg_replication_slot_advance('slot_physical', pg_current_wal_lsn())} +); + +# Continue the checkpoint. +$node->safe_psql('postgres', + q{select injection_points_wakeup('checkpoint-before-old-wal-removal')}); + +# Abruptly stop the server (1 second should be enough for the checkpoint +# to finish; it would be better). +$node->stop('immediate'); + +$node->start; + +eval { + $node->safe_psql('postgres', + q{select count(*) from pg_logical_slot_get_changes('slot_logical', null, null);} + ); +}; +is($@, '', "Logical slot still valid"); + +done_testing(); -- 2.43.0