Hi,
My buildfarm animal grassquit just showed an odd failure [1] in REL_11_STABLE:
ok 10 - standby is in recovery
# Running: pg_ctl -D
/mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata
promote
waiting for server to promote
On Tue, 17 Oct 2023 at 04:18, Thomas Munro wrote:
>
> I pushed the retry-loop-in-frontend-executables patch and the
> missing-locking-in-SQL-functions patch yesterday. That leaves the
> backup ones, which I've rebased and attached, no change. It sounds
> like we need some more healthy debate abo
On Thu, 11 Jan 2024 at 19:50, vignesh C wrote:
>
> On Tue, 17 Oct 2023 at 04:18, Thomas Munro wrote:
> >
> > I pushed the retry-loop-in-frontend-executables patch and the
> > missing-locking-in-SQL-functions patch yesterday. That leaves the
> > backup ones, which I've rebased and attached, no ch
Hello!
On 24.11.2022 04:02, Thomas Munro wrote:
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane wrote:
Thomas Munro writes:
ERROR: calculated CRC checksum does not match value stored in file
The attached draft patch fixes it.
Tried to catch this error on my PC, but failed to do it within a re
On Tue, Jan 31, 2023 at 2:10 PM Anton A. Melnikov wrote:
> Also checked for a few hours that the patch 0002 fixes this error,
> but there are some questions to its logical structure.
Hi Anton,
Thanks for looking!
> The equality between the previous and newly calculated crc is checked only
> if
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro wrote:
> Clearly there is an element of speculation or superstition here. I
> don't know what else to do if both PostgreSQL and ext4 decided not to
> add interlocking. Maybe we should rethink that. How bad would it
> really be if control file access
Hi, Thomas!
There are two variants of the patch now.
1) As for the first workaround:
On 31.01.2023 07:09, Thomas Munro wrote:
Maybe it's unlikely that two samples will match while running that
torture test, because it's overwriting the file as fast as it can.
But the idea is that a real syste
On Wed, Feb 1, 2023 at 5:04 PM Anton A. Melnikov wrote:
> On 31.01.2023 14:38, Thomas Munro wrote:
> > Here's an experimental patch for that alternative. I wonder if
> > someone would want to be able to turn it off for some reason -- maybe
> > some NFS problem? It's less back-patchable, but mayb
Hi, Thomas!
Thanks for your rapid answer, and sorry for my delay in replying.
On 01.02.2023 09:45, Thomas Munro wrote:
Might add a custom error message for EDEADLK
since it is absent from errcode_for_file_access()?
Ah, good thought. I think it shouldn't happen™, so it's OK that
errcode_for_file_acc
Hi, Thomas!
On 14.02.2023 06:38, Anton A. Melnikov wrote:
Also I did several experiments with fsync=on and found more appropriate
behavior:
The stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed in
4.5 hours,
but the other one with ordinary sizeof(ControlFileData) = 296 not
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov wrote:
> First of all, it seemed to me that this is not a problem at all, since MSDN
> guarantees sector-by-sector atomicity.
> "Physical Sector: The unit for which read and write operations to the device
> are completed in a single operation. This is the
On Fri, Feb 17, 2023 at 4:21 PM Thomas Munro wrote:
> While contemplating what else a mandatory file lock might break, I
> remembered that basebackup.c also reads the control file. Hrmph. Not
> addressed yet; I guess it might need to acquire/release around
> sendFile(sink, XLOG_CONTROL_FILE, ...
Hi, Thomas!
On 17.02.2023 06:21, Thomas Munro wrote:
There are two kinds of atomicity that we rely on for the control file today:
* atomicity on power loss (= device property, in case of overwrite filesystems)
* atomicity of concurrent reads and writes (= VFS or kernel buffer
pool interlocking
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov wrote:
> On 17.02.2023 06:21, Thomas Munro wrote:
> > BTW there are at least two other places where PostgreSQL already knows
> > that concurrent reads and writes are possibly non-atomic (and we also
> > don't even try to get the alignment right, m
Hi, Thomas!
On 04.03.2023 00:39, Thomas Munro wrote:
It seems a good topic for a separate thread patch. Would you provide a
link to the thread you mentioned please?
https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com
Thanks! The important words there:
On Wed, Mar 8, 2023 at 4:43 PM Anton A. Melnikov wrote:
> On 04.03.2023 00:39, Thomas Munro wrote:
> > Could we make better use of the safe copy that we have in the log?
> > Then the pg_backup_start() subproblem would disappear. Conceptually,
> > that'd be just like the way we use FPI for data pa
On 08.03.2023 07:28, Thomas Munro wrote:
Sorry, I was confused; please ignore that part. We don't have a copy
of the control file anywhere else. (Perhaps we should, but that could
be a separate topic.)
That’s all right! Fully agreed that this is a possible separate topic.
Sincerely yours,
-
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.
I'll abandon the others for now, since we're now thinking bigger[1]
for backups, sidestepping the problem.
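For readers following along, here is a minimal sketch of the retry idea behind 0002. It is written in Python rather than PostgreSQL's C and is not the committed code. A reader that sees a checksum mismatch assumes it may have raced with a concurrent write, and reads again a few times before reporting corruption. The record layout (payload followed by a little-endian CRC-32), the helper names pack_record, crc_ok and read_with_retry, and the attempt count and delay are all assumptions made up for this illustration; zlib.crc32 merely stands in for whatever checksum the real file uses.

```python
import struct
import time
import zlib


def pack_record(payload: bytes) -> bytes:
    """Build a record: payload followed by its CRC-32 (hypothetical layout)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def crc_ok(record: bytes) -> bool:
    """Return True if the trailing CRC-32 matches the payload."""
    if len(record) < 4:
        return False
    payload = record[:-4]
    (stored,) = struct.unpack("<I", record[-4:])
    return zlib.crc32(payload) == stored


def read_with_retry(read_once, max_attempts=10, delay=0.01):
    """Call read_once() until the checksum verifies or attempts run out.

    Returns (payload, attempts_used). Raises ValueError if every attempt
    saw a mismatch, which is then treated as real corruption.
    """
    for attempt in range(1, max_attempts + 1):
        record = read_once()
        if crc_ok(record):
            return record[:-4], attempt
        # A mismatch may only mean we raced with a writer: wait, then retry.
        time.sleep(delay)
    raise ValueError("checksum mismatch persisted after retries")


# Simulate one torn read followed by a consistent one.
good = pack_record(b"control data v2")
torn = good[:8] + b"XXXX" + good[12:]  # four payload bytes overwritten
reads = iter([torn, good])
payload, attempts = read_with_retry(lambda: next(reads), delay=0)
```

The trade-off in this sketch is that a persistent mismatch is still reported, only later; the retries just filter out transient torn reads.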
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
> I'm planning to push 0002 (retries in frontend programs, which is
> where this thread began) and 0004 (add missing locks to SQL
> functions), including back-patches as far as 12, in a day or so.
>
> I'll abandon the others for now, si
On 10/11/23 21:10, Michael Paquier wrote:
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so
On 10/12/23 09:58, David Steele wrote:
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote:
I'm planning to push 0002 (retries in frontend programs, which is
where this thread began) and 0004 (add missing locks to SQL
functions), including back-patches as far as 12, in a day or so.
I'l
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
> After some more thought, I think we could massage the "pg_control in
> backup_label" method into something that could be back patched, with more
> advanced features (e.g. error on backup_label and pg_control both present on
> initial c
On 10/12/23 19:15, Michael Paquier wrote:
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with more
advanced features (e.g. error on backup_label and
On 10/13/23 10:40, David Steele wrote:
On 10/12/23 19:15, Michael Paquier wrote:
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote:
After some more thought, I think we could massage the "pg_control in
backup_label" method into something that could be back patched, with
more
advanced
I pushed the retry-loop-in-frontend-executables patch and the
missing-locking-in-SQL-functions patch yesterday. That leaves the
backup ones, which I've rebased and attached, no change. It sounds
like we need some more healthy debate about that backup label idea
that would mean we don't need these
On Mon, Oct 16, 2023 at 6:48 PM Thomas Munro wrote:
> I pushed the retry-loop-in-frontend-executables patch and the
> missing-locking-in-SQL-functions patch yesterday. That leaves the
> backup ones, which I've rebased and attached, no change. It sounds
> like we need some more healthy debate abo
On Tue, Oct 17, 2023 at 10:50 AM Robert Haas wrote:
> Life would be a lot easier here if we could get rid of the low-level
> backup API and just have pg_basebackup DTWT, but that seems like a
> completely non-viable proposal.
>
Yeah, my contribution to this area [1] is focusing on the API becaus
On Tue, Nov 22, 2022 at 05:42:24PM -0800, Andres Freund wrote:
> The failure has to be happening in wait_for_postmaster_promote(), because the
> standby2 is actually successfully promoted.
That's the one under -fsanitize=address. It really smells to me like
a bug with a race condition all over it
On 2022-Nov-22, Andres Freund wrote:
> ok 10 - standby is in recovery
> # Running: pg_ctl -D
> /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata
> promote
> waiting for server to promotepg_ctl: control file appears to be co
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund wrote:
> The failure has to be happening in wait_for_postmaster_promote(), because the
> standby2 is actually successfully promoted.
I assume this is ext4. Presumably anything that reads the
controlfile, like pg_ctl, pg_checksums, pg_resetwal,
pg_con
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro wrote:
> On Wed, Nov 23, 2022 at 2:42 PM Andres Freund wrote:
> > The failure has to be happening in wait_for_postmaster_promote(), because
> > the
> > standby2 is actually successfully promoted.
>
> I assume this is ext4. Presumably anything that r
Thomas Munro writes:
> On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro wrote:
>> I assume this is ext4. Presumably anything that reads the
>> controlfile, like pg_ctl, pg_checksums, pg_resetwal,
>> pg_control_system(), ... by reading without interlocking against
>> writes could see garbage. I hav
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane wrote:
> Thomas Munro writes:
> > On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro
> > wrote:
> > As for what to do about it, some ideas:
> > 2. Retry after a short time on checksum failure. The probability is
> > already minuscule, and becomes pretty cl
On Thu, Nov 24, 2022 at 2:02 PM Thomas Munro wrote:
> ... and you'll soon see:
>
> ERROR: calculated CRC checksum does not match value stored in file
I forgot to mention: this reproducer only seems to work if fsync =
off. I don't know why, but I recall that was true also for bug
#17064.
This patch no longer applies and needs a rebase.
Given where we are in the commitfest, do you think this patch has the potential
to go in or should it be moved?
--
Daniel Gustafsson
This is a frustrating thread, because despite the last patch solving
most of the problems we discussed, it doesn't address the
low-level-backup procedure in a nice way. We'd have to tell users
they have to flock that file, or add a new step "pg_controldata --raw
> pg_control", which seems weird wh
Greetings,
(Adding David Steele into the CC on this one...)
* Thomas Munro (thomas.mu...@gmail.com) wrote:
> This is a frustrating thread, because despite the last patch solving
> most of the problems we discussed, it doesn't address the
> low-level-backup procedure in a nice way. We'd have to t
On Fri, Jul 21, 2023 at 8:52 PM Thomas Munro wrote:
> Idea for future research: Perhaps pg_backup_stop()'s label-file
> output should include the control file image (suitably encoded)? Then
> the recovery-from-label code could completely ignore the existing
> control file, and overwrite it using
On Tue, Jul 25, 2023 at 6:04 AM Stephen Frost wrote:
> * Thomas Munro (thomas.mu...@gmail.com) wrote:
> > Here's a new minimal patch that solves only the bugs in basebackup +
> > the simple SQL-facing functions that read the control file, by simply
> > acquiring ControlFileLock in the obvious plac
On Tue, Jul 25, 2023 at 8:18 AM Robert Haas wrote:
> (Yeah, I know we have code to verify checksums during a base
> backup, but as discussed elsewhere, it doesn't work.)
BTW the code you are referring to there seems to think 4KB
page-halves are atomic; not sure if that's imagining page-level
While chatting to Robert and Andres about all this, a new idea came
up. Or, rather, one of the first ideas that was initially rejected,
now resurrected to try out a suggestion of Andres’s on how to
de-pessimise it. Unfortunately, it also suffers from Windows-specific
problems that I originally me
Hi Thomas,
On 7/26/23 06:06, Thomas Munro wrote:
While chatting to Robert and Andres about all this, a new idea came
up. Or, rather, one of the first ideas that was initially rejected,
now resurrected to try out a suggestion of Andres’s on how to
de-pessimise it. Unfortunately, it also suffers
Hello!
On 26.07.2023 07:06, Thomas Munro wrote:
New patches
attached. Are they getting better?
It seems to me that it is worth focusing efforts on the second part of the
patch,
as the most in demand, and trying to commit it first.
And it seems there is a way to simplify it by adding a parameter
Sorry, attached the wrong version of the file. Here is the right one.
Sincerely yours,
--
Anton A. Melnikov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
alg_level_up.pdf
Description: Adobe PDF document