Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2024-02-01 Thread vignesh C
On Thu, 11 Jan 2024 at 19:50, vignesh C wrote: > > On Tue, 17 Oct 2023 at 04:18, Thomas Munro wrote: > > > > I pushed the retry-loop-in-frontend-executables patch and the > > missing-locking-in-SQL-functions patch yesterday. That leaves the > > backup ones, which I've rebased and attached, no

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2024-01-11 Thread vignesh C
On Tue, 17 Oct 2023 at 04:18, Thomas Munro wrote: > > I pushed the retry-loop-in-frontend-executables patch and the > missing-locking-in-SQL-functions patch yesterday. That leaves the > backup ones, which I've rebased and attached, no change. It sounds > like we need some more healthy debate

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-17 Thread David G. Johnston
On Tue, Oct 17, 2023 at 10:50 AM Robert Haas wrote: > Life would be a lot easier here if we could get rid of the low-level > backup API and just have pg_basebackup DTWT, but that seems like a > completely non-viable proposal. > Yeah, my contribution to this area [1] is focusing on the API

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-17 Thread Robert Haas
On Mon, Oct 16, 2023 at 6:48 PM Thomas Munro wrote: > I pushed the retry-loop-in-frontend-executables patch and the > missing-locking-in-SQL-functions patch yesterday. That leaves the > backup ones, which I've rebased and attached, no change. It sounds > like we need some more healthy debate

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-16 Thread Thomas Munro
I pushed the retry-loop-in-frontend-executables patch and the missing-locking-in-SQL-functions patch yesterday. That leaves the backup ones, which I've rebased and attached, no change. It sounds like we need some more healthy debate about that backup label idea that would mean we don't need

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-14 Thread David Steele
On 10/13/23 10:40, David Steele wrote: On 10/12/23 19:15, Michael Paquier wrote: On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote: After some more thought, I think we could massage the "pg_control in backup_label" method into something that could be back patched, with more

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-13 Thread David Steele
On 10/12/23 19:15, Michael Paquier wrote: On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote: After some more thought, I think we could massage the "pg_control in backup_label" method into something that could be back patched, with more advanced features (e.g. error on backup_label

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread Michael Paquier
On Thu, Oct 12, 2023 at 10:41:39AM -0400, David Steele wrote: > After some more thought, I think we could massage the "pg_control in > backup_label" method into something that could be back patched, with more > advanced features (e.g. error on backup_label and pg_control both present on > initial

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread David Steele
On 10/12/23 09:58, David Steele wrote: On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote: I'm planning to push 0002 (retries in frontend programs, which is where this thread began) and 0004 (add missing locks to SQL functions), including back-patches as far as 12, in a day or so.

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-12 Thread David Steele
On 10/11/23 21:10, Michael Paquier wrote: On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote: I'm planning to push 0002 (retries in frontend programs, which is where this thread began) and 0004 (add missing locks to SQL functions), including back-patches as far as 12, in a day or

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-11 Thread Michael Paquier
On Thu, Oct 12, 2023 at 12:25:34PM +1300, Thomas Munro wrote: > I'm planning to push 0002 (retries in frontend programs, which is > where this thread began) and 0004 (add missing locks to SQL > functions), including back-patches as far as 12, in a day or so. > > I'll abandon the others for now,

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-10-11 Thread Thomas Munro
I'm planning to push 0002 (retries in frontend programs, which is where this thread began) and 0004 (add missing locks to SQL functions), including back-patches as far as 12, in a day or so. I'll abandon the others for now, since we're now thinking bigger[1] for backups, side stepping the

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-30 Thread Anton A. Melnikov
Sorry, attached the wrong version of the file. Here is the right one. Sincerely yours, -- Anton A. Melnikov Postgres Professional: http://www.postgrespro.com The Russian Postgres Company alg_level_up.pdf Description: Adobe PDF document

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-30 Thread Anton A. Melnikov
Hello! On 26.07.2023 07:06, Thomas Munro wrote: New patches attached. Are they getting better? It seems to me that it is worth focusing efforts on the second part of the patch, as the most in demand. And try to commit it first. And seems there is a way to simplify it by adding a parameter

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-27 Thread David Steele
Hi Thomas, On 7/26/23 06:06, Thomas Munro wrote: While chatting to Robert and Andres about all this, a new idea came up. Or, rather, one of the first ideas that was initially rejected, now resurrected to try out a suggestion of Andres’s on how to de-pessimise it. Unfortunately, it also

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-25 Thread Thomas Munro
While chatting to Robert and Andres about all this, a new idea came up. Or, rather, one of the first ideas that was initially rejected, now resurrected to try out a suggestion of Andres’s on how to de-pessimise it. Unfortunately, it also suffers from Windows-specific problems that I originally

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-24 Thread Thomas Munro
On Tue, Jul 25, 2023 at 8:18 AM Robert Haas wrote: > (Yeah, I know we have code to verify checksums during a base > backup, but as discussed elsewhere, it doesn't work.) BTW the code you are referring to there seems to think 4KB page-halves are atomic; not sure if that's imagining page-level

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-24 Thread Thomas Munro
On Tue, Jul 25, 2023 at 6:04 AM Stephen Frost wrote: > * Thomas Munro (thomas.mu...@gmail.com) wrote: > > Here's a new minimal patch that solves only the bugs in basebackup + > > the simple SQL-facing functions that read the control file, by simply > > acquiring ControlFileLock in the obvious

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-24 Thread Robert Haas
On Fri, Jul 21, 2023 at 8:52 PM Thomas Munro wrote: > Idea for future research: Perhaps pg_backup_stop()'s label-file > output should include the control file image (suitably encoded)? Then > the recovery-from-label code could completely ignore the existing > control file, and overwrite it

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-24 Thread Stephen Frost
Greetings, (Adding David Steele into the CC on this one...) * Thomas Munro (thomas.mu...@gmail.com) wrote: > This is a frustrating thread, because despite the last patch solving > most of the problems we discussed, it doesn't address the > low-level-backup procedure in a nice way. We'd have to

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-21 Thread Thomas Munro
This is a frustrating thread, because despite the last patch solving most of the problems we discussed, it doesn't address the low-level-backup procedure in a nice way. We'd have to tell users they have to flock that file, or add a new step "pg_controldata --raw > pg_control", which seems weird

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-07-20 Thread Daniel Gustafsson
This patch no longer applies and needs a rebase. Given where we are in the commitfest, do you think this patch has the potential to go in or should it be moved? -- Daniel Gustafsson

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-03-08 Thread Anton A. Melnikov
On 08.03.2023 07:28, Thomas Munro wrote: Sorry, I was confused; please ignore that part. We don't have a copy of the control file anywhere else. (Perhaps we should, but that could be a separate topic.) That’s all right! Fully agreed that this is a possible separate topic. Sincerely yours,

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-03-07 Thread Thomas Munro
On Wed, Mar 8, 2023 at 4:43 PM Anton A. Melnikov wrote: > On 04.03.2023 00:39, Thomas Munro wrote: > > Could we make better use of the safe copy that we have in the log? > > Then the pg_backup_start() subproblem would disappear. Conceptually, > > that'd be just like the way we use FPI for data

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-03-07 Thread Anton A. Melnikov
Hi, Thomas! On 04.03.2023 00:39, Thomas Munro wrote: It seems a good topic for a separate thread patch. Would you provide a link to the thread you mentioned please? https://www.postgresql.org/message-id/flat/367d01a7-90bb-9b70-4cda-248e81cc475c%40cosium.com Thanks! The important words

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-03-03 Thread Thomas Munro
On Fri, Feb 24, 2023 at 11:12 PM Anton A. Melnikov wrote: > On 17.02.2023 06:21, Thomas Munro wrote: > > BTW there are at least two other places where PostgreSQL already knows > > that concurrent reads and writes are possibly non-atomic (and we also > > don't even try to get the alignment right,

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-02-24 Thread Anton A. Melnikov
Hi, Thomas! On 17.02.2023 06:21, Thomas Munro wrote: There are two kinds of atomicity that we rely on for the control file today: * atomicity on power loss (= device property, in case of overwrite filesystems) * atomicity of concurrent reads and writes (= VFS or kernel buffer pool interlocking

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-02-21 Thread Thomas Munro
On Fri, Feb 17, 2023 at 4:21 PM Thomas Munro wrote: > While contemplating what else a mandatory file lock might break, I > remembered that basebackup.c also reads the control file. Hrmph. Not > addressed yet; I guess it might need to acquire/release around > sendFile(sink, XLOG_CONTROL_FILE,

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-02-16 Thread Thomas Munro
On Tue, Feb 14, 2023 at 4:38 PM Anton A. Melnikov wrote: > First of all it seemed to me that is not a problem at all since msdn > guarantees sector-by-sector atomicity. > "Physical Sector: The unit for which read and write operations to the device > are completed in a single operation. This is

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-02-14 Thread Anton A. Melnikov
Hi, Thomas! On 14.02.2023 06:38, Anton A. Melnikov wrote: Also I did several experiments with fsync=on and found more appropriate behavior: The stress test with sizeof(ControlFileData) = 512+8 = 520 bytes failed in 4.5 hours, but the other one with ordinary sizeof(ControlFileData) = 296 not

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-02-13 Thread Anton A. Melnikov
Hi, Thomas! Thanks for your rapid answer and sorry for my delay in replying. On 01.02.2023 09:45, Thomas Munro wrote: Might add a custom error message for EDEADLK since it is absent from errcode_for_file_access()? Ah, good thought. I think it shouldn't happen™, so it's OK that

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-01-31 Thread Thomas Munro
On Wed, Feb 1, 2023 at 5:04 PM Anton A. Melnikov wrote: > On 31.01.2023 14:38, Thomas Munro wrote: > > Here's an experimental patch for that alternative. I wonder if > > someone would want to be able to turn it off for some reason -- maybe > > some NFS problem? It's less back-patchable, but

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-01-31 Thread Anton A. Melnikov
Hi, Thomas! There are two variants of the patch now. 1) As for the first workaround: On 31.01.2023 07:09, Thomas Munro wrote: Maybe it's unlikely that two samples will match while running that torture test, because it's overwriting the file as fast as it can. But the idea is that a real

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-01-31 Thread Thomas Munro
On Tue, Jan 31, 2023 at 5:09 PM Thomas Munro wrote: > Clearly there is an element of speculation or superstition here. I > don't know what else to do if both PostgreSQL and ext4 decided not to > add interlocking. Maybe we should rethink that. How bad would it > really be if control file access

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-01-30 Thread Thomas Munro
On Tue, Jan 31, 2023 at 2:10 PM Anton A. Melnikov wrote: > Also checked for a few hours that the patch 0002 fixes this error, > but there are some questions to its logical structure. Hi Anton, Thanks for looking! > The equality between the previous and newly calculated crc is checked only > if

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2023-01-30 Thread Anton A. Melnikov
Hello! On 24.11.2022 04:02, Thomas Munro wrote: On Thu, Nov 24, 2022 at 11:05 AM Tom Lane wrote: Thomas Munro writes: ERROR: calculated CRC checksum does not match value stored in file The attached draft patch fixes it. Tried to catch this error on my PC, but failed to do it within a

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-23 Thread Thomas Munro
On Thu, Nov 24, 2022 at 2:02 PM Thomas Munro wrote: > ... and you'll soon see: > > ERROR: calculated CRC checksum does not match value stored in file I forgot to mention: this reproducer only seems to work if fsync = off. I don't know why, but I recall that was true also for bug #17064.

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-23 Thread Thomas Munro
On Thu, Nov 24, 2022 at 11:05 AM Tom Lane wrote: > Thomas Munro writes: > > On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro > > wrote: > > As for what to do about it, some ideas: > > 2. Retry after a short time on checksum failure. The probability is > > already minuscule, and becomes pretty
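
Idea 2 in the message above (retry after a short time on checksum failure) can be illustrated roughly as follows. This is a minimal Python sketch, not the C patch discussed in the thread. The record layout (a payload followed by a little-endian CRC-32), the helper names pack_record, crc_ok and read_with_retry, and the rule of giving up once two consecutive bad reads are byte-identical are all assumptions made for illustration.

```python
import struct
import time
import zlib


def pack_record(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload (hypothetical layout, not pg_control's)."""
    return payload + struct.pack("<I", zlib.crc32(payload))


def crc_ok(raw: bytes) -> bool:
    """Return True if the trailing 4 bytes match the CRC-32 of the rest."""
    if len(raw) < 4:
        return False
    payload = raw[:-4]
    (stored,) = struct.unpack("<I", raw[-4:])
    return zlib.crc32(payload) == stored


def read_with_retry(read_fn, max_attempts=10, delay=0.0):
    """Call read_fn() until the checksum verifies.

    A reader that does not interlock with the writer may see a torn
    image (part old, part new), so one bad checksum is not proof of
    corruption. Assumption for this sketch: if two consecutive reads
    return the same bad bytes, the file is probably not being rewritten
    under us, so we report it as corrupt instead of retrying further.
    """
    previous = None
    for _ in range(max_attempts):
        raw = read_fn()
        if crc_ok(raw):
            return raw[:-4]          # payload without the checksum
        if raw == previous:
            break                    # same bad image twice in a row
        previous = raw
        time.sleep(delay)            # give a concurrent writer time to finish
    raise ValueError("control data appears to be corrupt")
```

With this sketch, a torn read followed by a clean read returns the payload, while a reader that keeps returning the same bad bytes gives up after the second attempt.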

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-23 Thread Tom Lane
Thomas Munro writes: > On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro wrote: >> I assume this is ext4. Presumably anything that reads the >> controlfile, like pg_ctl, pg_checksums, pg_resetwal, >> pg_control_system(), ... by reading without interlocking against >> writes could see garbage. I

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-23 Thread Thomas Munro
On Wed, Nov 23, 2022 at 11:03 PM Thomas Munro wrote: > On Wed, Nov 23, 2022 at 2:42 PM Andres Freund wrote: > > The failure has to be happening in wait_for_postmaster_promote(), because > > the > > standby2 is actually successfully promoted. > > I assume this is ext4. Presumably anything that

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-23 Thread Thomas Munro
On Wed, Nov 23, 2022 at 2:42 PM Andres Freund wrote: > The failure has to be happening in wait_for_postmaster_promote(), because the > standby2 is actually successfully promoted. I assume this is ext4. Presumably anything that reads the controlfile, like pg_ctl, pg_checksums, pg_resetwal,

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-23 Thread Alvaro Herrera
On 2022-Nov-22, Andres Freund wrote: > ok 10 - standby is in recovery > # Running: pg_ctl -D > /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata > promote > waiting for server to promotepg_ctl: control file appears to be

Re: odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-22 Thread Michael Paquier
On Tue, Nov 22, 2022 at 05:42:24PM -0800, Andres Freund wrote: > The failure has to be happening in wait_for_postmaster_promote(), because the > standby2 is actually successfully promoted. That's the one under -fsanitize=address. It really smells to me like a bug with a race condition all over

odd buildfarm failure - "pg_ctl: control file appears to be corrupt"

2022-11-22 Thread Andres Freund
Hi, My buildfarm animal grassquit just showed an odd failure [1] in REL_11_STABLE: ok 10 - standby is in recovery # Running: pg_ctl -D /mnt/resource/bf/build/grassquit/REL_11_STABLE/pgsql.build/src/bin/pg_ctl/tmp_check/t_003_promote_standby2_data/pgdata promote waiting for server to