Re: race condition when writing pg_control

2024-05-17 Thread Thomas Munro
On Fri, May 17, 2024 at 4:46 PM Thomas Munro wrote: > The specific problem here is that LocalProcessControlFile() runs in > every launched child for EXEC_BACKEND builds. Windows uses > EXEC_BACKEND, and Windows' NTFS file system is one of the two file > systems known to this list to have the

Re: race condition when writing pg_control

2024-05-16 Thread Thomas Munro
The specific problem here is that LocalProcessControlFile() runs in every launched child for EXEC_BACKEND builds. Windows uses EXEC_BACKEND, and Windows' NTFS file system is one of the two file systems known to this list to have the concurrent read/write data mashing problem (the other being

Re: race condition when writing pg_control

2024-05-16 Thread Andres Freund
Hi, On 2024-05-16 15:01:31 -0400, Tom Lane wrote: > Andres Freund writes: > > On 2024-05-16 14:50:50 -0400, Tom Lane wrote: > >> The intention was certainly always that it be atomic. If it isn't > >> we have got *big* trouble. > > > We unfortunately do *know* that on several systems e.g.

Re: race condition when writing pg_control

2024-05-16 Thread Tom Lane
Andres Freund writes: > On 2024-05-16 14:50:50 -0400, Tom Lane wrote: >> The intention was certainly always that it be atomic. If it isn't >> we have got *big* trouble. > We unfortunately do *know* that on several systems e.g. basebackup can read a > partially written control file, while the

Re: race condition when writing pg_control

2024-05-16 Thread Andres Freund
Hi, On 2024-05-16 14:50:50 -0400, Tom Lane wrote: > Nathan Bossart writes: > > I suspect it will be difficult to investigate this one too much further > > unless we can track down a copy of the control file with the bad checksum. > > Other than searching for any new code that isn't doing the

Re: race condition when writing pg_control

2024-05-16 Thread Tom Lane
Nathan Bossart writes: > I suspect it will be difficult to investigate this one too much further > unless we can track down a copy of the control file with the bad checksum. > Other than searching for any new code that isn't doing the appropriate > locking, maybe we could search the buildfarm for

Re: race condition when writing pg_control

2024-05-16 Thread Nathan Bossart
On Thu, May 16, 2024 at 12:19:22PM -0400, Melanie Plageman wrote: > Today, after committing a3e6c6f, I saw recovery/018_wal_optimize.pl > fail and see this message in the replica log [2]. > > 2024-05-16 15:12:22.821 GMT [5440][not initialized] FATAL: incorrect > checksum in control file > > I'm

Re: race condition when writing pg_control

2024-05-16 Thread Melanie Plageman
On Sun, Jun 7, 2020 at 10:49 PM Thomas Munro wrote: > > On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier wrote: > > On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote: > > > Sorry for my radio silence, I got tangled up with a couple of > > > conferences. I'm planning to look at 0001 and

Re: race condition when writing pg_control

2020-06-08 Thread amul sul
On Fri, May 29, 2020 at 12:54 PM Fujii Masao wrote: > > > On 2020/05/27 16:10, Michael Paquier wrote: > > On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote: > >> While an assertion in UpdateControlFile() would not have helped us > >> catch the problem I initially reported, it does

Re: race condition when writing pg_control

2020-06-08 Thread Michael Paquier
On Mon, Jun 08, 2020 at 03:25:31AM +0000, Bossart, Nathan wrote: > On 6/7/20, 7:50 PM, "Thomas Munro" wrote: >> I pushed 0001 and 0002, squashed into one commit. I'm not sure about >> 0003. If we're going to do that, wouldn't it be better to just >> acquire the lock in that one extra place in

Re: race condition when writing pg_control

2020-06-07 Thread Bossart, Nathan
On 6/7/20, 7:50 PM, "Thomas Munro" wrote: > I pushed 0001 and 0002, squashed into one commit. I'm not sure about > 0003. If we're going to do that, wouldn't it be better to just > acquire the lock in that one extra place in StartupXLOG(), rather than > introducing the extra parameter? Thanks!

Re: race condition when writing pg_control

2020-06-07 Thread Thomas Munro
On Wed, Jun 3, 2020 at 2:03 PM Michael Paquier wrote: > On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote: > > Sorry for my radio silence, I got tangled up with a couple of > > conferences. I'm planning to look at 0001 and 0002 shortly. > > Thanks! I pushed 0001 and 0002, squashed

Re: race condition when writing pg_control

2020-06-02 Thread Michael Paquier
On Wed, Jun 03, 2020 at 10:56:13AM +1200, Thomas Munro wrote: > Sorry for my radio silence, I got tangled up with a couple of > conferences. I'm planning to look at 0001 and 0002 shortly. Thanks! -- Michael

Re: race condition when writing pg_control

2020-06-02 Thread Thomas Munro
On Tue, Jun 2, 2020 at 5:24 PM Michael Paquier wrote: > On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote: > > Thanks for the feedback. I've attached a new set of patches. > > Thanks for splitting the set. 0001 and 0002 are the minimum set for > back-patching, and it would be

Re: race condition when writing pg_control

2020-06-01 Thread Michael Paquier
On Sun, May 31, 2020 at 09:11:35PM +0000, Bossart, Nathan wrote: > Thanks for the feedback. I've attached a new set of patches. Thanks for splitting the set. 0001 and 0002 are the minimum set for back-patching, and it would be better to merge them together. 0003 is debatable and not an actual

Re: race condition when writing pg_control

2020-05-29 Thread Fujii Masao
On 2020/05/27 16:10, Michael Paquier wrote: > On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote: >> While an assertion in UpdateControlFile() would not have helped us catch the problem I initially reported, it does seem worthwhile to add it. I have attached a patch that adds this

Re: race condition when writing pg_control

2020-05-27 Thread Michael Paquier
On Tue, May 26, 2020 at 07:30:54PM +0000, Bossart, Nathan wrote: > While an assertion in UpdateControlFile() would not have helped us > catch the problem I initially reported, it does seem worthwhile to add > it. I have attached a patch that adds this assertion and also > attempts to fix

Re: race condition when writing pg_control

2020-05-26 Thread Bossart, Nathan
On 5/21/20, 9:52 PM, "Thomas Munro" wrote: > Here's a version with a commit message added. I'll push this to all > releases in a day or two if there are no objections. Looks good to me. Thanks! Nathan

Re: race condition when writing pg_control

2020-05-22 Thread Michael Paquier
On Sat, May 23, 2020 at 01:00:17AM +0900, Fujii Masao wrote: > Per my quick check, XLogReportParameters() seems to have the similar issue, > i.e., it updates the control file without taking ControlFileLock. > Maybe we should fix this at the same time? Yeah. It also checks the control file

Re: race condition when writing pg_control

2020-05-22 Thread Fujii Masao
On 2020/05/22 13:51, Thomas Munro wrote: > On Tue, May 5, 2020 at 9:51 AM Thomas Munro wrote: >> On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan wrote: >>> I believe I've discovered a race condition between the startup and checkpointer processes that can cause a CRC mismatch in the pg_control file.

Re: race condition when writing pg_control

2020-05-21 Thread Thomas Munro
On Tue, May 5, 2020 at 9:51 AM Thomas Munro wrote: > On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan wrote: > > I believe I've discovered a race condition between the startup and > > checkpointer processes that can cause a CRC mismatch in the pg_control > > file. If a cluster crashes at the

Re: race condition when writing pg_control

2020-05-04 Thread Thomas Munro
On Tue, May 5, 2020 at 5:53 AM Bossart, Nathan wrote: > I believe I've discovered a race condition between the startup and > checkpointer processes that can cause a CRC mismatch in the pg_control > file. If a cluster crashes at the right time, the following error > appears when you attempt to