Re: VM corruption on standby

2025-09-03 Thread Alexander Korotkov
On Wed, Sep 3, 2025 at 9:47 AM Andrey Borodin wrote: > > > On 3 Sep 2025, at 11:37, Alexander Korotkov wrote: > > > > Could you, please, recheck? > > That patch also adds CondVar sleep in critical section. That patch is how we > understood that such sleep is dangerous. > > Actual patch to deteac

Re: VM corruption on standby

2025-09-02 Thread Andrey Borodin
> On 3 Sep 2025, at 11:37, Alexander Korotkov wrote: > > Could you, please, recheck? That patch also adds CondVar sleep in critical section. That patch is how we understood that such sleep is dangerous. Actual patch to deteact a problem is much simpler: ``` diff --git a/src/backend/storage/

Re: VM corruption on standby

2025-09-02 Thread Alexander Korotkov
On Tue, Aug 12, 2025 at 8:38 AM Kirill Reshke wrote: > On Wed, 6 Aug 2025 at 20:00, Andrey Borodin wrote: > > > > Hi hackers! > > > > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL machinery very interesting. > > At Yandex we had several incidents with corrupted VM

Re: VM corruption on standby

2025-08-28 Thread Nathan Bossart
On Mon, Aug 25, 2025 at 10:07:26AM -0500, Nathan Bossart wrote: > Now that this is reverted, can the related open item be marked as resolved? Since there has been no further discussion, I will go ahead and resolve the open item. -- nathan

Re: VM corruption on standby

2025-08-25 Thread Nathan Bossart
[RMT hat] On Thu, Aug 21, 2025 at 06:42:48PM -0400, Tom Lane wrote: > Alexander Korotkov writes: >> On Tue, Aug 19, 2025 at 10:50 PM Tom Lane wrote: >>> Therefore, I vote for reverting bc22dc0e0. Hopefully only >>> temporarily, but it's too late to figure out another way for v18, >>> and I don'

Re: VM corruption on standby

2025-08-21 Thread Thomas Munro
On Fri, Aug 22, 2025 at 10:27 AM Alexander Korotkov wrote: > And let's retry it for v19. +1 I'm hoping we can fix PM death handling soon, and then I assume this can go straight back in without modification. CVs are an essential low level synchronisation component that really should work in lots

Re: VM corruption on standby

2025-08-21 Thread Alexander Korotkov
Hi, Tom! On Tue, Aug 19, 2025 at 10:50 PM Tom Lane wrote: > I'm inclined to think that we do want to prohibit WaitEventSetWait > inside a critical section --- it just seems like a bad idea all > around, even without considering this specific failure mode. > Therefore, I vote for reverting bc22dc0

Re: VM corruption on standby

2025-08-21 Thread Tom Lane
Alexander Korotkov writes: > On Tue, Aug 19, 2025 at 10:50 PM Tom Lane wrote: >> Therefore, I vote for reverting bc22dc0e0. Hopefully only >> temporarily, but it's too late to figure out another way for v18, >> and I don't think that bc22dc0e0 is such an essential improvement >> that we can't af

Re: VM corruption on standby

2025-08-21 Thread Michael Paquier
On Fri, Aug 22, 2025 at 01:27:17AM +0300, Alexander Korotkov wrote: > I'm OK about this. Do you mind if I revert bc22dc0e0 myself? > And let's retry it for v19. Yes, agreed that it may be the best thing to do for v18 based on the information we have gathered until now. -- Michael signature.asc

Re: VM corruption on standby

2025-08-20 Thread Thomas Munro
On Wed, Aug 20, 2025 at 3:47 PM Tom Lane wrote: > Having said that, we should in any case have a better story on > what WaitEventSetWait should do after detecting postmaster death. > So I'm all for trying to avoid the proc_exit path if we can > design a better answer. Yeah. I've posted a concept

Re: VM corruption on standby

2025-08-20 Thread Andrey Borodin
> On 20 Aug 2025, at 00:55, Tom Lane wrote: > > Andrey Borodin writes: >> I believe there is a bug with PageIsAllVisible(page) && >> visibilitymap_clear(). But I cannot prove it with an injection point test. >> Because injections points rely on CondVar, that per se creates corruption in >>

Re: VM corruption on standby

2025-08-20 Thread Michael Paquier
On Wed, Aug 20, 2025 at 09:14:04AM -0400, Andres Freund wrote: > On 2025-08-19 23:47:21 -0400, Tom Lane wrote: >> Hm. It still makes me mighty uncomfortable, because the point of a >> critical section is "crash the database if anything goes wrong during >> this bit". Waiting for another process -

Re: VM corruption on standby

2025-08-20 Thread Andres Freund
Hi, On 2025-08-19 23:47:21 -0400, Tom Lane wrote: > Thomas Munro writes: > > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane wrote: > >> I'm inclined to think that we do want to prohibit WaitEventSetWait > >> inside a critical section --- it just seems like a bad idea all > >> around, even without cons

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Wed, 20 Aug 2025 at 00:50, Tom Lane wrote: > > Kirill Reshke writes: > > I revert this commit (these were conflicts but i resolved them) and > > added assert for crit sections in WaitEventSetWait. > > Your patch still contains some conflict markers :-(. Attached is > a corrected version, jus

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Thomas Munro writes: > On Wed, Aug 20, 2025 at 7:50 AM Tom Lane wrote: >> I'm inclined to think that we do want to prohibit WaitEventSetWait >> inside a critical section --- it just seems like a bad idea all >> around, even without considering this specific failure mode. > FWIW aio/README.md des

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 11:59 AM Thomas Munro wrote: > they can't all be > blocked in sig_wait() unless there is already a deadlock. s/sig_wait()/sem_wait()/

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 7:50 AM Tom Lane wrote: > I'm inclined to think that we do want to prohibit WaitEventSetWait > inside a critical section --- it just seems like a bad idea all > around, even without considering this specific failure mode. FWIW aio/README.md describes a case where we'd need

Re: VM corruption on standby

2025-08-19 Thread Michael Paquier
On Tue, Aug 19, 2025 at 03:55:41PM -0400, Tom Lane wrote: > Yeah, I was coming to similar conclusions in the reply I just sent: > we don't really want a policy that we can't put injection-point-based > delays inside critical sections. So that infrastructure is leaving > something to be desired. Y

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Andrey Borodin writes: > I believe there is a bug with PageIsAllVisible(page) && > visibilitymap_clear(). But I cannot prove it with an injection point test. > Because injections points rely on CondVar, that per se creates corruption in > critical section. So I'm reading this discussion and won

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Kirill Reshke writes: > On Tue, 19 Aug 2025 at 10:32, Thomas Munro wrote: >> Any idea involving deferring the handling of PM death from here >> doesn't seem right: you'd keep waiting for the CV, but the backend >> that would wake you might have exited. Yeah. Taking the check for PM death out of

Re: VM corruption on standby

2025-08-19 Thread Andrey Borodin
> On 19 Aug 2025, at 23:23, Kirill Reshke wrote: > >> We'd probably be best off to get back to the actual bug the >> thread started with, namely whether we aren't doing the wrong >> thing with VM-update order of operations. >> >>regards, tom lane > > My understanding

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 23:08, Tom Lane wrote: > > Kirill Reshke writes: > > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: > >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in > >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS > >> SIGKILLED, and doesn

Re: VM corruption on standby

2025-08-19 Thread Tom Lane
Kirill Reshke writes: > On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: >> `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in >> WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS >> SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed >>

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 20:24, Andres Freund wrote: > > Hi, > > On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund wrote: > > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > > On linux - the primary OS with OOM killer troubles - I'm pret

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: > > > `if (CritSectionCount != 0) _exit(2) else proc_exit(1)` in > WaitEventSetWaitBlock () solves the issue of inconsistency IF POSTMASTER IS > SIGKILLED, and doesn't lead to any problem, if postmaster is not SIGKILL-ed > (since postmaster will S

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 21:16, Yura Sokolov wrote: > > That is not true. > elog(PANIC) doesn't clear LWLocks. And XLogWrite, which is could be called > from AdvanceXLInsertBuffer, may call elog(PANIC) from several places. > > It doesn't lead to any error, because usually postmaster is alive and it

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
19.08.2025 16:43, Kirill Reshke пишет: > On Tue, 19 Aug 2025 at 18:29, Yura Sokolov wrote: > > >> Latch and ConditionVariable (that uses Latch) are among basic >> synchronization primitives in PostgreSQL. > > Sure > >> Therefore they have to work correctly in any place: in critical section, in

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-20 03:19:38 +1200, Thomas Munro wrote: > On Wed, Aug 20, 2025 at 2:57 AM Andres Freund wrote: > > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll > > > > lwlock > > > > waiters would get killed due t

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 2:57 AM Andres Freund wrote: > On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll > > > lwlock > > > waiters would get killed due to the postmaster death signal we've > > > configured > > > (c.f.

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-20 02:54:09 +1200, Thomas Munro wrote: > > On linux - the primary OS with OOM killer troubles - I'm pretty sure'll > > lwlock > > waiters would get killed due to the postmaster death signal we've configured > > (c.f. PostmasterDeathSignalInit()). > > No, that has a handler that ju

Re: VM corruption on standby

2025-08-19 Thread Thomas Munro
On Wed, Aug 20, 2025 at 1:56 AM Andres Freund wrote: > On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > > > Then wouldn't backends blocked in LWLockAcquire(x) hang forever, after > > > someone who holds x calls _exit()? > > > > If someone who holds x is killed by (say) the OOM killer, how do > > we

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-19 02:13:43 -0400, Tom Lane wrote: > Thomas Munro writes: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > >> But I'm of the opinion that proc_exit > >> is the wrong thing to use after seeing postmaster death, critical > >> section or no. We should assume that system integri

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 18:29, Yura Sokolov wrote: > Latch and ConditionVariable (that uses Latch) are among basic > synchronization primitives in PostgreSQL. Sure > Therefore they have to work correctly in any place: in critical section, in > wal logging, etc. No. Before bc22dc0e0ddc2dcb6043a

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
19.08.2025 16:17, Kirill Reshke пишет: > On Tue, 19 Aug 2025 at 14:14, Kirill Reshke wrote: >> >> This thread is a candidate for [0] >> >> >> [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items >> > > Let me summarize this thread for ease of understanding of what's going on: > > Timelin

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
19.08.2025 16:09, Andres Freund пишет: > Hi, > > On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: >> 09.08.2025 22:54, Kirill Reshke пишет: >>> On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev >>> wrote: >>> Perhaps there was a good reason to update the VM *before* creating WAL records

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 14:14, Kirill Reshke wrote: > > This thread is a candidate for [0] > > > [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items > Let me summarize this thread for ease of understanding of what's going on: Timeline: 1) Andrey Borodin sends a patch (on 6 Aug) claiming

Re: VM corruption on standby

2025-08-19 Thread Andres Freund
Hi, On 2025-08-19 15:56:05 +0300, Yura Sokolov wrote: > 09.08.2025 22:54, Kirill Reshke пишет: > > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > > wrote: > > > >> Perhaps there was a good > >> reason to update the VM *before* creating WAL records I'm unaware of. > > > > Looks like 503c730 in

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
09.08.2025 22:54, Kirill Reshke пишет: > On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev > wrote: > >> Perhaps there was a good >> reason to update the VM *before* creating WAL records I'm unaware of. > > Looks like 503c730 intentionally does it this way; however, I have not > yet fully underst

Re: VM corruption on standby

2025-08-19 Thread Yura Sokolov
10.08.2025 08:45, Kirill Reshke пишет: > On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev > wrote: >> For this reason we have PageHeaderData.pd_lsn for instance - to make sure >> pages are evicted only *after* the record that changed it is written >> to disk (because WAL records can't be applied

Re: VM corruption on standby

2025-08-19 Thread Tomas Vondra
On 8/19/25 11:14, Kirill Reshke wrote: > This thread is a candidate for [0] > > > [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items> > Added, with Alexander as an owner (assuming it really is caused by commit bc22dc0e0d. regards -

Re: VM corruption on standby

2025-08-19 Thread Kirill Reshke
This thread is a candidate for [0] [0]https://wiki.postgresql.org/wiki/PostgreSQL_18_Open_Items Best regards, Kirill Reshke

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 10:32, Thomas Munro wrote: > > I don't know if there are other ways that LWLockReleaseAll() can lead > to persistent corruption that won't be corrected by crash recovery, > but this one is probably new since the following commit, explaining > the failure to reproduce on v17

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
On Tue, 19 Aug 2025 at 11:13, Tom Lane wrote: > > Thomas Munro writes: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > >> But I'm of the opinion that proc_exit > >> is the wrong thing to use after seeing postmaster death, critical > >> section or no. We should assume that system integrity

Re: VM corruption on standby

2025-08-18 Thread Tom Lane
Thomas Munro writes: > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: >> But I'm of the opinion that proc_exit >> is the wrong thing to use after seeing postmaster death, critical >> section or no. We should assume that system integrity is already >> compromised, and get out as fast as we can w

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
Hi! Thank you for putting attention to this. On Tue, 19 Aug 2025 at 10:32, Thomas Munro wrote: > > On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > > But I'm of the opinion that proc_exit > > is the wrong thing to use after seeing postmaster death, critical > > section or no. We should assume

Re: VM corruption on standby

2025-08-18 Thread Thomas Munro
On Tue, Aug 19, 2025 at 4:52 AM Tom Lane wrote: > But I'm of the opinion that proc_exit > is the wrong thing to use after seeing postmaster death, critical > section or no. We should assume that system integrity is already > compromised, and get out as fast as we can with as few side-effects > as

Re: VM corruption on standby

2025-08-18 Thread Tom Lane
Kirill Reshke writes: > On Sun, 17 Aug 2025 at 19:33, Tom Lane wrote: >> I do not like this patch one bit: it will replace one set of problems >> with another set, namely systems that fail to shut down. > I did not observe this during my by-hand testing. I am under the > impression that CRIT sec

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
On Mon, 18 Aug 2025 at 13:15, I wrote: > > I do not like this patch one bit: it will replace one set of problems > > with another set, namely systems that fail to shut down. > > I did not observe this during my by-hand testing. I am sorry: I was wrong. This is exactly what happens in this test (mo

Re: VM corruption on standby

2025-08-18 Thread Kirill Reshke
Hi! Thank you for putting attention to this. On Sun, 17 Aug 2025 at 19:33, Tom Lane wrote: > > Kirill Reshke writes: > > [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ] > > I do not like this patch one bit: it will replace one set of problems > with another set, namely system

Re: VM corruption on standby

2025-08-17 Thread Andrey Borodin
> On 17 Aug 2025, at 17:33, Tom Lane wrote: > > So I think the correct fix here is s/proc_exit(1)/_exit(2)/ in the > places that are responding to postmaster death. +1. But should we _exit(2) only in critical section or always in case of postmaster death? Another question that was botherin

Re: VM corruption on standby

2025-08-17 Thread Tom Lane
Kirill Reshke writes: > [ v1-0001-Do-not-exit-on-postmaster-death-ever-inside-CRIT-.patch ] I do not like this patch one bit: it will replace one set of problems with another set, namely systems that fail to shut down. I think the actual bug here is the use of proc_exit(1) after observing postma

Re: VM corruption on standby

2025-08-13 Thread Kirill Reshke
On Thu, 14 Aug 2025 at 10:41, Kirill Reshke wrote: > o I am trying to reproduce is following: > > 1) Some process p1 locks some buffer (name it buf1), enters CRIT > section, calls MarkBufferDirty and hangs inside XLogInsert on CondVar > in (GetXLogBuffer -> AdvanceXLInsertBuffer). > 2) CHECKPOINT

Re: VM corruption on standby

2025-08-13 Thread Kirill Reshke
On Wed, 13 Aug 2025 at 16:15, I wrote: > I did not find any doc or other piece of information indicating > whether WaitEventSetWait and critical sections are allowed. But I do > thing this is bad, because we do not process interruptions during > critical sections, so it is unclear to me why we shou

Re: VM corruption on standby

2025-08-13 Thread Kirill Reshke
On Tue, 12 Aug 2025 at 13:00, I wrote: > While this aims to find existing VM corruption (i mean, in PG <= 17), > this reproducer does not seem to work on pg17. At least, I did not > manage to reproduce this scenario on pg17. > > This makes me think this exact corruption may be pg18-only. Is it > p

Re: VM corruption on standby

2025-08-12 Thread Kirill Reshke
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin wrote: > > Hi hackers! > > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL > machinery very interesting. > At Yandex we had several incidents with corrupted VM and on pgconf.dev > colleagues from AWS confirmed that they sa

Re: VM corruption on standby

2025-08-11 Thread Kirill Reshke
On Tue, 12 Aug 2025 at 10:38, I wrote: > CHECKPOINT > somehow manages to flush the heap page when instance kill-9-ed. This corruption does not reproduce without CHECKPOINT call, however I do not see any suspicious syscal that CHECKPOINT's process does. It does not write anything to disk here, isn

Re: VM corruption on standby

2025-08-11 Thread Kirill Reshke
On Wed, 6 Aug 2025 at 20:00, Andrey Borodin wrote: > > Hi hackers! > > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL > machinery very interesting. > At Yandex we had several incidents with corrupted VM and on pgconf.dev > colleagues from AWS confirmed that they sa

Re: VM corruption on standby

2025-08-09 Thread Andrey Borodin
> On 9 Aug 2025, at 23:54, Aleksander Alekseev wrote: > > IMHO: logging the changes first, then allowing to evict the page. VM and BufferManager code does not allow flush of a buffer until changes are logged. The problem is that our crash-exiting system destroys locks that protect buffer fr

Re: VM corruption on standby

2025-08-09 Thread Kirill Reshke
On Sun, 10 Aug 2025 at 01:55, Aleksander Alekseev wrote: > For this reason we have PageHeaderData.pd_lsn for instance - to make sure > pages are evicted only *after* the record that changed it is written > to disk (because WAL records can't be applied to pages from the > future). We don't bump th

Re: VM corruption on standby

2025-08-09 Thread Aleksander Alekseev
Hi Andrey, > 0. checkpointer is going to flush a heap buffer but waits on content lock > 1. client is resetting PD_ALL_VISIBLE from page > 2. postmaster is killed and command client to go down > 3. client calls LWLockReleaseAll() at ProcKill() (?) > 4. checkpointer flushes buffer with reset PG_ALL

Re: VM corruption on standby

2025-08-09 Thread Kirill Reshke
On Thu, 7 Aug 2025 at 21:36, Aleksander Alekseev wrote: > Perhaps there was a good > reason to update the VM *before* creating WAL records I'm unaware of. Looks like 503c730 intentionally does it this way; however, I have not yet fully understood the reasoning behind it. -- Best regards, Kir

Re: VM corruption on standby

2025-08-09 Thread Andrey Borodin
> On 9 Aug 2025, at 18:28, Andrey Borodin wrote: > > Also I investigated that in a moment of kill -9 checkpointer flushes heap > page to disk despite content lock. I haven't found who released content lock > though. I've written this message and understood: its LWLockReleaseAll(). 0. check

Re: VM corruption on standby

2025-08-09 Thread Andrey Borodin
> On 7 Aug 2025, at 19:36, Aleksander Alekseev wrote: > > Maybe I misunderstood the intent of the test. You understood precisely my intent of writing the test. But it fail not due to a bug I anticipated! So far I noticed that if I move injection point before PageClearAllVisible(BufferGetPa

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi Andrey, > the test passes because you moved injection point to a very safe position > [...] > I want to emphasize that it seems to me that position of injection point is > not a hint, but rather coincidental. Well, I wouldn't say that the test passes merely because the location of the injecti

Re: VM corruption on standby

2025-08-07 Thread Andrey Borodin
> On 7 Aug 2025, at 17:09, Aleksander Alekseev wrote: > > If my understanding is correct, we should make a WAL record with the > XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within > the same critical section (in order to avoid race conditions within > the same backend). Wel

Re: VM corruption on standby

2025-08-07 Thread Andrey Borodin
> On 7 Aug 2025, at 18:54, Andrey Borodin wrote: > > moved injection point to a very safe position. BTW, your fix also fixes ALL_FROZEN stuff, just because WAL for heap insert is already emitted by the time of -9. I want to emphasize that it seems to me that position of injection point is n

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi, > If my understanding is correct, we should make a WAL record with the > XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within > the same critical section [...] > > A draft patch is attached. It makes the test pass and doesn't seem to > break any other tests. > > Thoughts? In

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
> > This is a tricky bug. Do you also have a proposal of a particular fix? > > If my understanding is correct, we should make a WAL record with the > XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within > the same critical section (in order to avoid race conditions within > the sam

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi again, > I meant instance, not backend. Sorry for confusion. It looks like I completely misunderstood what START_CRIT_SECTION() / END_CRIT_SECTION() are for here. Simply ignore this part :) Apologies for the noise.

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi, > This is a tricky bug. Do you also have a proposal of a particular fix? If my understanding is correct, we should make a WAL record with the XLH_LOCK_ALL_FROZEN_CLEARED flag *before* we modify the VM but within the same critical section (in order to avoid race conditions within the same back

Re: VM corruption on standby

2025-08-07 Thread Aleksander Alekseev
Hi Andrey, > I was reviewing the patch about removing xl_heap_visible and found the VM\WAL > machinery very interesting. > At Yandex we had several incidents with corrupted VM and on pgconf.dev > colleagues from AWS confirmed that they saw something similar too. > So I toyed around and accidenta