Re: Is this a problem in GenericXLogFinish()?
On Wed, Apr 10, 2024 at 03:28:22PM +0530, Amit Kapila wrote:
> I can understand this comment as I am aware of this code but not sure
> it would be equally easy for the people looking at this code for the
> first time. One may try to find the equivalent assertion in
> _hash_freeovflpage(). The alternative could be: "Ensure that the
> required flags are set when there are no tuples. See
> _hash_freeovflpage().". I am also fine if you prefer to go with your
> proposed comment.

Yes, I can see your point about why that's confusing. Your suggestion
is much better, so after a second look I've used your version of the
comment and applied the patch on HEAD.

I am wondering if we have other problems like that with dirty buffers
at replay. Perhaps I should put my nose more onto the replay paths and
extend these automated checks with wal_consistency_checking.
--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Wed, Apr 10, 2024 at 1:27 PM Michael Paquier wrote:
>
> On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
> >> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> >> I'm slightly annoyed by the fact that there is no check on
> >> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> >> show the symmetry between the insert and replay paths. Shouldn't
> >> there be at least an assert for that in the branch when there are no
> >> tuples (imagine an else to cover xldata->ntups == 0)? I mean
> >> just before updating the hash page's opaque area when
> >> is_prev_bucket_same_wrt.
> >
> > Indeed, added an Assert() in the else part. Was it the same as your
> > expectation?
>
> Yep, close enough. Thanks to the insert path we know that there will
> be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
> and the replay path where the assertion is added.
>

It is fine to have an assertion for this path.

+	else
+	{
+		/*
+		 * See _hash_freeovflpage() which has a similar assertion when
+		 * there are no tuples.
+		 */
+		Assert(xldata->is_prim_bucket_same_wrt ||
+			   xldata->is_prev_bucket_same_wrt);

I can understand this comment as I am aware of this code but not sure
it would be equally easy for the people looking at this code for the
first time. One may try to find the equivalent assertion in
_hash_freeovflpage(). The alternative could be: "Ensure that the
required flags are set when there are no tuples. See
_hash_freeovflpage().". I am also fine if you prefer to go with your
proposed comment.

Otherwise, the patch looks good to me.

--
With Regards,
Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
>> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
>> I'm slightly annoyed by the fact that there is no check on
>> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
>> show the symmetry between the insert and replay paths. Shouldn't
>> there be at least an assert for that in the branch when there are no
>> tuples (imagine an else to cover xldata->ntups == 0)? I mean
>> just before updating the hash page's opaque area when
>> is_prev_bucket_same_wrt.
>
> Indeed, added an Assert() in the else part. Was it the same as your
> expectation?

Yep, close enough. Thanks to the insert path we know that there will
be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
and the replay path where the assertion is added.

>> I've been thinking about ways to make such cases detectable in an
>> automated fashion. The best choice would be
>> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
>> copy of the block image stored in WAL before applying the page masking
>> which would mask the LSN. A problem, unfortunately, is that we would
>> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
>> flag so we would be able to track if the block rebuilt from the record
>> has the *same* LSN as the copy used for the consistency check. So
>> this edge consistency case would come at a cost, I am afraid, and the
>> 8 bits of bimg_info are precious :/
>
> I could not decide whether the change has more benefit than its cost, so
> I did nothing for it.

It does not prevent doing an implementation that can be used for some
test validation in the context of this thread. Using this idea, I have
done the attached 0002. This is not something to merge into the tree,
of course, just a toy to:
- Validate the fix for the problem we know.
- More broadly, check whether there are other problems in the areas
  covered by the main regression test suite.
And from what I can see, applying 0002 without 0001 causes the
following test to fail as the standby chokes on an inconsistent LSN
with a FATAL because the LSN of the applied page coming from the
primary and the LSN saved in the page replayed don't match. Here is a
command to reproduce the problem:

cd src/test/recovery/ && \
  PG_TEST_EXTRA=wal_consistency_checking \
  PROVE_TESTS=t/027_stream_regress.pl make check

And then in the logs you'd see stuff like:

2024-04-10 16:52:21.844 JST [2437] FATAL:  inconsistent page LSN replayed 0/A7E5CD18 primary 0/A7E56348
2024-04-10 16:52:21.844 JST [2437] CONTEXT:  WAL redo at 0/A7E59F68 for Hash/SQUEEZE_PAGE: prevblkno 28, nextblkno 4294967295, ntups 0, is_primary T; blkref #1: rel 1663/16384/25434, blk 7 FPW; blkref #2: rel 1663/16384/25434, blk 29 FPW; blkref #3: rel 1663/16384/25434, blk 28 FPW; blkref #5: rel 1663/16384/25434, blk 9 FPW; blkref #6: rel 1663/16384/25434, blk 0 FPW

I don't see other areas with a similar problem, to the extent of the
core regression tests, that is to say. That's my way of saying that
your patch looks good to me and that I'm planning to apply it to fix
the issue.

This shows me a more interesting issue unrelated to this thread:
027_stream_regress.pl would be stuck if the standby finds an
inconsistent page under wal_consistency_checking. This needs to be
fixed before I'm able to create a buildfarm animal with this setup.
I'll spawn a new thread about that tomorrow.
--
Michael

From 4c6597f2ce57439696aecdbda08b374857ab715d Mon Sep 17 00:00:00 2001
From: Michael Paquier
Date: Wed, 10 Apr 2024 16:47:29 +0900
Subject: [PATCH v3 1/2] Fix inconsistency with replay of hash squeeze record
 for clean buffers

Blah.
---
 src/backend/access/hash/hash_xlog.c | 21 +++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c
index cb1a63cfee..d7ba74579b 100644
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -666,6 +666,7 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		char	   *data;
 		Size		datalen;
 		uint16		ninserted = 0;
+		bool		mod_wbuf = false;
 
 		data = begin = XLogRecGetBlockData(record, 1, &datalen);
 
@@ -695,6 +696,17 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 				ninserted++;
 			}
+
+			mod_wbuf = true;
+		}
+		else
+		{
+			/*
+			 * See _hash_freeovflpage() which has a similar assertion when
+			 * there are no tuples.
+			 */
+			Assert(xldata->is_prim_bucket_same_wrt ||
+				   xldata->is_prev_bucket_same_wrt);
 		}
 
 		/*
@@ -711,10 +723,15 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			HashPageOpaque writeopaque = HashPageGetOpaque(writepage);
 
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
+			mod_wbuf = true;
 		}
 
-		PageSetLSN(writepage, lsn);
-		MarkBufferDirty(writebuf);
+		/* Set LSN and mark writebuf dirty iff it is modified */
+		if (mod_wbuf)
+		{
+			PageSetLSN(writep
RE: Is this a problem in GenericXLogFinish()?
Dear Michael,

> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> > Thanks for pointing out. Yes, we fixed a behavior by the commit aa5edbe37,
> > but we missed the redo case. I made a fix patch based on the suggestion [1].
>
> +	bool		mod_buf = false;
>
> Perhaps you could name that mod_wbuf, to be consistent with the WAL
> insert path.

Right, fixed.

> I'm slightly annoyed by the fact that there is no check on
> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> show the symmetry between the insert and replay paths. Shouldn't
> there be at least an assert for that in the branch when there are no
> tuples (imagine an else to cover xldata->ntups == 0)? I mean
> just before updating the hash page's opaque area when
> is_prev_bucket_same_wrt.

Indeed, added an Assert() in the else part. Was it the same as your
expectation?

> I've been thinking about ways to make such cases detectable in an
> automated fashion. The best choice would be
> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
> copy of the block image stored in WAL before applying the page masking
> which would mask the LSN. A problem, unfortunately, is that we would
> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
> flag so we would be able to track if the block rebuilt from the record
> has the *same* LSN as the copy used for the consistency check. So
> this edge consistency case would come at a cost, I am afraid, and the
> 8 bits of bimg_info are precious :/

I could not decide whether the change has more benefit than its cost,
so I did nothing for it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Is this a problem in GenericXLogFinish()?
On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> Thanks for pointing out. Yes, we fixed a behavior by the commit aa5edbe37,
> but we missed the redo case. I made a fix patch based on the suggestion [1].

+	bool		mod_buf = false;

Perhaps you could name that mod_wbuf, to be consistent with the WAL
insert path.

I'm slightly annoyed by the fact that there is no check on
is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
show the symmetry between the insert and replay paths. Shouldn't
there be at least an assert for that in the branch when there are no
tuples (imagine an else to cover xldata->ntups == 0)? I mean
just before updating the hash page's opaque area when
is_prev_bucket_same_wrt.

I've been thinking about ways to make such cases detectable in an
automated fashion. The best choice would be
verifyBackupPageConsistency(), just after RestoreBlockImage() on the
copy of the block image stored in WAL before applying the page masking
which would mask the LSN. A problem, unfortunately, is that we would
need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
flag so we would be able to track if the block rebuilt from the record
has the *same* LSN as the copy used for the consistency check. So
this edge consistency case would come at a cost, I am afraid, and the
8 bits of bimg_info are precious :/
--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Fri, Apr 05, 2024 at 01:26:29PM +0900, Michael Paquier wrote:
> On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
>> On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
>> > Thanks for the report and looking into it. Pushed!
>>
>> Thanks Amit for the commit and Kuroda-san for the patch!
>
> I have been pinged about this thread and I should have looked a bit
> closer, but wait a minute here.

I have forgotten to mention that Zubeyr Eryilmaz, a colleague, should
be credited here.
--
Michael
RE: Is this a problem in GenericXLogFinish()?
Dear Michael,

> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case. Isn't that incorrect? It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?

Thanks for pointing out. Yes, we fixed a behavior by the commit
aa5edbe37, but we missed the redo case. I made a fix patch based on the
suggestion [1].

The part below contains my analysis and how I reproduced the issue. I
posted it to clarify the issue for others.

=

## Pointed issue

Assume the case where:
- xlrec.ntups is 0,
- xlrec.is_prim_bucket_same_wrt is true, and
- xlrec.is_prev_bucket_same_wrt is false.

This means that there are several overflow pages and the tail dead page
is being removed. In this case, the primary page does not have to be
modified.

During normal operation, the removal is done in _hash_freeovflpage().
If the above condition is met, mod_wbuf is not set to true, so
PageSetLSN() is skipped.

During recovery, the squeeze and removal are done in
hash_xlog_squeeze_page(). In this function PageSetLSN() is called
unconditionally. He said that in this case PageSetLSN() should be
avoided as well.

## Analysis

IIUC this is the same issue as pointed out in [2]. He said that
PageSetLSN() should only be called when the page is really modified. In
the case being discussed, wbuf is not modified (we are just removing
the tail entry), so there is no need to assign an LSN to it. However,
we missed updating the redo case accordingly.

## How to reproduce

I could reproduce the case with the steps below:
1. Added the debug log like [3]
2. Constructed a physical replication.
3. Ran hash_index.sql
4. Found the added debug log.
[1]: https://www.postgresql.org/message-id/CAA4eK1%2BayneM-8mSRC0iWpDMnm39EwDoqgiOCSqrrMLcdnUnAA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZbyVVG_7eW3YD5-A%40paquier.xyz
[3]:
```
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -713,6 +713,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
 		}
 
+		if (xldata->ntups == 0 &&
+			xldata->is_prim_bucket_same_wrt &&
+			!xldata->is_prev_bucket_same_wrt)
+			elog(LOG, "XXX no need to set PageSetLSN");
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Is this a problem in GenericXLogFinish()?
On Fri, Apr 5, 2024 at 9:56 AM Michael Paquier wrote:
>
> On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
> > On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> > > Thanks for the report and looking into it. Pushed!
> >
> > Thanks Amit for the commit and Kuroda-san for the patch!
>
> I have been pinged about this thread and I should have looked a bit
> closer, but wait a minute here.
>
> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case. Isn't that incorrect?
>

Agreed. We will try to reproduce this.

> It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?
>

I think we can have a flag like mod_buf and set it in both the
conditions if (xldata->ntups > 0) and if
(xldata->is_prev_bucket_same_wrt). If the flag is set then we can set
the LSN and mark the buffer dirty.

--
With Regards,
Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
> On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> > Thanks for the report and looking into it. Pushed!
>
> Thanks Amit for the commit and Kuroda-san for the patch!

I have been pinged about this thread and I should have looked a bit
closer, but wait a minute here.

There is still some divergence between the code path of
_hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
squeezing a page: we do not set the LSN of the write buffer if
(xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
!xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
but at replay we call PageSetLSN() on the write buffer and mark it
dirty in this case. Isn't that incorrect? It seems to me that we
should be able to make the conditional depending on the state of the
xl_hash_squeeze_page record, no?
--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> Thanks for the report and looking into it. Pushed!

Thanks Amit for the commit and Kuroda-san for the patch!
--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Mon, Feb 5, 2024 at 1:33 PM Michael Paquier wrote:
>
> On Mon, Feb 05, 2024 at 07:57:09AM +, Hayato Kuroda (Fujitsu) wrote:
> > You are right. Based on the previous discussions, PageSetLSN() must be
> > called after MarkBufferDirty(). REGBUF_NO_CHANGE has been introduced
> > for skipping these requirements. Definitely, no_change buffers must not
> > be PageSetLSN()'d. Other pages, e.g., the metabuf, have already
> > followed the rule.
>
> At quick glance, this v2 seems kind of right to me: you are setting
> the page LSN only when the page is registered in the record and
> actually dirtied.

Thanks for the report and looking into it. Pushed!

--
With Regards,
Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Mon, Feb 05, 2024 at 07:57:09AM +, Hayato Kuroda (Fujitsu) wrote:
> You are right. Based on the previous discussions, PageSetLSN() must be
> called after MarkBufferDirty(). REGBUF_NO_CHANGE has been introduced
> for skipping these requirements. Definitely, no_change buffers must not
> be PageSetLSN()'d. Other pages, e.g., the metabuf, have already
> followed the rule.

At quick glance, this v2 seems kind of right to me: you are setting
the page LSN only when the page is registered in the record and
actually dirtied.
--
Michael
RE: Is this a problem in GenericXLogFinish()?
Dear Amit,

> > @@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
> >  		if (!xlrec.is_prev_bucket_same_wrt)
> >  			wbuf_flags |= REGBUF_NO_CHANGE;
> >  		XLogRegisterBuffer(1, wbuf, wbuf_flags);
> > +
> > +		/* Track the registration status for later use */
> > +		wbuf_registered = true;
> >  	}
> >
> >  	XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
> > @@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
> >
> >  	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
> >
> > -	PageSetLSN(BufferGetPage(wbuf), recptr);
> > +	/* Set LSN to wbuf page buffer only when it is being registered */
> > +	if (wbuf_registered)
> > +		PageSetLSN(BufferGetPage(wbuf), recptr);
>
> Why set LSN when the page is not modified (say when we use the flag
> REGBUF_NO_CHANGE)? I think you need to use a flag mod_wbuf and set it
> in appropriate places during register buffer calls.

You are right. Based on the previous discussions, PageSetLSN() must be
called after MarkBufferDirty(). REGBUF_NO_CHANGE has been introduced
for skipping these requirements. Definitely, no_change buffers must not
be PageSetLSN()'d. Other pages, e.g., the metabuf, have already
followed the rule.

I updated the patch based on the requirement.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Is this a problem in GenericXLogFinish()?
On Mon, Feb 5, 2024 at 10:00 AM Hayato Kuroda (Fujitsu) wrote:
>
> > Amit, this has been applied as of 861f86beea1c, and I got pinged about
> > the fact this triggers inconsistencies because we always set the LSN
> > of the write buffer (wbuf in _hash_freeovflpage) but
> > XLogRegisterBuffer() would *not* be called when the two following
> > conditions happen:
> > - When xlrec.ntups <= 0.
> > - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
> >
> > And it seems to me that there is still a bug here: there should be no
> > point in setting the LSN on the write buffer if we don't register it
> > in WAL at all, no?
>
> Thanks for pointing this out; I agree with you. PSA the patch for
> diagnosing the issue.
>

@@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		if (!xlrec.is_prev_bucket_same_wrt)
 			wbuf_flags |= REGBUF_NO_CHANGE;
 		XLogRegisterBuffer(1, wbuf, wbuf_flags);
+
+		/* Track the registration status for later use */
+		wbuf_registered = true;
 	}
 
 	XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
@@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 
 	recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
 
-	PageSetLSN(BufferGetPage(wbuf), recptr);
+	/* Set LSN to wbuf page buffer only when it is being registered */
+	if (wbuf_registered)
+		PageSetLSN(BufferGetPage(wbuf), recptr);

Why set LSN when the page is not modified (say when we use the flag
REGBUF_NO_CHANGE)? I think you need to use a flag mod_wbuf and set it
in appropriate places during register buffer calls.

--
With Regards,
Amit Kapila.
RE: Is this a problem in GenericXLogFinish()?
Dear Michael, Amit,

> Amit, this has been applied as of 861f86beea1c, and I got pinged about
> the fact this triggers inconsistencies because we always set the LSN
> of the write buffer (wbuf in _hash_freeovflpage) but
> XLogRegisterBuffer() would *not* be called when the two following
> conditions happen:
> - When xlrec.ntups <= 0.
> - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
>
> And it seems to me that there is still a bug here: there should be no
> point in setting the LSN on the write buffer if we don't register it
> in WAL at all, no?

Thanks for pointing this out; I agree with you. PSA the patch for
diagnosing the issue. This patch can avoid the inconsistency due to the
LSN setting and output a debug LOG when we meet such a case.

I executed hash_index.sql and confirmed the log was output [1]. This
means that the current test already has a workload which meets the
conditions below:
- the overflow page has no tuples (xlrec.ntups is 0),
- the to-be-written page - wbuf - is not the primary
  (xlrec.is_prim_bucket_same_wrt is false), and
- the to-be-written buffer is not next to the overflow page
  (xlrec.is_prev_bucket_same_wrt is false)

So, I think my patch (after removing the elog(...) part) can fix the
issue. Thoughts?

[1]:
```
LOG:  XXX: is_wbuf_registered: false
CONTEXT:  while vacuuming index "hash_cleanup_index" of relation "public.hash_cleanup_heap"
STATEMENT:  VACUUM hash_cleanup_heap;
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
Re: Is this a problem in GenericXLogFinish()?
On Fri, Feb 2, 2024 at 12:40 PM Michael Paquier wrote:
>
> Amit, this has been applied as of 861f86beea1c, and I got pinged about
> the fact this triggers inconsistencies because we always set the LSN
> of the write buffer (wbuf in _hash_freeovflpage) but
> XLogRegisterBuffer() would *not* be called when the two following
> conditions happen:
> - When xlrec.ntups <= 0.
> - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
>
> And it seems to me that there is still a bug here: there should be no
> point in setting the LSN on the write buffer if we don't register it
> in WAL at all, no?
>

Right, I see the problem. I'll look into it further.

--
With Regards,
Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Fri, Dec 01, 2023 at 03:27:33PM +0530, Amit Kapila wrote:
> Pushed!

Amit, this has been applied as of 861f86beea1c, and I got pinged about
the fact this triggers inconsistencies because we always set the LSN
of the write buffer (wbuf in _hash_freeovflpage) but
XLogRegisterBuffer() would *not* be called when the two following
conditions happen:
- When xlrec.ntups <= 0.
- When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt

And it seems to me that there is still a bug here: there should be no
point in setting the LSN on the write buffer if we don't register it
in WAL at all, no?
--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Thu, Nov 30, 2023 at 4:30 PM Alexander Lakhin wrote:
>
> 30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:
> > Again, good catch. Here is my analysis and fix patch.
> > I think it is sufficient to add an initialization for writebuf.
>
> I agree with the change.
>

Pushed!

--
With Regards,
Amit Kapila.
RE: Is this a problem in GenericXLogFinish()?
Dear Amit,

> I think it is acceptable especially if it increases code
> coverage. Can you once check that?

PSA the screenshot. I did PROVE_TESTS="t/027*" make check in
src/test/recovery, and generated a report. Line 661 was not hit on
HEAD, but the screenshot shows that it is now executed. [1]

[1]: https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Is this a problem in GenericXLogFinish()?
Hello Kuroda-san,

30.11.2023 10:28, Hayato Kuroda (Fujitsu) wrote:
> Again, good catch. Here is my analysis and fix patch.
> I think it is sufficient to add an initialization for writebuf.

I agree with the change. It aligns hash_xlog_squeeze_page() with
hash_xlog_move_page_contents() in regard to the initialization of
writebuf. Interestingly enough, the discrepancy has existed since these
functions appeared with c11453ce0, and I can't see what justified
adding the initialization inside hash_xlog_move_page_contents() only.
So I think that if this doesn't affect performance, having aligned
coding (writebuf initialized in both functions) is better.

> I want to add a test case for it as well. I've tested on my env and
> found that the proposed tuples seem sufficient.
> (We must fill the primary bucket, so the initial 500 is needed. Also,
> we must add many dead pages to lead to the squeeze operation. The
> final page seems OK with a smaller value.)
>
> I compared the execution time before/after patching, and they are not
> so different (1077 ms -> 1125 ms). What do you think?

I can confirm that the proposed test case demonstrates what is
expected, and the duration increase is tolerable, as to me.

Best regards,
Alexander
Re: Is this a problem in GenericXLogFinish()?
On Thu, Nov 30, 2023 at 12:58 PM Hayato Kuroda (Fujitsu) wrote:
>
> > Good catch, thank you for reporting! I will investigate more about it
> > and post my analysis.
>
> Again, good catch. Here is my analysis and fix patch.
> I think it is sufficient to add an initialization for writebuf.
>
> In the reported case, neither is_prim_bucket_same_wrt nor
> is_prev_bucket_same_wrt is set in the WAL record, and ntups is also
> zero. This means that wbuf is not written in the WAL record on the
> primary side (see [1]). Also, there is no reason to read (and lock)
> the other buffer page because we do not modify it. Based on that, I
> think that just initializing it as InvalidBuffer is sufficient.
>

Agreed.

> I want to add a test case for it as well. I've tested on my env and
> found that the proposed tuples seem sufficient.
> (We must fill the primary bucket, so the initial 500 is needed. Also,
> we must add many dead pages to lead to the squeeze operation. The
> final page seems OK with a smaller value.)
>
> I compared the execution time before/after patching, and they are not
> so different (1077 ms -> 1125 ms).
>

In my environment, it increased from 375ms to 385ms (median of five
runs). I think it is acceptable especially if it increases code
coverage. Can you once check that?

--
With Regards,
Amit Kapila.
RE: Is this a problem in GenericXLogFinish()?
Dear Alexander,

> Good catch, thank you for reporting! I will investigate more about it
> and post my analysis.

Again, good catch. Here is my analysis and fix patch.
I think it is sufficient to add an initialization for writebuf.

In the reported case, neither is_prim_bucket_same_wrt nor
is_prev_bucket_same_wrt is set in the WAL record, and ntups is also
zero. This means that wbuf is not written in the WAL record on the
primary side (see [1]). Also, there is no reason to read (and lock) the
other buffer page because we do not modify it. Based on that, I think
that just initializing it as InvalidBuffer is sufficient.

I want to add a test case for it as well. I've tested on my env and
found that the proposed tuples seem sufficient.
(We must fill the primary bucket, so the initial 500 is needed. Also,
we must add many dead pages to lead to the squeeze operation. The final
page seems OK with a smaller value.)

I compared the execution time before/after patching, and they are not
so different (1077 ms -> 1125 ms). What do you think?

[1]:
```
		else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
		{
			uint8		wbuf_flags;

			/*
			 * A write buffer needs to be registered even if no tuples are
			 * added to it to ensure that we can acquire a cleanup lock on it
			 * if it is the same as primary bucket buffer or update the
			 * nextblkno if it is same as the previous bucket buffer.
			 */
			Assert(xlrec.ntups == 0);

			wbuf_flags = REGBUF_STANDARD;
			if (!xlrec.is_prev_bucket_same_wrt)
				wbuf_flags |= REGBUF_NO_CHANGE;
			XLogRegisterBuffer(1, wbuf, wbuf_flags);
		}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: Is this a problem in GenericXLogFinish()?
Dear Alexander,

> I've discovered that that patch introduced a code path leading to an
> uninitialized memory access.
> With the following addition to hash_index.sql:
>  -- Fill overflow pages by "dead" tuples.
>  BEGIN;
>  INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
>  ROLLBACK;
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
>
> make check -C src/test/recovery/ PROVE_TESTS="t/027*"
> when executed under Valgrind, triggers:
> ==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised value(s)
> ==00:00:02:30.285 97744==    at 0x227585: BufferIsValid (bufmgr.h:303)
> ==00:00:02:30.285 97744==    by 0x227585: hash_xlog_squeeze_page (hash_xlog.c:781)
> ==00:00:02:30.285 97744==    by 0x228133: hash_redo (hash_xlog.c:1083)
> ==00:00:02:30.285 97744==    by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
> ==00:00:02:30.285 97744==    by 0x2C5C52: PerformWalRecovery (xlogrecovery.c:1774)
> ==00:00:02:30.285 97744==    by 0x2B63A1: StartupXLOG (xlog.c:5559)
> ==00:00:02:30.285 97744==    by 0x558165: StartupProcessMain (startup.c:282)
> ==00:00:02:30.285 97744==    by 0x54DFE8: AuxiliaryProcessMain (auxprocess.c:141)
> ==00:00:02:30.285 97744==    by 0x5546B0: StartChildProcess (postmaster.c:5331)
> ==00:00:02:30.285 97744==    by 0x557A53: PostmasterMain (postmaster.c:1458)
> ==00:00:02:30.285 97744==    by 0x4720C2: main (main.c:198)
> (in 027_stream_regress_standby_1.log)
>
> That is, when line
> https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
> is reached, writebuf stays uninitialized.

Good catch, thank you for reporting! I will investigate more about it
and post my analysis.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
Re: Is this a problem in GenericXLogFinish()?
Hello,

13.11.2023 20:21, Robert Haas wrote:
> On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu) wrote:
> > Moved.
>
> I see that this patch was committed, but I'm not very convinced that
> the approach is correct. The comment says this:

I've discovered that that patch introduced a code path leading to an
uninitialized memory access.
With the following addition to hash_index.sql:

 -- Fill overflow pages by "dead" tuples.
 BEGIN;
 INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;
 ROLLBACK;
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 1000) as i;

make check -C src/test/recovery/ PROVE_TESTS="t/027*"
when executed under Valgrind, triggers:

==00:00:02:30.285 97744== Conditional jump or move depends on uninitialised value(s)
==00:00:02:30.285 97744==    at 0x227585: BufferIsValid (bufmgr.h:303)
==00:00:02:30.285 97744==    by 0x227585: hash_xlog_squeeze_page (hash_xlog.c:781)
==00:00:02:30.285 97744==    by 0x228133: hash_redo (hash_xlog.c:1083)
==00:00:02:30.285 97744==    by 0x2C2801: ApplyWalRecord (xlogrecovery.c:1943)
==00:00:02:30.285 97744==    by 0x2C5C52: PerformWalRecovery (xlogrecovery.c:1774)
==00:00:02:30.285 97744==    by 0x2B63A1: StartupXLOG (xlog.c:5559)
==00:00:02:30.285 97744==    by 0x558165: StartupProcessMain (startup.c:282)
==00:00:02:30.285 97744==    by 0x54DFE8: AuxiliaryProcessMain (auxprocess.c:141)
==00:00:02:30.285 97744==    by 0x5546B0: StartChildProcess (postmaster.c:5331)
==00:00:02:30.285 97744==    by 0x557A53: PostmasterMain (postmaster.c:1458)
==00:00:02:30.285 97744==    by 0x4720C2: main (main.c:198)
==00:00:02:30.285 97744==
(in 027_stream_regress_standby_1.log)

That is, when line
https://coverage.postgresql.org/src/backend/access/hash/hash_xlog.c.gcov.html#661
is reached, writebuf stays uninitialized.

Best regards,
Alexander
Re: Is this a problem in GenericXLogFinish()?
On Mon, Nov 13, 2023 at 10:51 PM Robert Haas wrote:
>
> On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu) wrote:
> > Moved.
>
> I see that this patch was committed, but I'm not very convinced that
> the approach is correct. The comment says this:
>
> + /*
> +  * A write buffer needs to be registered even if no tuples are
> +  * added to it to ensure that we can acquire a cleanup lock on it
> +  * if it is the same as primary bucket buffer or update the
> +  * nextblkno if it is same as the previous bucket buffer.
> +  */
>
> But surely if the buffer is the same as one of those others, then it's
> registered anyway,

I don't think it is registered in the other cases. For example, consider the case when prevpage and the writepage are the same (aka xlrec.is_prev_bucket_same_wrt is true); it won't be registered in another code path (see comment [1]).

> and if it isn't, then it doesn't need to be.

In the previous example, we need it to update the nextblockno during replay. I am not sure if I understand the scenario you are worried about, so if my response doesn't address your concern, can you please explain a bit more about the scenario you have in mind?

[1] -
/*
 * If prevpage and the writepage (block in which we are moving tuples
 * from overflow) are same, then no need to separately register
 * prevpage. During replay, we can directly update the nextblock in
 * writepage.
 */

--
With Regards,
Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Mon, Nov 13, 2023 at 12:47 AM Hayato Kuroda (Fujitsu) wrote:
> Moved.

I see that this patch was committed, but I'm not very convinced that
the approach is correct. The comment says this:

+ /*
+  * A write buffer needs to be registered even if no tuples are
+  * added to it to ensure that we can acquire a cleanup lock on it
+  * if it is the same as primary bucket buffer or update the
+  * nextblkno if it is same as the previous bucket buffer.
+  */

But surely if the buffer is the same as one of those others, then it's
registered anyway, and if it isn't, then it doesn't need to be.

--
Robert Haas
EDB: http://www.enterprisedb.com
RE: Is this a problem in GenericXLogFinish()?
Dear Amit,

Thanks for reviewing! PSA new version patch.

> > > Next we should add some test code. I will continue considering, but
> > > please post anything if you have ideas.
> >
> > And I did, PSA the patch. This patch adds two parts in hash_index.sql.
> >
> > In the first part, the primary bucket page is filled by live tuples and
> > some overflow pages are filled by dead ones. This leads to the removal of
> > overflow pages without moving tuples. HEAD will crash if this is executed,
> > which is already reported on -hackers.
>
> +-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
> +CHECKPOINT;
> +VACUUM hash_cleanup_heap;
>
> Why do we need CHECKPOINT in the above test?

The CHECKPOINT command is needed to write the hash pages to disk. IIUC, if the command is omitted, none of the tuples are written to the hash index *file*, so the squeeze would not occur. You can test this by 1) restoring the changes for hashovfl.c and then 2) removing the CHECKPOINT command; the server crash does not occur.

> > The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt ==
> > false && is_prev_bucket_same_wrt == true), which was never exercised
> > before.
>
> +-- Insert few tuples, the primary bucket page will not satisfy
> +INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;
>
> What do you mean by the second part of the sentence (the primary
> bucket page will not satisfy)?

I meant to say that the primary bucket page can still hold more tuples. Maybe I should say "will not be full". Reworded.

> Few other minor comments:
> ===
> 1. Can we use ROLLBACK instead of ABORT in the tests?

Changed. IIRC they have the same meaning, but it seems that most of the SQL files use "ROLLBACK".

> 2.
> - for (i = 0; i < nitups; i++)
> + for (int i = 0; i < nitups; i++)
>
> I don't think there is a need to make this unrelated change.

Reverted. I thought that the variable i would be used only when nitups > 0 so that its declaration could be narrowed, but that was out of scope here.

> 3.
> + /*
> +  * A write buffer needs to be registered even if no tuples are added to
> +  * it to ensure that we can acquire a cleanup lock on it if it is the
> +  * same as primary bucket buffer or update the nextblkno if it is same
> +  * as the previous bucket buffer.
> +  */
> + else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
> + {
> +     uint8 wbuf_flags;
> +
> +     Assert(xlrec.ntups == 0);
>
> Can we move this comment inside else if, just before Assert?

Moved.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

fix_hash_squeeze_wal_5.patch
Description: fix_hash_squeeze_wal_5.patch
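For readers following the branch structure above, the registration decision can be modeled as a small self-contained sketch: register the write buffer when tuples are moved into it, or when it doubles as the primary bucket buffer (so the cleanup lock can be re-acquired at replay) or as the previous bucket buffer (so its nextblkno can be updated at replay). The flag values, the helper name, and the exact flag choice for the prev-bucket case are illustrative assumptions for this sketch, not the actual PostgreSQL source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative flag values; the real ones live in xloginsert.h. */
#define REGBUF_STANDARD  0x08
#define REGBUF_NO_CHANGE 0x10

/*
 * Hypothetical helper mirroring the decision discussed in the thread.
 * Returns false when the write buffer need not be registered at all;
 * otherwise sets *flags to the registration flags.
 */
static bool
wbuf_registration(int ntups, bool is_prim_bucket_same_wrt,
                  bool is_prev_bucket_same_wrt, uint8_t *flags)
{
    if (ntups > 0)
    {
        *flags = REGBUF_STANDARD;       /* page content changes */
        return true;
    }
    if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt)
    {
        *flags = REGBUF_STANDARD;
        if (!is_prev_bucket_same_wrt)
            *flags |= REGBUF_NO_CHANGE; /* page itself stays untouched */
        return true;
    }
    return false;                       /* nothing to register */
}
```

Note that in the prev-bucket case the page's nextblkno still changes at replay, which is why (under this reading of the thread) REGBUF_NO_CHANGE would not apply there.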
Re: Is this a problem in GenericXLogFinish()?
On Fri, Nov 10, 2023 at 9:17 AM Hayato Kuroda (Fujitsu) wrote:
>
> > Next we should add some test code. I will continue considering, but
> > please post anything if you have ideas.
>
> And I did, PSA the patch. This patch adds two parts in hash_index.sql.
>
> In the first part, the primary bucket page is filled by live tuples and
> some overflow pages are filled by dead ones. This leads to the removal of
> overflow pages without moving tuples. HEAD will crash if this is executed,
> which is already reported on -hackers.

+-- And do CHECKPOINT and vacuum. Some overflow pages will be removed.
+CHECKPOINT;
+VACUUM hash_cleanup_heap;

Why do we need CHECKPOINT in the above test?

> The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt ==
> false && is_prev_bucket_same_wrt == true), which was never exercised
> before.

+-- Insert few tuples, the primary bucket page will not satisfy
+INSERT INTO hash_cleanup_heap SELECT 1 FROM generate_series(1, 50) as i;

What do you mean by the second part of the sentence (the primary
bucket page will not satisfy)?

Few other minor comments:
===
1. Can we use ROLLBACK instead of ABORT in the tests?

2.
- for (i = 0; i < nitups; i++)
+ for (int i = 0; i < nitups; i++)

I don't think there is a need to make this unrelated change.

3.
+ /*
+  * A write buffer needs to be registered even if no tuples are added to
+  * it to ensure that we can acquire a cleanup lock on it if it is the
+  * same as primary bucket buffer or update the nextblkno if it is same
+  * as the previous bucket buffer.
+  */
+ else if (xlrec.is_prim_bucket_same_wrt || xlrec.is_prev_bucket_same_wrt)
+ {
+     uint8 wbuf_flags;
+
+     Assert(xlrec.ntups == 0);

Can we move this comment inside else if, just before Assert?

--
With Regards,
Amit Kapila.
RE: Is this a problem in GenericXLogFinish()?
Dear hackers,

> Next we should add some test code. I will continue considering, but please
> post anything if you have ideas.

And I did, PSA the patch. This patch adds two parts in hash_index.sql.

In the first part, the primary bucket page is filled by live tuples and some overflow pages are filled by dead ones. This leads to the removal of overflow pages without moving tuples. HEAD will crash if this is executed, which is already reported on -hackers.

The second one tests a case (ntups == 0 && is_prim_bucket_same_wrt == false && is_prev_bucket_same_wrt == true), which was never exercised before.

Also, I measured the execution time before/after patching. Below is the median of 9 measurements:

BEFORE -> AFTER
647 ms -> 710 ms

This means the execution time increased by roughly 10%, which should not matter much. What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

fix_hash_squeeze_wal_4.patch
Description: fix_hash_squeeze_wal_4.patch
RE: Is this a problem in GenericXLogFinish()?
Dear hackers,

> Dear Amit, Michael,
>
> Thanks for making the patch!
>
> > Why register wbuf at all if there are no tuples to add and it is not
> > the same as bucketbuf? Also, I think this won't be correct if prevbuf
> > and wrtbuf are the same and also we have no tuples to add to wbuf. I
> > have attached a naive and crude way to achieve it. This needs more
> > work, both in terms of trying to find a better way to change the code
> > and of ensuring this won't break any existing case. I have just run
> > the existing tests. Such a fix certainly requires more testing.
>
> I'm verifying the idea (currently it seems OK), but at least there is a
> coding error - wbuf_flags should be uint8 here. PSA the fixed patch.

Here is a new patch which is a bit refactored. It:

* Swapped the if-conditions in _hash_freeovflpage().
* Based on the above, added an Assert(xlrec.ntups == 0).
* Adjusted a condition in hash_xlog_squeeze_page() to follow the change as well.
* Adjusted comments.

Next we should add some test code. I will continue considering, but please post anything if you have ideas.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

fix_hash_squeeze_wal_3.patch
Description: fix_hash_squeeze_wal_3.patch
RE: Is this a problem in GenericXLogFinish()?
Dear Amit, Michael,

Thanks for making the patch!

> Why register wbuf at all if there are no tuples to add and it is not
> the same as bucketbuf? Also, I think this won't be correct if prevbuf
> and wrtbuf are the same and also we have no tuples to add to wbuf. I
> have attached a naive and crude way to achieve it. This needs more
> work, both in terms of trying to find a better way to change the code
> and of ensuring this won't break any existing case. I have just run
> the existing tests. Such a fix certainly requires more testing.

I'm verifying the idea (currently it seems OK), but at least there is a coding error - wbuf_flags should be uint8 here. PSA the fixed patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

fix_hash_squeeze_wal_2.patch
Description: fix_hash_squeeze_wal_2.patch
Re: Is this a problem in GenericXLogFinish()?
On Wed, Nov 1, 2023 at 12:54 PM Michael Paquier wrote:
>
> On Mon, Oct 30, 2023 at 03:54:39PM +0530, Amit Kapila wrote:
> > On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier wrote:
> >> On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
> >> > Yes, we need it to exclude any concurrent in-progress scans that could
> >> > return incorrect tuples during bucket squeeze operation.
> >>
> >> Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the
> >> primary and write buffers are the same and there are no tuples to
> >> move. Say with something like the attached?
> >
> > What if the primary and write buffer are not the same but ntups is
> > zero? Won't we hit the assertion again in that case?
>
> The code tells that it should be able to handle such a case,

It should be possible when we encounter some page in between the chain that has all dead items and the write buffer points to some page after the primary bucket page and before the read buffer page.

> so this
> would mean to set REGBUF_NO_CHANGE only when we have no tuples to
> move:
>
> -        XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
> +        /*
> +         * A cleanup lock is still required on the write buffer even
> +         * if there are no tuples to move, so it needs to be registered
> +         * in this case.
> +         */
> +        wbuf_flags = REGBUF_STANDARD;
> +        if (xlrec.ntups == 0)
> +            wbuf_flags |= REGBUF_NO_CHANGE;
> +        XLogRegisterBuffer(1, wbuf, wbuf_flags);

Why register wbuf at all if there are no tuples to add and it is not the same as bucketbuf? Also, I think this won't be correct if prevbuf and wrtbuf are the same and also we have no tuples to add to wbuf. I have attached a naive and crude way to achieve it. This needs more work, both in terms of trying to find a better way to change the code and of ensuring this won't break any existing case. I have just run the existing tests. Such a fix certainly requires more testing.

--
With Regards,
Amit Kapila.

fix_hash_squeeze_wal_1.patch
Description: Binary data
Re: Is this a problem in GenericXLogFinish()?
On Mon, Oct 30, 2023 at 03:54:39PM +0530, Amit Kapila wrote:
> On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier wrote:
>> On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
>> > Yes, we need it to exclude any concurrent in-progress scans that could
>> > return incorrect tuples during bucket squeeze operation.
>>
>> Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the
>> primary and write buffers are the same and there are no tuples to
>> move. Say with something like the attached?
>
> What if the primary and write buffer are not the same but ntups is
> zero? Won't we hit the assertion again in that case?

The code tells that it should be able to handle such a case, so this would mean to set REGBUF_NO_CHANGE only when we have no tuples to move:

-        XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+        /*
+         * A cleanup lock is still required on the write buffer even
+         * if there are no tuples to move, so it needs to be registered
+         * in this case.
+         */
+        wbuf_flags = REGBUF_STANDARD;
+        if (xlrec.ntups == 0)
+            wbuf_flags |= REGBUF_NO_CHANGE;
+        XLogRegisterBuffer(1, wbuf, wbuf_flags);

Anyway, there is still a hole here: the regression tests have zero coverage for the case where there are no tuples to move if !is_prim_bucket_same_wrt. There are only two queries that stress the squeeze path with no tuples, and both use is_prim_bucket_same_wrt:

INSERT INTO hash_split_heap SELECT a/2 FROM generate_series(1, 25000) a;
VACUUM hash_split_heap;

Perhaps this should have more coverage to make sure that all these scenarios are covered at replay (likely with 027_stream_regress.pl)? The case posted by Alexander at [1] falls under the same category (is_prim_bucket_same_wrt with no tuples).

[1]: https://www.postgresql.org/message-id/f045c8f7-ee24-ead6-3679-c04a43d21...@gmail.com

--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Mon, Oct 30, 2023 at 11:15 PM Amit Kapila wrote: > My understanding is the same. It is possible that my wording is not > clear. Let me try to clarify again, Michael asked: "do we need the > cleanup lock on the write buffer even if there are no tuples, and even > if primary bucket and the write bucket are the same?" My reading of > his question was do we need a cleanup lock even if the primary bucket > and write bucket are the same which means the write bucket will be the > first page in the chain and we need a cleanup lock on it. I think the > second condition (no tuples) seems irrelevant here as whether that is > true or false we need a cleanup lock. Ah, OK, I see now. I missed the fact that, per what Michael wrote, you were assuming the primary buffer and the write buffer were the same. In that case I agree, but the whole formulation seems backwards. I think the clearer way to say this is: we need a cleanup lock on the primary bucket page no matter what; and we need to write tuples into the write buffer if there are any tuples to write but not if there aren't. If the two buffers are the same then the requirements are added together. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Mon, Oct 30, 2023 at 7:13 PM Robert Haas wrote: > > On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila wrote: > > > Hmm. So my question is: do we need the cleanup lock on the write > > > buffer even if there are no tuples, and even if primary bucket and the > > > write bucket are the same? > > > > Yes, we need it to exclude any concurrent in-progress scans that could > > return incorrect tuples during bucket squeeze operation. > > Amit, thanks for weighing in, but I'm not convinced. I thought we only > ever used a cleanup lock on the main bucket page to guard against > concurrent scans. > My understanding is the same. It is possible that my wording is not clear. Let me try to clarify again, Michael asked: "do we need the cleanup lock on the write buffer even if there are no tuples, and even if primary bucket and the write bucket are the same?" My reading of his question was do we need a cleanup lock even if the primary bucket and write bucket are the same which means the write bucket will be the first page in the chain and we need a cleanup lock on it. I think the second condition (no tuples) seems irrelevant here as whether that is true or false we need a cleanup lock. > Here you seem to be saying that we need a cleanup > lock on some page that may be an overflow page somewhere in the middle > of the chain, and that doesn't seem right to me. > Sorry, I don't intend to say this. -- With Regards, Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Sat, Oct 28, 2023 at 6:15 AM Amit Kapila wrote: > > Hmm. So my question is: do we need the cleanup lock on the write > > buffer even if there are no tuples, and even if primary bucket and the > > write bucket are the same? > > Yes, we need it to exclude any concurrent in-progress scans that could > return incorrect tuples during bucket squeeze operation. Amit, thanks for weighing in, but I'm not convinced. I thought we only ever used a cleanup lock on the main bucket page to guard against concurrent scans. Here you seem to be saying that we need a cleanup lock on some page that may be an overflow page somewhere in the middle of the chain, and that doesn't seem right to me. So ... are you sure? If yes, can you provide any more detailed justification? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Sat, Oct 28, 2023 at 4:30 PM Michael Paquier wrote: > > On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote: > > Yes, we need it to exclude any concurrent in-progress scans that could > > return incorrect tuples during bucket squeeze operation. > > Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the > primary and write buffers are the same and there are no tuples to > move. Say with something like the attached? > What if the primary and write buffer are not the same but ntups is zero? Won't we hit the assertion again in that case? -- With Regards, Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Sat, Oct 28, 2023 at 03:45:13PM +0530, Amit Kapila wrote:
> Yes, we need it to exclude any concurrent in-progress scans that could
> return incorrect tuples during bucket squeeze operation.

Thanks. So I assume that we should just set REGBUF_NO_CHANGE when the primary and write buffers are the same and there are no tuples to move. Say with something like the attached?

--
Michael

diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c
index 9d1ff20b92..9928068179 100644
--- a/src/backend/access/hash/hashovfl.c
+++ b/src/backend/access/hash/hashovfl.c
@@ -647,6 +647,7 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 		xl_hash_squeeze_page xlrec;
 		XLogRecPtr	recptr;
 		int			i;
+		int			wbuf_flags;
 
 		xlrec.prevblkno = prevblkno;
 		xlrec.nextblkno = nextblkno;
@@ -668,7 +669,15 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
 			XLogRegisterBuffer(0, bucketbuf, flags);
 		}
 
-		XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD);
+		/*
+		 * If the bucket buffer and the primary buffer are the same, a cleanup
+		 * lock is still required on the write buffer even if there are no
+		 * tuples to move, so it needs to be registered in this case.
+		 */
+		wbuf_flags = REGBUF_STANDARD;
+		if (xlrec.is_prim_bucket_same_wrt && xlrec.ntups == 0)
+			wbuf_flags |= REGBUF_NO_CHANGE;
+		XLogRegisterBuffer(1, wbuf, wbuf_flags);
 		if (xlrec.ntups > 0)
 		{
 			XLogRegisterBufData(1, (char *) itup_offsets,
Re: Is this a problem in GenericXLogFinish()?
On Fri, Oct 27, 2023 at 5:45 AM Michael Paquier wrote: > > On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote: > > Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be > > the same (your first scenario) but the second scenario you mention > > (nextbuf == wbuf) should be impossible. > > Okay.. > > > It seems to me that maybe we shouldn't even be registering wbuf or > > doing anything at all to it if there are no tuples that need moving. > > That would also require some adjustment of the redo routine. > > Hmm. So my question is: do we need the cleanup lock on the write > buffer even if there are no tuples, and even if primary bucket and the > write bucket are the same? > Yes, we need it to exclude any concurrent in-progress scans that could return incorrect tuples during bucket squeeze operation. -- With Regards, Amit Kapila.
Re: Is this a problem in GenericXLogFinish()?
On Thu, Oct 26, 2023 at 09:40:09AM -0400, Robert Haas wrote:
> Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be
> the same (your first scenario) but the second scenario you mention
> (nextbuf == wbuf) should be impossible.

Okay..

> It seems to me that maybe we shouldn't even be registering wbuf or
> doing anything at all to it if there are no tuples that need moving.
> That would also require some adjustment of the redo routine.

Hmm. So my question is: do we need the cleanup lock on the write buffer even if there are no tuples, and even if the primary bucket and the write bucket are the same? I'd like to think that what you say is OK, still I am not completely sure after reading the lock assumptions in the hash README or 6d46f4783efe. A simpler thing would be to mark buffer 1 with REGBUF_NO_CHANGE when the primary and write buffers are the same if we expect the lock to always be taken, I guess..

I've noticed that the replay paths for XLOG_HASH_MOVE_PAGE_CONTENTS and XLOG_HASH_SQUEEZE_PAGE are similar in their page handling (some copy-paste?). A MOVE record should never have zero tuples, still the replay path assumes that this can be possible, so it could be simplified.

--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Thu, Oct 26, 2023 at 3:31 AM Michael Paquier wrote:
> Hmm. Looking at hash_xlog_squeeze_page(), it looks like wbuf is
> expected to always be registered. So, you're right here: it should be
> OK to be less aggressive with setting the page LSN on wbuf if ntups is
> 0, but there's more to it? For example, it is possible that
> bucketbuf, prevbuf and wbuf refer to the same buffer, causing
> is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true.
> Particularly, if nextbuf and wbuf are the same buffers, we finish by
> registering twice the same buffer with two different IDs to perform
> the tuple insertion and the opaque area updates in two steps. And
> that's.. Err.. not really necessary? I mean, as far as I read this
> code you could just reuse the buffer registered at ID 1 and update its
> opaque area if is_prim_bucket_same_wrt is true?

Read the comments for _hash_squeezebucket. Particularly, note this part:

 * Try to squeeze the tuples onto pages occurring earlier in the
 * bucket chain in an attempt to free overflow pages. When we start
 * the "squeezing", the page from which we start taking tuples (the
 * "read" page) is the last bucket in the bucket chain and the page
 * onto which we start squeezing tuples (the "write" page) is the
 * first page in the bucket chain. The read page works backward and
 * the write page works forward; the procedure terminates when the
 * read page and write page are the same page.

Because of this, it is possible for bucketbuf, prevbuf, and wbuf to be the same (your first scenario) but the second scenario you mention (nextbuf == wbuf) should be impossible.

It seems to me that maybe we shouldn't even be registering wbuf or doing anything at all to it if there are no tuples that need moving. That would also require some adjustment of the redo routine.

--
Robert Haas
EDB: http://www.enterprisedb.com
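The invariant quoted above (the read cursor moves backward from the chain's tail, the write cursor forward from its head, stopping when they meet) can be checked with a toy model. This is not the on-disk bucket chain; it is just an index-based sketch, with a simplified advancement rule, showing why the read page can never end up behind the write page, and hence why nextbuf == wbuf cannot occur:

```c
#include <assert.h>

/*
 * Toy model of the squeeze cursors: pages are positions 0..chain_len-1.
 * The write cursor starts at the head, the read cursor at the tail;
 * each step "moves tuples" and advances the cursors toward each other
 * (simplified here to one page from each end per step).  Returns the
 * number of steps until they meet.
 */
static int
squeeze_steps(int chain_len)
{
    int write_pos = 0;
    int read_pos = chain_len - 1;
    int steps = 0;

    while (write_pos < read_pos)
    {
        read_pos--;                     /* read page works backward */
        if (write_pos < read_pos)
            write_pos++;                /* write page works forward */
        steps++;
        assert(write_pos <= read_pos);  /* cursors never cross */
    }
    return steps;
}
```

In the real code the write cursor only advances once a write page fills up, but the never-cross property demonstrated here holds either way.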
Re: Is this a problem in GenericXLogFinish()?
On Wed, Oct 25, 2023 at 10:06:23PM -0700, Jeff Davis wrote:
> Thank you for the report! I was able to reproduce and observe that the
> buffer is not marked dirty.
>
> The call (hashovfl.c:671):
>
>   XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD)
>
> is followed unconditionally by:
>
>   PageSetLSN(BufferGetPage(wbuf), recptr)
>
> so if the Assert were not there, it would be setting the LSN on a page
> that's not marked dirty. Therefore I think this is a bug, or at least
> an interesting/unexpected case.
>
> Perhaps the registration and the PageSetLSN don't need to happen when
> nitups==0?

Hmm. Looking at hash_xlog_squeeze_page(), it looks like wbuf is expected to always be registered. So, you're right here: it should be OK to be less aggressive with setting the page LSN on wbuf if ntups is 0, but there's more to it? For example, it is possible that bucketbuf, prevbuf and wbuf refer to the same buffer, causing is_prim_bucket_same_wrt and is_prev_bucket_same_wrt to be both true. Particularly, if nextbuf and wbuf are the same buffers, we finish by registering twice the same buffer with two different IDs to perform the tuple insertion and the opaque area updates in two steps. And that's.. Err.. not really necessary? I mean, as far as I read this code you could just reuse the buffer registered at ID 1 and update its opaque area if is_prim_bucket_same_wrt is true?

--
Michael
Re: Is this a problem in GenericXLogFinish()?
On Thu, 2023-10-26 at 07:00 +0300, Alexander Lakhin wrote: > It looks like the buffer is not dirty in the problematic call. Thank you for the report! I was able to reproduce and observe that the buffer is not marked dirty. The call (hashovfl.c:671): XLogRegisterBuffer(1, wbuf, REGBUF_STANDARD) is followed unconditionally by: PageSetLSN(BufferGetPage(wbuf), recptr) so if the Assert were not there, it would be setting the LSN on a page that's not marked dirty. Therefore I think this is a bug, or at least an interesting/unexpected case. Perhaps the registration and the PageSetLSN don't need to happen when nitups==0? Regards, Jeff Davis
Re: Is this a problem in GenericXLogFinish()?
Hello hackers,

24.10.2023 03:45, Jeff Davis wrote:
> Committed.

I've encountered a case with a hash index, when an assert added by the commit fails:

CREATE TABLE t (i INT);
CREATE INDEX hi ON t USING HASH (i);
INSERT INTO t SELECT 1 FROM generate_series(1, 1000);
BEGIN;
INSERT INTO t SELECT 1 FROM generate_series(1, 1000);
ROLLBACK;
CHECKPOINT;
VACUUM t;

Leads to:

Core was generated by `postgres: law regression [local] VACUUM '.
Program terminated with signal SIGABRT, Aborted.
warning: Section `.reg-xstate/211944' in core file too small.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139924569388864) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139924569388864, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x7f42b99ed476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x7f42b99d37f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x55f128e83f1b in ExceptionalCondition (conditionName=0x55f128f33520 "BufferIsExclusiveLocked(buffer) && BufferIsDirty(buffer)", fileName=0x55f128f333a8 "xloginsert.c", lineNumber=262) at assert.c:66
#6  0x55f1287edce3 in XLogRegisterBuffer (block_id=1 '\001', buffer=1808, flags=8 '\b') at xloginsert.c:262
#7  0x55f128742833 in _hash_freeovflpage (rel=0x7f42adb95c88, bucketbuf=1808, ovflbuf=1825, wbuf=1808, itups=0x7ffc3f18c010, itup_offsets=0x7ffc3f18bce0, tups_size=0x7ffc3f18ccd0, nitups=0, bstrategy=0x55f12a562820) at hashovfl.c:671
#8  0x55f128743567 in _hash_squeezebucket (rel=0x7f42adb95c88, bucket=6, bucket_blkno=7, bucket_buf=1808, bstrategy=0x55f12a562820) at hashovfl.c:1064
#9  0x55f12873ca2a in hashbucketcleanup (rel=0x7f42adb95c88, cur_bucket=6, bucket_buf=1808, bucket_blkno=7, bstrategy=0x55f12a562820, maxbucket=7, highmask=15, lowmask=7, tuples_removed=0x7ffc3f18fb28, num_index_tuples=0x7ffc3f18fb30, split_cleanup=false, callback=0x55f1289ba1de, callback_state=0x55f12a566778) at hash.c:921
#10 0x55f12873bfcf in hashbulkdelete (info=0x7ffc3f18fc30, stats=0x0, callback=0x55f1289ba1de, callback_state=0x55f12a566778) at hash.c:549
#11 0x55f128776fbb in index_bulk_delete (info=0x7ffc3f18fc30, istat=0x0, callback=0x55f1289ba1de, callback_state=0x55f12a566778) at indexam.c:709
#12 0x55f1289ba03d in vac_bulkdel_one_index (ivinfo=0x7ffc3f18fc30, istat=0x0, dead_items=0x55f12a566778) at vacuum.c:2480
#13 0x55f128771286 in lazy_vacuum_one_index (indrel=0x7f42adb95c88, istat=0x0, reltuples=-1, vacrel=0x55f12a4b9c30) at vacuumlazy.c:2768
#14 0x55f1287704a3 in lazy_vacuum_all_indexes (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2338
#15 0x55f128770275 in lazy_vacuum (vacrel=0x55f12a4b9c30) at vacuumlazy.c:2256
...

It looks like the buffer is not dirty in the problematic call.

Best regards,
Alexander
Re: Is this a problem in GenericXLogFinish()?
On Mon, 2023-10-23 at 10:14 -0400, Robert Haas wrote: > I think that would be a bug. I think it would be OK to just change > the > page LSN and nothing else, but that requires calling > MarkBufferDirty() > at some point. If you only call PageSetLSN() and never > MarkBufferDirty(), that sounds messed up: either the former should be > skipped, or the latter should be added. We shouldn't modify a shared > buffer without marking it dirty. > In that case, I think REGBUF_NO_CHANGE is the right name for the flag. Committed. Regards, Jeff Davis
Re: Is this a problem in GenericXLogFinish()?
On Fri, Oct 20, 2023 at 12:28 PM Jeff Davis wrote: > I still have a question though: if a buffer is exclusive-locked, > unmodified and clean, and then the caller registers it and later does > PageSetLSN (just as if it were dirty), is that a definite bug? > > There are a couple callsites where the control flow is complex enough > that it's hard to be sure the buffer is always marked dirty before > being registered (like in log_heap_visible(), as I mentioned upthread). > But those callsites are all doing PageSetLSN, unlike the hash index > case. I think that would be a bug. I think it would be OK to just change the page LSN and nothing else, but that requires calling MarkBufferDirty() at some point. If you only call PageSetLSN() and never MarkBufferDirty(), that sounds messed up: either the former should be skipped, or the latter should be added. We shouldn't modify a shared buffer without marking it dirty. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Thu, 2023-10-19 at 16:12 -0400, Robert Haas wrote: > For what it's worth, though, I think it would be better to > just make these cases exceptions to your Assert OK, I'll probably commit something like v4 then. I still have a question though: if a buffer is exclusive-locked, unmodified and clean, and then the caller registers it and later does PageSetLSN (just as if it were dirty), is that a definite bug? There are a couple callsites where the control flow is complex enough that it's hard to be sure the buffer is always marked dirty before being registered (like in log_heap_visible(), as I mentioned upthread). But those callsites are all doing PageSetLSN, unlike the hash index case. Regards, Jeff Davis
Re: Is this a problem in GenericXLogFinish()?
On Tue, Oct 17, 2023 at 12:38 PM Jeff Davis wrote: > I meant: are those cleanup operations frequent enough that dirtying > those buffers in that case would matter? Honestly, I'm not sure. Probably not? I mean, hashbucketcleanup() seems to only be called during vacuum or a bucket split, and I don't think you can have super-frequent calls to _hash_freeovflpage() either. For what it's worth, though, I think it would be better to just make these cases exceptions to your Assert, as you did in the patch, rather than changing them to dirty the buffer. There doesn't seem to be enough upside to making the assert unconditional to justify changing stuff that might have a real-world performance cost ... even if we don't think it would amount to much. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Tue, 2023-10-17 at 08:41 -0400, Robert Haas wrote: > Sorry, I'm not sure I understand the question. Are you asking whether > dirtying buffers unnecessarily might be slower than not doing that? I meant: are those cleanup operations frequent enough that dirtying those buffers in that case would matter? Regards, Jeff Davis
Re: Is this a problem in GenericXLogFinish()?
On Mon, Oct 16, 2023 at 7:31 PM Jeff Davis wrote: > Another option might be to just change the hash indexing code to follow > the correct protocol, locking and calling MarkBufferDirty() in those 3 > call sites. Marking the buffer dirty is easy, but making sure that it's > locked might require some refactoring. Robert, would following the > right protocol here affect performance? Sorry, I'm not sure I understand the question. Are you asking whether dirtying buffers unnecessarily might be slower than not doing that? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Wed, 2023-10-11 at 14:53 +0300, Heikki Linnakangas wrote:
> The comment suggests that you don't need to hold an exclusive lock
> when you call this, but there's an assertion that you do.

Reworded.

> Is it a new requirement that you must hold an exclusive lock on the
> buffer when you call XLogRegisterBuffer()? I think that's reasonable.

Agreed.

> I'd suggest a comment here to explain what's wrong if someone hits
> this assertion. Something like "The buffer must be marked as dirty
> before registering, unless REGBUF_CLEAN_OK is set to indicate that the
> buffer is not being modified".

Done, with different wording.

> > If we commit this, ideally someone would look into the places where
> > REGBUF_CLEAN_OK is used and make sure that it's not a bug, and
> > perhaps refactor so that it would benefit from the Assert. But
> > refactoring those places is outside of the scope of this patch.

I don't think it makes sense to register a buffer that is perhaps clean and perhaps not. After registering a buffer and writing WAL, you need to PageSetLSN(), and that should only be done after MarkBufferDirty(), right? So if you need to condition PageSetLSN() on whether MarkBufferDirty() has happened, it would be trivial to use the same condition for XLogRegisterBuffer(). Therefore I changed the name back to REGBUF_NO_CHANGE, as you suggested originally.

The hash indexing code knows it's not modifying the bucket buffer, doesn't mark it dirty, and doesn't set the LSN. I presume that's safe. However, there are quite a few places where XLogRegisterBuffer()+WAL+PageSetLSN() all happen, but it's not immediately obvious that all code paths to get there also MarkBufferDirty().
For instance:

    lazy_scan_heap()          -- conditionally marks heapbuf as dirty
      visibilitymap_set()     -- conditionally calls log_heap_visible
        log_heap_visible()
          XLogRegisterBuffer(1, heap_buffer, flags)

If those conditions don't match up exactly, it seems we could get into a situation where we update the LSN of a clean page, which seems bad. There are a few other places in the hash code and spgist code where I have similar concerns. Not many, though: I looked at all of the call sites (at least briefly) and most of them are fine.

> About that: you added the flag to a couple of XLogRegisterBuffer()
> calls in GIN, because they call MarkBufferDirty() only after
> XLogRegisterBuffer(). Those would be easy to refactor so I'd suggest
> doing that now.

Done.

> > It sounds like we have no intention to change the hash index code, so
> > are we OK if this flag just lasts forever? Do you still think it
> > offers a useful check?
>
> Yeah, I think this is a useful assertion. It might catch some bugs in
> extensions too.

I was asking more about the REGBUF_NO_CHANGE flag. It feels very specific to the hash indexes, and I'm not sure we want to encourage extensions to use it. Though maybe the locking protocol used by hash indexes is more generally applicable, and other indexing strategies might want to use it, too?

Another option might be to just change the hash indexing code to follow the correct protocol, locking and calling MarkBufferDirty() in those 3 call sites. Marking the buffer dirty is easy, but making sure that it's locked might require some refactoring. Robert, would following the right protocol here affect performance?

Regards,
Jeff Davis

From f0c2d489dd8f2268803f4d2cc8642e29cb0d3576 Mon Sep 17 00:00:00 2001
From: Jeff Davis
Date: Thu, 28 Sep 2023 11:19:06 -0700
Subject: [PATCH v4] Assert that buffers are marked dirty before
 XLogRegisterBuffer().

Enforce the rule from transam/README in XLogRegisterBuffer(), and update callers to follow the rule.
Hash indexes sometimes register clean pages as a part of the locking protocol, so provide a REGBUF_NO_CHANGE flag to support that use. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a...@iki.fi --- src/backend/access/gin/ginbtree.c | 14 +++--- src/backend/access/gin/gindatapage.c| 6 +++ src/backend/access/gin/ginentrypage.c | 3 ++ src/backend/access/gin/ginfast.c| 5 ++- src/backend/access/hash/hash.c | 11 +++-- src/backend/access/hash/hashovfl.c | 21 ++--- src/backend/access/transam/xloginsert.c | 16 ++- src/backend/storage/buffer/bufmgr.c | 59 + src/include/access/xloginsert.h | 1 + src/include/storage/bufmgr.h| 2 + 10 files changed, 118 insertions(+), 20 deletions(-) diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index 06e97a7fbb..fc694b40f1 100644 --- a/src/backend/access/gin/ginbtree.c +++ b/src/backend/access/gin/ginbtree.c @@ -387,24 +387,22 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack, START_CRIT_SECTION(); if (RelationNeedsWAL(btree->index) && !btree->isBuild) - { XLogBeginInsert(); - XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD); - if (BufferIsVa
Re: Is this a problem in GenericXLogFinish()?
On Wed, Oct 11, 2023 at 7:53 AM Heikki Linnakangas wrote:
> > + * Buffer must be pinned and exclusive-locked. (If caller does not hold
> > + * exclusive lock, then the result may be stale before it's returned.)
>
> The comment suggests that you don't need to hold an exclusive lock when
> you call this, but there's an assertion that you do.

I don't think the comment suggests that. It would if you only read the sentence in parentheses, but if you read both of them it seems clear enough. I guess the parenthetical sentence could say "If the caller did not hold an exclusive lock, then the result might become stale even before it was returned," basically putting the whole thing in the subjunctive.

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On 10/10/2023 22:57, Jeff Davis wrote:
> On Thu, 2023-09-28 at 12:05 -0700, Jeff Davis wrote:
> > Also, I ran into some problems with GIN that might require a bit more
> > refactoring in ginPlaceToPage(). Perhaps we could consider
> > REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and
> > use it temporarily until we have a chance to analyze/refactor each of
> > the callers that need it.
>
> For the Assert, I have attached a new patch for v17.

Thanks for working on this!

> +/*
> + * BufferIsDirty
> + *
> + * Checks if buffer is already dirty.
> + *
> + * Buffer must be pinned and exclusive-locked. (If caller does not hold
> + * exclusive lock, then the result may be stale before it's returned.)
> + */
> +bool
> +BufferIsDirty(Buffer buffer)
> +{
> +	BufferDesc *bufHdr;
> +
> +	if (BufferIsLocal(buffer))
> +	{
> +		int			bufid = -buffer - 1;
> +
> +		bufHdr = GetLocalBufferDescriptor(bufid);
> +	}
> +	else
> +	{
> +		bufHdr = GetBufferDescriptor(buffer - 1);
> +	}
> +
> +	Assert(BufferIsPinned(buffer));
> +	Assert(LWLockHeldByMeInMode(BufferDescriptorGetContentLock(bufHdr),
> +								LW_EXCLUSIVE));
> +
> +	return pg_atomic_read_u32(&bufHdr->state) & BM_DIRTY;
> +}

The comment suggests that you don't need to hold an exclusive lock when you call this, but there's an assertion that you do.

Is it a new requirement that you must hold an exclusive lock on the buffer when you call XLogRegisterBuffer()? I think that's reasonable. I wondered if that might be a problem when you WAL-log a page after setting hint bits on it with checksums enabled, since hint bits can be set holding just a share lock. But that path uses XLogSaveBufferForHint(), which calls XLogRegisterBlock() instead of XLogRegisterBuffer().

There is a comment above XLogSaveBufferForHint() that implies that XLogRegisterBuffer() requires you to hold an exclusive lock, but I don't see that spelled out explicitly in XLogRegisterBuffer() itself. Maybe add an assertion for that in XLogRegisterBuffer() to make it more explicit.
> --- a/src/backend/access/transam/xloginsert.c
> +++ b/src/backend/access/transam/xloginsert.c
> @@ -246,6 +246,7 @@ XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
>  	/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
>  	Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
> +	Assert((flags & REGBUF_CLEAN_OK) || BufferIsDirty(buffer));
>  	Assert(begininsert_called);
>  
>  	if (block_id >= max_registered_block_id)

I'd suggest a comment here to explain what's wrong if someone hits this assertion. Something like "The buffer must be marked as dirty before registering, unless REGBUF_CLEAN_OK is set to indicate that the buffer is not being modified".

> I changed the name of the flag to REGBUF_CLEAN_OK, because for some of
> the callsites it was not clear to me whether the page is always
> unchanged or sometimes unchanged. In other words, REGBUF_CLEAN_OK is a
> way to bypass the Assert for callsites where we either know that we
> actually want to register an unchanged page, or for callsites where we
> don't know if the page is changed or not.

Ok. A flag to explicitly mark that the page is not modified would actually be nice for pg_rewind. It could ignore such page references. Not that it matters much in practice, since those records are so rare. And for that, we'd need to include the flag in the WAL record too. So I think this is fine.

> If we commit this, ideally someone would look into the places where
> REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps
> refactor so that it would benefit from the Assert. But refactoring
> those places is outside of the scope of this patch.

About that: you added the flag to a couple of XLogRegisterBuffer() calls in GIN, because they call MarkBufferDirty() only after XLogRegisterBuffer(). Those would be easy to refactor so I'd suggest doing that now.

> It sounds like we have no intention to change the hash index code, so
> are we OK if this flag just lasts forever? Do you still think it offers
> a useful check?
Yeah, I think this is a useful assertion. It might catch some bugs in extensions too. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Is this a problem in GenericXLogFinish()?
I committed and backported 0001 (the GenericXLogFinish() fix but not the Assert). Strangely I didn't see the -committers email come through yet. If anyone notices anything wrong, please let me know before the final v11 release. On Thu, 2023-09-28 at 12:05 -0700, Jeff Davis wrote: > Also, I ran into some problems with GIN that might require a bit more > refactoring in ginPlaceToPage(). Perhaps we could consider > REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and > use it temporarily until we have a chance to analyze/refactor each of > the callers that need it. For the Assert, I have attached a new patch for v17. I changed the name of the flag to REGBUF_CLEAN_OK, because for some of the callsites it was not clear to me whether the page is always unchanged or sometimes unchanged. In other words, REGBUF_CLEAN_OK is a way to bypass the Assert for callsites where we either know that we actually want to register an unchanged page, or for callsites where we don't know if the page is changed or not. If we commit this, ideally someone would look into the places where REGBUF_CLEAN_OK is used and make sure that it's not a bug, and perhaps refactor so that it would benefit from the Assert. But refactoring those places is outside of the scope of this patch. It sounds like we have no intention to change the hash index code, so are we OK if this flag just lasts forever? Do you still think it offers a useful check? Regards, Jeff Davis From 859c8f4284de1030ab070630fa06af3c171264d9 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Thu, 28 Sep 2023 11:19:06 -0700 Subject: [PATCH v3] Assert that buffers are marked dirty before XLogRegisterBuffer(). A page must be marked dirty before writing WAL, as documented in transam/README. Enforce this rule in XLogRegisterBuffer(), though it's not actually a bug if (a) the page really is clean; or (b) the buffer is marked dirty after XLogRegisterBuffer() and before XLogInsert(). 
Update log_newpage_range(), where it's trivial to change the order so that we can check in XLogRegisterBuffer(). Other callers are more complex, and updating them is outside the scope of this patch (and perhaps not desirable), so introduce a flag REGBUF_CLEAN_OK to bypass the Assert. Note that this commit may have missed some callsites which can, at least in some cases, register a clean buffer. If such a callsite is found, it can be updated to use REGBUF_CLEAN_OK assuming it doesn't represent an actual bug. Discussion: https://postgr.es/m/c84114f8-c7f1-5b57-f85a-3adc31e1a...@iki.fi --- src/backend/access/gin/ginbtree.c | 3 ++- src/backend/access/gin/ginfast.c| 2 +- src/backend/access/hash/hash.c | 10 ++--- src/backend/access/hash/hashovfl.c | 9 +--- src/backend/access/transam/xloginsert.c | 3 ++- src/backend/storage/buffer/bufmgr.c | 30 + src/include/access/xloginsert.h | 2 ++ src/include/storage/bufmgr.h| 1 + 8 files changed, 51 insertions(+), 9 deletions(-) diff --git a/src/backend/access/gin/ginbtree.c b/src/backend/access/gin/ginbtree.c index 06e97a7fbb..c98176af77 100644 --- a/src/backend/access/gin/ginbtree.c +++ b/src/backend/access/gin/ginbtree.c @@ -389,7 +389,8 @@ ginPlaceToPage(GinBtree btree, GinBtreeStack *stack, if (RelationNeedsWAL(btree->index) && !btree->isBuild) { XLogBeginInsert(); - XLogRegisterBuffer(0, stack->buffer, REGBUF_STANDARD); + XLogRegisterBuffer(0, stack->buffer, + REGBUF_STANDARD | REGBUF_CLEAN_OK); if (BufferIsValid(childbuf)) XLogRegisterBuffer(1, childbuf, REGBUF_STANDARD); } diff --git a/src/backend/access/gin/ginfast.c b/src/backend/access/gin/ginfast.c index eb6c554831..e46e11df07 100644 --- a/src/backend/access/gin/ginfast.c +++ b/src/backend/access/gin/ginfast.c @@ -399,7 +399,7 @@ ginHeapTupleFastInsert(GinState *ginstate, GinTupleCollector *collector) Assert((ptr - collectordata) <= collector->sumsize); if (needWal) { - XLogRegisterBuffer(1, buffer, REGBUF_STANDARD); + XLogRegisterBuffer(1, buffer, REGBUF_STANDARD | 
REGBUF_CLEAN_OK); XLogRegisterBufData(1, collectordata, collector->sumsize); } diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index fc5d97f606..7a62100c89 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -824,11 +824,15 @@ hashbucketcleanup(Relation rel, Bucket cur_bucket, Buffer bucket_buf, XLogRegisterData((char *) &xlrec, SizeOfHashDelete); /* - * bucket buffer needs to be registered to ensure that we can - * acquire a cleanup lock on it during replay. + * bucket buffer was not changed, but still needs to be + * registered to ensure that we can acquire a cleanup lock on + * it during replay. */ if (!xlrec.is_primary_bucket_page) - XLogRegisterBuffer(0, bucket_buf, REGBUF_STANDARD | REGBUF_NO_IMAGE); +{ + uint8 flags = REGBUF_STANDARD | REGBUF_NO
Re: Is this a problem in GenericXLogFinish()?
On Wed, 2023-09-27 at 18:57 +0300, Heikki Linnakangas wrote: > We could define a new REGBUF_NO_CHANGE flag to signal that the buffer > is > registered just for locking purposes, and not actually modified by > the > WAL record. Out of curiosity I also added a trial assert (not intending to actually add this): Assert(!(flags & REGBUF_NO_CHANGE) || !BufferIsExclusiveLocked()); I expected that to fail, but it didn't -- perhaps that buffer is only locked during replay or something? I don't think we want that specific Assert; I was just experimenting with some cross-checks we could do to verify that REGBUF_NO_CHANGE is used properly. Also, I ran into some problems with GIN that might require a bit more refactoring in ginPlaceToPage(). Perhaps we could consider REGBUF_NO_CHANGE a general bypass of the Assert(BufferIsDirty()), and use it temporarily until we have a chance to analyze/refactor each of the callers that need it. Regards, Jeff Davis
Re: Is this a problem in GenericXLogFinish()?
On 27/09/2023 18:47, Robert Haas wrote:
> On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis wrote:
> > So it looks like it's intentionally registering a clean buffer so that
> > it can take a cleanup lock for reasons other than cleaning (or even
> > modifying) the page. I would think that there's a better way of
> > accomplishing that goal, so perhaps we can fix that?
>
> I had forgotten some of the details of how this works until you
> reminded me, but now that you've jarred my memory, I remember some of
> it. When Amit Kapila and I were working on the project to add
> WAL-logging to hash indexes, we ran into some problems with concurrency
> control for individual buckets within the hash index. Before that
> project, this was handled using heavyweight locks, one per bucket. That
> got changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons
> explained in that commit message. Basically, instead of taking
> heavyweight locks, we started taking cleanup locks on the primary
> bucket pages. I always thought that was a little awkward, but I didn't
> quite see how to do better.
>
> I don't think that I gave much thought at the time to the consequence
> you've uncovered here, namely that it means we're sometimes locking one
> page (the primary bucket page) because we want to do something to
> another bucket page (some page in the linked list of pages that are
> part of that bucket) and for that to be safe, we need to lock out
> concurrent scans of that bucket.
>
> I guess I don't know of any easy fix here. :-(

That seems OK. Maybe there would be a better way to do that, but there's nothing wrong with that approach per se.

We could define a new REGBUF_NO_CHANGE flag to signal that the buffer is registered just for locking purposes, and not actually modified by the WAL record.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: Is this a problem in GenericXLogFinish()?
On Wed, Sep 27, 2023 at 11:03 AM Jeff Davis wrote:
> So it looks like it's intentionally registering a clean buffer so that
> it can take a cleanup lock for reasons other than cleaning (or even
> modifying) the page. I would think that there's a better way of
> accomplishing that goal, so perhaps we can fix that?

I had forgotten some of the details of how this works until you reminded me, but now that you've jarred my memory, I remember some of it. When Amit Kapila and I were working on the project to add WAL-logging to hash indexes, we ran into some problems with concurrency control for individual buckets within the hash index. Before that project, this was handled using heavyweight locks, one per bucket. That got changed in 6d46f4783efe457f74816a75173eb23ed8930020 for the reasons explained in that commit message. Basically, instead of taking heavyweight locks, we started taking cleanup locks on the primary bucket pages. I always thought that was a little awkward, but I didn't quite see how to do better.

I don't think that I gave much thought at the time to the consequence you've uncovered here, namely that it means we're sometimes locking one page (the primary bucket page) because we want to do something to another bucket page (some page in the linked list of pages that are part of that bucket) and for that to be safe, we need to lock out concurrent scans of that bucket.

I guess I don't know of any easy fix here. :-(

-- 
Robert Haas
EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Wed, 2023-09-27 at 10:36 -0400, Robert Haas wrote:
> On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis wrote:
> > That site is pretty trivial to fix, but there are also a couple places
> > in hash.c and hashovfl.c that are registering a clean page and it's
> > not clear to me exactly what's going on.
>
> Huh, I wonder if that's just a bug. Do you know where exactly?

hashovfl.c:665 and hash.c:831

In both cases it's registering the bucket_buf, and has a comment like:

    /*
     * bucket buffer needs to be registered to ensure that we can
     * acquire a cleanup lock on it during replay.
     */

And various redo routines have comments like:

    /*
     * Ensure we have a cleanup lock on primary bucket page before we start
     * with the actual replay operation. This is to ensure that neither a
     * scan can start nor a scan can be already-in-progress during the
     * replay of this operation. If we allow scans during this operation,
     * then they can miss some records or show the same record multiple
     * times.
     */

So it looks like it's intentionally registering a clean buffer so that it can take a cleanup lock for reasons other than cleaning (or even modifying) the page. I would think that there's a better way of accomplishing that goal, so perhaps we can fix that?

Regards,
Jeff Davis
Re: Is this a problem in GenericXLogFinish()?
On Tue, Sep 26, 2023 at 9:36 PM Jeff Davis wrote: > That site is pretty trivial to fix, but there are also a couple places > in hash.c and hashovfl.c that are registering a clean page and it's not > clear to me exactly what's going on. Huh, I wonder if that's just a bug. Do you know where exactly? -- Robert Haas EDB: http://www.enterprisedb.com
Re: Is this a problem in GenericXLogFinish()?
On Wed, 2023-09-27 at 00:14 +0300, Heikki Linnakangas wrote: > Looks correct. You now loop through all the block IDs three times, > however. I wonder if that is measurably slower, but even if it's not, > was there a reason you wanted to move the XLogRegisterBuffer() calls > to > a separate loop? I did so to correspond more closely to what's outlined in the README and in other places in the code, where marking the buffers dirty happens before XLogBeginInsert(). It didn't occur to me that one extra loop would matter, but I can combine them again. It would be a bit more concise to do the XLogBeginInsert() first (like before) and then register the buffers in the same loop that does the writes and marks the buffers dirty. Updated patch attached. > > Hmm, I'm sure there are exceptions but log_newpage_range() actually > seems to be doing the right thing; it calls MarkBufferDirty() before > XLogInsert(). It only calls it after XLogRegisterBuffer() though, and > I > concur that XLogRegisterBuffer() would be the logical place for the > assertion. We could tighten this up, require that you call > MarkBufferDirty() before XLogRegisterBuffer(), and fix all the > callers. That site is pretty trivial to fix, but there are also a couple places in hash.c and hashovfl.c that are registering a clean page and it's not clear to me exactly what's going on. Regards, Jeff Davis From 9548bb49865db3a04bbdda4c63df611142e22964 Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Tue, 26 Sep 2023 12:10:11 -0700 Subject: [PATCH v2] Fix GenericXLogFinish(). Mark the buffers dirty before writing WAL. 
Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c0...@iki.fi --- src/backend/access/transam/generic_xlog.c | 53 ++- 1 file changed, 23 insertions(+), 30 deletions(-) diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c index 6c68191ca6..2d5ff4aa7c 100644 --- a/src/backend/access/transam/generic_xlog.c +++ b/src/backend/access/transam/generic_xlog.c @@ -347,6 +347,10 @@ GenericXLogFinish(GenericXLogState *state) START_CRIT_SECTION(); + /* + * Compute deltas if necessary, write changes to buffers, mark + * buffers dirty, and register changes. + */ for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++) { PageData *pageData = &state->pages[i]; @@ -359,41 +363,31 @@ GenericXLogFinish(GenericXLogState *state) page = BufferGetPage(pageData->buffer); pageHeader = (PageHeader) pageData->image; + /* delta not needed for a full page image */ + if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE)) +computeDelta(pageData, page, (Page) pageData->image); + + /* + * Apply the image, being careful to zero the "hole" between + * pd_lower and pd_upper in order to avoid divergence between + * actual page state and what replay would produce. + */ + memcpy(page, pageData->image, pageHeader->pd_lower); + memset(page + pageHeader->pd_lower, 0, + pageHeader->pd_upper - pageHeader->pd_lower); + memcpy(page + pageHeader->pd_upper, + pageData->image + pageHeader->pd_upper, + BLCKSZ - pageHeader->pd_upper); + + MarkBufferDirty(pageData->buffer); + if (pageData->flags & GENERIC_XLOG_FULL_IMAGE) { -/* - * A full-page image does not require us to supply any xlog - * data. Just apply the image, being careful to zero the - * "hole" between pd_lower and pd_upper in order to avoid - * divergence between actual page state and what replay would - * produce. 
- */ -memcpy(page, pageData->image, pageHeader->pd_lower); -memset(page + pageHeader->pd_lower, 0, - pageHeader->pd_upper - pageHeader->pd_lower); -memcpy(page + pageHeader->pd_upper, - pageData->image + pageHeader->pd_upper, - BLCKSZ - pageHeader->pd_upper); - XLogRegisterBuffer(i, pageData->buffer, REGBUF_FORCE_IMAGE | REGBUF_STANDARD); } else { -/* - * In normal mode, calculate delta and write it as xlog data - * associated with this page. - */ -computeDelta(pageData, page, (Page) pageData->image); - -/* Apply the image, with zeroed "hole" as above */ -memcpy(page, pageData->image, pageHeader->pd_lower); -memset(page + pageHeader->pd_lower, 0, - pageHeader->pd_upper - pageHeader->pd_lower); -memcpy(page + pageHeader->pd_upper, - pageData->image + pageHeader->pd_upper, - BLCKSZ - pageHeader->pd_upper); - XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD); XLogRegisterBufData(i, pageData->delta, pageData->deltaLen); } @@ -402,7 +396,7 @@ GenericXLogFinish(GenericXLogState *state) /* Insert xlog record */ lsn = XLogInsert(RM_GENERIC_ID, 0); - /* Set LSN and mark buffers dirty */ + /* Set LSN */ for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++) { PageData *pageData = &state->pages[i]; @@ -410,7 +404,6 @@ GenericXLogFinish(GenericXLogState *state) if (BufferIsInva
Re: Is this a problem in GenericXLogFinish()?
On 26/09/2023 22:32, Jeff Davis wrote:
> On Mon, 2023-09-25 at 13:04 +0300, Heikki Linnakangas wrote:
> > Yes, that's a problem. Patch attached.
>
> I rearranged the code a bit to follow the expected pattern of: write,
> mark dirty, WAL, set LSN. I think computing the deltas could also be
> moved earlier, outside of the critical section, but I'm not sure that
> would be useful.

Looks correct. You now loop through all the block IDs three times, however. I wonder if that is measurably slower, but even if it's not, was there a reason you wanted to move the XLogRegisterBuffer() calls to a separate loop?

> Do you have a suggestion for any kind of test addition, or should we
> just review carefully?
>
> > I wish we had an assertion for that. XLogInsert() could assert that
> > the page is already marked dirty, for example.
>
> Unfortunately that's not always the case, e.g. log_newpage_range().

Hmm, I'm sure there are exceptions but log_newpage_range() actually seems to be doing the right thing; it calls MarkBufferDirty() before XLogInsert(). It only calls it after XLogRegisterBuffer() though, and I concur that XLogRegisterBuffer() would be the logical place for the assertion. We could tighten this up, require that you call MarkBufferDirty() before XLogRegisterBuffer(), and fix all the callers.

-- 
Heikki Linnakangas
Neon (https://neon.tech)
Re: Is this a problem in GenericXLogFinish()?
On Mon, 2023-09-25 at 13:04 +0300, Heikki Linnakangas wrote: > Yes, that's a problem. Patch attached. I rearranged the code a bit to follow the expected pattern of: write, mark dirty, WAL, set LSN. I think computing the deltas could also be moved earlier, outside of the critical section, but I'm not sure that would be useful. Do you have a suggestion for any kind of test addition, or should we just review carefully? > I wish we had an assertion for that. XLogInsert() could assert that > the > page is already marked dirty, for example. Unfortunately that's not always the case, e.g. log_newpage_range(). Regards, Jeff Davis From c54749a7939721f0a838115abce09967216cbc0f Mon Sep 17 00:00:00 2001 From: Jeff Davis Date: Tue, 26 Sep 2023 12:10:11 -0700 Subject: [PATCH v1] Fix GenericXLogFinish(). Mark the buffers dirty before writing WAL. Discussion: https://postgr.es/m/25104133-7df8-cae3-b9a2-1c0aaa1c0...@iki.fi --- src/backend/access/transam/generic_xlog.c | 66 --- 1 file changed, 34 insertions(+), 32 deletions(-) diff --git a/src/backend/access/transam/generic_xlog.c b/src/backend/access/transam/generic_xlog.c index 6c68191ca6..2b7e054ebe 100644 --- a/src/backend/access/transam/generic_xlog.c +++ b/src/backend/access/transam/generic_xlog.c @@ -343,10 +343,12 @@ GenericXLogFinish(GenericXLogState *state) if (state->isLogged) { /* Logged relation: make xlog record in critical section. */ - XLogBeginInsert(); - START_CRIT_SECTION(); + /* + * Compute deltas if necessary, write changes to buffers, and mark + * them dirty. 
+ */ for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++) { PageData *pageData = &state->pages[i]; @@ -359,41 +361,42 @@ GenericXLogFinish(GenericXLogState *state) page = BufferGetPage(pageData->buffer); pageHeader = (PageHeader) pageData->image; + /* delta not needed for a full page image */ + if (!(pageData->flags & GENERIC_XLOG_FULL_IMAGE)) +computeDelta(pageData, page, (Page) pageData->image); + + /* + * Apply the image, being careful to zero the "hole" between + * pd_lower and pd_upper in order to avoid divergence between + * actual page state and what replay would produce. + */ + memcpy(page, pageData->image, pageHeader->pd_lower); + memset(page + pageHeader->pd_lower, 0, + pageHeader->pd_upper - pageHeader->pd_lower); + memcpy(page + pageHeader->pd_upper, + pageData->image + pageHeader->pd_upper, + BLCKSZ - pageHeader->pd_upper); + + MarkBufferDirty(pageData->buffer); + } + + XLogBeginInsert(); + + /* Register buffers */ + for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++) + { + PageData *pageData = &state->pages[i]; + + if (BufferIsInvalid(pageData->buffer)) +continue; + if (pageData->flags & GENERIC_XLOG_FULL_IMAGE) { -/* - * A full-page image does not require us to supply any xlog - * data. Just apply the image, being careful to zero the - * "hole" between pd_lower and pd_upper in order to avoid - * divergence between actual page state and what replay would - * produce. - */ -memcpy(page, pageData->image, pageHeader->pd_lower); -memset(page + pageHeader->pd_lower, 0, - pageHeader->pd_upper - pageHeader->pd_lower); -memcpy(page + pageHeader->pd_upper, - pageData->image + pageHeader->pd_upper, - BLCKSZ - pageHeader->pd_upper); - XLogRegisterBuffer(i, pageData->buffer, REGBUF_FORCE_IMAGE | REGBUF_STANDARD); } else { -/* - * In normal mode, calculate delta and write it as xlog data - * associated with this page. 
- */ -computeDelta(pageData, page, (Page) pageData->image); - -/* Apply the image, with zeroed "hole" as above */ -memcpy(page, pageData->image, pageHeader->pd_lower); -memset(page + pageHeader->pd_lower, 0, - pageHeader->pd_upper - pageHeader->pd_lower); -memcpy(page + pageHeader->pd_upper, - pageData->image + pageHeader->pd_upper, - BLCKSZ - pageHeader->pd_upper); - XLogRegisterBuffer(i, pageData->buffer, REGBUF_STANDARD); XLogRegisterBufData(i, pageData->delta, pageData->deltaLen); } @@ -402,7 +405,7 @@ GenericXLogFinish(GenericXLogState *state) /* Insert xlog record */ lsn = XLogInsert(RM_GENERIC_ID, 0); - /* Set LSN and mark buffers dirty */ + /* Set LSN */ for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++) { PageData *pageData = &state->pages[i]; @@ -410,7 +413,6 @@ GenericXLogFinish(GenericXLogState *state) if (BufferIsInvalid(pageData->buffer)) continue; PageSetLSN(BufferGetPage(pageData->buffer), lsn); - MarkBufferDirty(pageData->buffer); } END_CRIT_SECTION(); } -- 2.34.1
Re: Is this a problem in GenericXLogFinish()?
On 22/09/2023 23:52, Jeff Davis wrote:
> src/backend/transam/README says:
>
> ...
> 4. Mark the shared buffer(s) as dirty with MarkBufferDirty(). (This
> must happen before the WAL record is inserted; see notes in
> SyncOneBuffer().)
> ...
>
> But GenericXLogFinish() does this:
>
> ...
> 	/* Insert xlog record */
> 	lsn = XLogInsert(RM_GENERIC_ID, 0);
>
> 	/* Set LSN and mark buffers dirty */
> 	for (i = 0; i < MAX_GENERIC_XLOG_PAGES; i++)
> 	{
> 		PageData   *pageData = &state->pages[i];
>
> 		if (BufferIsInvalid(pageData->buffer))
> 			continue;
> 		PageSetLSN(BufferGetPage(pageData->buffer), lsn);
> 		MarkBufferDirty(pageData->buffer);
> 	}
>
> 	END_CRIT_SECTION();
>
> Am I missing something or is that a problem?

Yes, that's a problem. I wish we had an assertion for that. XLogInsert() could assert that the page is already marked dirty, for example.

-- 
Heikki Linnakangas
Neon (https://neon.tech)