Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Thu, Nov 9, 2017 at 8:25 AM, Asim Praveen wrote: > Indeed, the assertion tripped during WAL replay on the standby. This was > caught by TAP tests under src/test/recovery. The assertion is now fixed so > that WAL replay is exempt from the check. Please find the new patch > attached. The tests now pass with the fix. I also manually verified that > recovery works with "wal_consistency_checking=all". I still have a bad feeling about that bit... Still, it does not change the fact that patch 0001 in https://www.postgresql.org/message-id/CANXE4TccH_VjdKaHc9=KyH0Y7WORqZN+=mH5f=mp0bw3gzx...@mail.gmail.com needs a committer per the fact that it visibly fixes incorrect backend code and API contract. So I am switching the CF entry to ready for committer, but only for 0001. The other things could always be taken care of later. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Hi Michael On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier wrote: > > Did you really test WAL replay? This still ignores that PageGetLSN is > as well taken in some code paths, like recovery, where actions on the > page are guaranteed to be serialized, like during recovery, so this > patch would cause the system to blow up. Note that pageinspect, > amcheck and wal_consistency_checking also process on page copies. So > the assertion failure of 0002 would trigger in those cases. > Indeed, the assertion tripped during WAL replay on the standby. This was caught by TAP tests under src/test/recovery. The assertion is now fixed so that WAL replay is exempt from the check. Please find the new patch attached. The tests now pass with the fix. I also manually verified that recovery works with "wal_consistency_checking=all". Asim 0002-PageGetLSN-assert-that-locks-are-properly-held.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Wed, Nov 8, 2017 at 2:26 AM, Jacob Champion wrote: > On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier > wrote: >> Did you really test WAL replay? > > Is there a way to test this other than installcheck-world? The only > failure we've run into at the moment is in the snapshot-too-old tests. > Maybe we're not configuring with all the tests enabled? Not automatically. The simplest method I use in this case is installcheck with a standby replaying changes behind. >>> The assertion added caught at least one code path where TestForOldSnapshot >>> calls PageGetLSN without holding any lock. The snapshot_too_old test in >>> "check-world" failed due to the assertion failure. This needs to be fixed, >>> see the open question in the opening mail on this thread. >> >> Good point. I am looping Kevin Grittner here for his input, as the >> author and committer of old_snapshot_threshold. Things can be >> addressed with a separate patch by roughly moving the check on >> PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic() >> instead. > > It still doesn't pass the tests, as not all code paths ensure that a > content lock is held before calling TestForOldSnapshot. > BufferGetLSNAtomic only adds the spinlock. I would prefer waiting for more input from Kevin here. This may prove to be a more invasive change. There are no objections into fixing the existing callers in index AMs though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Tue, Nov 7, 2017 at 9:26 AM, Jacob Champion wrote: > On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier > wrote: >> It seems to me that 0001 is good for a committer lookup, that will get >> rid of all existing bugs. For 0002, what you are proposing is still >> not a good idea for anything using page copies. > > I think there is still significant confusion here. To be clear: this > patch is intended to add no changes for local page copies. Maybe a better way to put this is: we see no failures with the pageinspect regression tests, which exercise PageGetLSN() via the page_header() function. Are there other tests we should be paying attention to that might show a problem with our local-copy logic? --Jacob -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Hi Michael, On Mon, Nov 6, 2017 at 6:18 PM, Michael Paquier wrote: > Did you really test WAL replay? Is there a way to test this other than installcheck-world? The only failure we've run into at the moment is in the snapshot-too-old tests. Maybe we're not configuring with all the tests enabled? >> The assertion added caught at least one code path where TestForOldSnapshot >> calls PageGetLSN without holding any lock. The snapshot_too_old test in >> "check-world" failed due to the assertion failure. This needs to be fixed, >> see the open question in the opening mail on this thread. > > Good point. I am looping Kevin Grittner here for his input, as the > author and committer of old_snapshot_threshold. Things can be > addressed with a separate patch by roughly moving the check on > PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic() > instead. It still doesn't pass the tests, as not all code paths ensure that a content lock is held before calling TestForOldSnapshot. BufferGetLSNAtomic only adds the spinlock. > The commit fest has lost view of this entry, and so did I. So I have > added a new entry: > https://commitfest.postgresql.org/16/1371/ Thank you! > BufferGetLSNAtomic() could really use LWLockHeldByMe(). Could you > consider it with an extra patch on top of 0001? The LWLockHeldByMe() calls are added to BufferGetLSNAtomic in patch 0002 (because it does all its work through PageGetLSN). > It seems to me that 0001 is good for a committer lookup, that will get > rid of all existing bugs. For 0002, what you are proposing is still > not a good idea for anything using page copies. I think there is still significant confusion here. To be clear: this patch is intended to add no changes for local page copies. As I've tried to say (three or four times now): to our understanding, local page copies are not allocated in the shared BufferBlocks region and are therefore not subject to the new assertions. Am I missing a corner case, or completely misunderstanding your point? I never got a direct response to my earlier questions on this. --Jacob -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Tue, Nov 7, 2017 at 7:27 AM, Asim Praveen wrote: > On Mon, Oct 2, 2017 at 6:48 PM, Michael Paquier > wrote: >> Jacob, here are some ideas to make this thread move on. I would >> suggest to produce a set of patches that do things incrementally: >> 1) One patch that changes the calls of PageGetLSN to >> BufferGetLSNAtomic which are now not appropriate. You have spotted >> some of them in the btree and gist code, but not all based on my first >> lookup. There is still one in gistFindCorrectParent and one in btree >> _bt_split. The monitoring of the other calls (sequence.c and >> vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble >> that should be fixed I think. > > Thank you for your suggestions. Please find the first patch attached as > "0001-...". We verified both, gistFindCorrectParent and _bt_split and all > calls to PageGetLSN are made with exclusive lock on the buffer contents > held. Cool. Thanks for double-checking. XLogRecordAssemble() is fine after more lookup at this code, XLogRegisterBuffer already doing some sanity checks. >> 2) A second patch that strengthens a bit checks around >> BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you >> are originally suggesting. >> A comment could be as well added in bufpage.h for PageGetLSN to let >> users know that it should be used carefully, in the vein of what is >> mentioned in src/backend/access/transam/README. > > The second patch "0002-..." does the above. We have a comment added to > AssertPageIsLockedForLSN as suggested. Did you really test WAL replay? This still ignores that PageGetLSN is as well taken in some code paths, like recovery, where actions on the page are guaranteed to be serialized, like during recovery, so this patch would cause the system to blow up. Note that pageinspect, amcheck and wal_consistency_checking also process on page copies. So the assertion failure of 0002 would trigger in those cases. > The assertion added caught at least one code path where TestForOldSnapshot > calls PageGetLSN without holding any lock. The snapshot_too_old test in > "check-world" failed due to the assertion failure. This needs to be fixed, > see the open question in the opening mail on this thread. Good point. I am looping Kevin Grittner here for his input, as the author and committer of old_snapshot_threshold. Things can be addressed with a separate patch by roughly moving the check on PageGetLSN() to TestForOldSnapshot_impl() and use BufferGetLSNAtomic() instead. The commit fest has lost view of this entry, and so did I. So I have added a new entry: https://commitfest.postgresql.org/16/1371/ BufferGetLSNAtomic() could really use LWLockHeldByMe(). Could you consider it with an extra patch on top of 0001? It seems to me that 0001 is good for a committer lookup, that will get rid of all existing bugs. For 0002, what you are proposing is still not a good idea for anything using page copies. Here are some suggestions: - Implement a PageGetLSNFromCopy, dedicated at working correctly when working on a page copy. Then switch callers of amcheck, pageinspect and wal_consistency_checking to use that. - Implement something like GetLSNFromLockedPage, and switch of backend's PageGetLSN to that. Performance impact could be seen.. - Have a PageGetLSNSafe, which can be used safely for serialized actions. It could be an idea to remove PageGetLSN to force a breakage of extensions calling it, so as they would review any of its calls. Not a fan of that though. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Hi Michael On Mon, Oct 2, 2017 at 6:48 PM, Michael Paquier wrote: > > Jacob, here are some ideas to make this thread move on. I would > suggest to produce a set of patches that do things incrementally: > 1) One patch that changes the calls of PageGetLSN to > BufferGetLSNAtomic which are now not appropriate. You have spotted > some of them in the btree and gist code, but not all based on my first > lookup. There is still one in gistFindCorrectParent and one in btree > _bt_split. The monitoring of the other calls (sequence.c and > vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble > that should be fixed I think. Thank you for your suggestions. Please find the first patch attached as "0001-...". We verified both, gistFindCorrectParent and _bt_split and all calls to PageGetLSN are made with exclusive lock on the buffer contents held. > 2) A second patch that strengthens a bit checks around > BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you > are originally suggesting. > A comment could be as well added in bufpage.h for PageGetLSN to let > users know that it should be used carefully, in the vein of what is > mentioned in src/backend/access/transam/README. The second patch "0002-..." does the above. We have a comment added to AssertPageIsLockedForLSN as suggested. The assertion added caught at least one code path where TestForOldSnapshot calls PageGetLSN without holding any lock. The snapshot_too_old test in "check-world" failed due to the assertion failure. This needs to be fixed, see the open question in the opening mail on this thread. Asim and Jacob 0001-Change-incorrect-calls-to-PageGetLSN-to-BufferGetLSN.patch Description: Binary data 0002-PageGetLSN-assert-that-locks-are-properly-held.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Fri, Oct 6, 2017 at 7:34 PM, Alvaro Herrera wrote: > Just search for "Æsop" in the archives: > https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR=fmxhsjsn_mri3rjygvbrmcrff...@mail.gmail.com (laugh) I didn't know this one. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Michael Paquier wrote: > On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera > wrote: > > I'd argue about this in the same direction I argued about > > BufferGetPage() needing an LSN check that's applied separately: if it's > > too easy for a developer to do the wrong thing (i.e. fail to apply the > > separate LSN check after reading the page) then the routine should be > > patched to do the check itself, and another routine should be offered > > for the cases that explicitly can do without the check. > > > > I was eventually outvoted in the BufferGetPage() saga, though, so I'm > > not sure that that discussion sets precedent. > > Oh... I don't recall this discussion. A quick lookup at the archives > does not show me a specific thread either. Just search for "Æsop" in the archives: https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR=fmxhsjsn_mri3rjygvbrmcrff...@mail.gmail.com -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Michael Paquier wrote: > On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera > wrote: > > I'd argue about this in the same direction I argued about > > BufferGetPage() needing an LSN check that's applied separately: if it's > > too easy for a developer to do the wrong thing (i.e. fail to apply the > > separate LSN check after reading the page) then the routine should be > > patched to do the check itself, and another routine should be offered > > for the cases that explicitly can do without the check. > > > > I was eventually outvoted in the BufferGetPage() saga, though, so I'm > > not sure that that discussion sets precedent. > > Oh... I don't recall this discussion. A quick lookup at the archives > does not show me a specific thread either. I mean Kevin patch for snapshot-too-old: https://www.postgresql.org/message-id/CACjxUsPPCbov6DDPnuGpR%3DFmxHsjSn_MRi3rJYgvbRMCRfFz%2BA%40mail.gmail.com -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Tue, Oct 3, 2017 at 12:54 AM, Alvaro Herrera wrote: > Michael Paquier wrote: >> On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas wrote: >> > On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier >> > wrote: >> >> So those bits could be considered for integration. >> > >> > Yes, and they also tend to suggest that the rest of the patch has value. >> >> Well, there are cases where you don't need any locking checks, and the >> proposed patch ignores that. Take for example pageinspect which works >> on a copy of a page, or just WAL replay which serializes everything, >> and in both cases PageGetLSN can be used safely. So for compatibility >> reasons I think that PageGetLSN should be kept alone to not surprise >> anybody using it, or at least an equivalent should be introduced. It >> would be interesting to make BufferGetLSNAtomic hold tighter checks, >> like something that makes use of LWLockHeldByMe for example. Jacob, here are some ideas to make this thread move on. I would suggest to produce a set of patches that do things incrementally: 1) One patch that changes the calls of PageGetLSN to BufferGetLSNAtomic which are now not appropriate. You have spotted some of them in the btree and gist code, but not all based on my first lookup. There is still one in gistFindCorrectParent and one in btree _bt_split. The monitoring of the other calls (sequence.c and vacuumlazy.c) looked safe. There is another one in XLogRecordAssemble that should be fixed I think. 2) A second patch that strengthens a bit checks around BufferGetLSNAtomic. One idea would be to use LWLockHeldByMe, as you are originally suggesting. A comment could be as well added in bufpage.h for PageGetLSN to let users know that it should be used carefully, in the vein of what is mentioned in src/backend/access/transam/README. > I'd argue about this in the same direction I argued about > BufferGetPage() needing an LSN check that's applied separately: if it's > too easy for a developer to do the wrong thing (i.e. fail to apply the > separate LSN check after reading the page) then the routine should be > patched to do the check itself, and another routine should be offered > for the cases that explicitly can do without the check. > > I was eventually outvoted in the BufferGetPage() saga, though, so I'm > not sure that that discussion sets precedent. Oh... I don't recall this discussion. A quick lookup at the archives does not show me a specific thread either. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Mon, Oct 2, 2017 at 11:05 AM, Michael Paquier wrote: > Well, there are cases where you don't need any locking checks, and the > proposed patch ignores that. I understand that, but shouldn't we then look for a way to adjust the patch so that it doesn't have that issue any longer, rather than just kicking it to the curb? I mean, just saying "patch suxxor, next" doesn't seem like the right approach to something that has apparently already found real problems. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Michael Paquier wrote: > On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas wrote: > > On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier > > wrote: > >> So those bits could be considered for integration. > > > > Yes, and they also tend to suggest that the rest of the patch has value. > > Well, there are cases where you don't need any locking checks, and the > proposed patch ignores that. Take for example pageinspect which works > on a copy of a page, or just WAL replay which serializes everything, > and in both cases PageGetLSN can be used safely. So for compatibility > reasons I think that PageGetLSN should be kept alone to not surprise > anybody using it, or at least an equivalent should be introduced. It > would be interesting to make BufferGetLSNAtomic hold tighter checks, > like something that makes use of LWLockHeldByMe for example. I'd argue about this in the same direction I argued about BufferGetPage() needing an LSN check that's applied separately: if it's too easy for a developer to do the wrong thing (i.e. fail to apply the separate LSN check after reading the page) then the routine should be patched to do the check itself, and another routine should be offered for the cases that explicitly can do without the check. I was eventually outvoted in the BufferGetPage() saga, though, so I'm not sure that that discussion sets precedent. -- Álvaro Herrerahttps://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Mon, Oct 2, 2017 at 11:44 PM, Robert Haas wrote: > On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier > wrote: >> So those bits could be considered for integration. > > Yes, and they also tend to suggest that the rest of the patch has value. Well, there are cases where you don't need any locking checks, and the proposed patch ignores that. Take for example pageinspect which works on a copy of a page, or just WAL replay which serializes everything, and in both cases PageGetLSN can be used safely. So for compatibility reasons I think that PageGetLSN should be kept alone to not surprise anybody using it, or at least an equivalent should be introduced. It would be interesting to make BufferGetLSNAtomic hold tighter checks, like something that makes use of LWLockHeldByMe for example. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Mon, Oct 2, 2017 at 10:19 AM, Michael Paquier wrote: > So those bits could be considered for integration. Yes, and they also tend to suggest that the rest of the patch has value. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Mon, Oct 2, 2017 at 9:09 PM, Robert Haas wrote: > I think the first question we ought to be asking ourselves is whether > any of the PageGetLSN -> BufferGetLSNAtomic changes the patch > introduces are live bugs. If they are, then we ought to fix those > separately (and probably back-patch). If they are not, then we need > to think about how to adjust the patch so that it doesn't complain > about things that are in fact OK. If you look at each item one by one that the patch touches based on the contract defined in transam/README... --- a/src/backend/access/gist/gist.c +++ b/src/backend/access/gist/gist.c @@ -640,7 +640,7 @@ gistdoinsert(Relation r, IndexTuple itup, Size freespace, GISTSTATE *giststate) } stack->page = (Page) BufferGetPage(stack->buffer); - stack->lsn = PageGetLSN(stack->page); + stack->lsn = BufferGetLSNAtomic(stack->buffer); There is an incorrect use of PageGetLSN here. A shared lock can be taken on the page but there is no buffer header lock used when using PageGetLSN. @@ -890,7 +890,7 @@ gistFindPath(Relation r, BlockNumber child, OffsetNumber *downlinkoffnum) break; } - top->lsn = PageGetLSN(page); + top->lsn = BufferGetLSNAtomic(buffer); Here as well only a shared lock is taken on the page. @@ -61,7 +61,7 @@ gistkillitems(IndexScanDesc scan) * read. killedItems could be not valid so LP_DEAD hints applying is not * safe. */ - if (PageGetLSN(page) != so->curPageLSN) + if (BufferGetLSNAtomic(buffer) != so->curPageLSN) { UnlockReleaseBuffer(buffer); so->numKilled = 0; /* reset counter */ @@ -384,7 +384,7 @@ gistScanPage(IndexScanDesc scan, GISTSearchItem *pageItem, double *myDistances, * safe to apply LP_DEAD hints to the page later. This allows us to drop * the pin for MVCC scans, which allows vacuum to avoid blocking. */ - so->curPageLSN = PageGetLSN(page); + so->curPageLSN = BufferGetLSNAtomic(buffer); Those two as well only use a shared lock. @@ -249,7 +249,7 @@ gistbulkdelete(IndexVacuumInfo *info, IndexBulkDeleteResult *stats, ptr = (GistBDItem *) palloc(sizeof(GistBDItem)); ptr->blkno = ItemPointerGetBlockNumber(&(idxtuple->t_tid)); - ptr->parentlsn = PageGetLSN(page); + ptr->parentlsn = BufferGetLSNAtomic(buffer); ptr->next = stack->next; stack->next = ptr; Here also a shared lock is only taken, that's a VACUUM code path for Gist indexes. +++ b/src/backend/access/nbtree/nbtsearch.c @@ -1224,7 +1224,7 @@ _bt_readpage(IndexScanDesc scan, ScanDirection dir, OffsetNumber offnum) * safe to apply LP_DEAD hints to the page later. This allows us to drop * the pin for MVCC scans, which allows vacuum to avoid blocking. */ - so->currPos.lsn = PageGetLSN(page); + so->currPos.lsn = BufferGetLSNAtomic(so->currPos.buf); Here the caller of _bt_readpage should have taken a lock, but the page is not necessarily pinned. Still, _bt_getbuf makes sure that the pin is done. +++ b/src/backend/access/nbtree/nbtutils.c @@ -1772,7 +1772,7 @@ _bt_killitems(IndexScanDesc scan) return; page = BufferGetPage(buf); - if (PageGetLSN(page) == so->currPos.lsn) + if (BufferGetLSNAtomic(buf) == so->currPos.lsn) so->currPos.buf = buf; Same here thanks to _bt_getbuf. So those bits could be considered for integration. Also, if I read the gist code correctly, there is one other case in gistFindCorrectParent. And in btree, there is one occurence in _bt_split. In XLogRecordAssemble, there could be more checks regarding the locks taken on the buffer registered. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Mon, Sep 4, 2017 at 2:14 AM, Michael Paquier wrote: A>> that would trip it. The latter part is still in progress, because I'm > Well, PageGetLSN can be used in some hot code paths, xloginsert.c > being one, so it does not seem wise to me to switch it to something > more complicated than a macro, and also it is not bounded to any > locking contracts now. I don't see how turning it into a static inline function is worse. We've been doing a fair amount of that lately and it generally makes things better, not worse, often avoiding multiple evaluation hazards at the same time it generates better machine code. I find this patch sort of messy; in particular, the definition of AssertPageIsLockedForLSN tries to reverse-engineer a buffer ID from a Page, and that's sort of ugly. But I think the concept of trying to make sure that our code is adhering to necessary locking rules is a pretty good one. I think the first question we ought to be asking ourselves is whether any of the PageGetLSN -> BufferGetLSNAtomic changes the patch introduces are live bugs. If they are, then we ought to fix those separately (and probably back-patch). If they are not, then we need to think about how to adjust the patch so that it doesn't complain about things that are in fact OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
> On 20 Sep 2017, at 00:29, Jacob Champion wrote: > > On Wed, Sep 6, 2017 at 8:37 AM, Jacob Champion wrote: >> On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier >> wrote: >>> In short, it seems to me that this patch should be rejected in its >>> current shape. >> >> Is the half of the patch that switches PageGetLSN to >> BufferGetLSNAtomic correct, at least? > > Any further thoughts on this? If the BufferGetLSNAtomic fixes made > here are not correct to begin with, then the rest of the patch is > probably moot; I just want to double-check that that is the case. Based on the discussions in this thread, I’m marking this patch Returned with feedback. Please re-submit a new version in an upcoming commitfest when ready. cheers ./daniel -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Wed, Sep 6, 2017 at 8:37 AM, Jacob Champion wrote: > On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier > wrote: >> In short, it seems to me that this patch should be rejected in its >> current shape. > > Is the half of the patch that switches PageGetLSN to > BufferGetLSNAtomic correct, at least? Any further thoughts on this? If the BufferGetLSNAtomic fixes made here are not correct to begin with, then the rest of the patch is probably moot; I just want to double-check that that is the case. --Jacob -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Tue, Sep 5, 2017 at 10:49 PM, Michael Paquier wrote: > On Wed, Sep 6, 2017 at 1:15 AM, Jacob Champion wrote: >> On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier >> wrote: >>> +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND) >>> +void >>> +AssertPageIsLockedForLSN(Page page) >>> +{ >>> From a design point of view, bufmgr.c is an API interface for buffer >>> management on the backend-size, so just using FRONTEND there looks >>> wrong to me (bufmgr.h is already bad enough IMO now). >> >> The comments suggested that bufmgr.h could be incidentally pulled into >> frontend code through other headers. Or do you mean that I don't need >> to check for FRONTEND in the C source file (since, I assume, it's only >> ever being compiled for the backend)? I'm not sure what you mean by >> "bufmgr.h is already bad enough". > > I mean that this should not become worse without a better reason. This > patch makes it so. > >>> This bit in src/backend/access/transam/README may interest you: >>> Note that we must only use PageSetLSN/PageGetLSN() when we know the action >>> is serialised. Only Startup process may modify data blocks during recovery, >>> so Startup process may execute PageGetLSN() without fear of serialisation >>> problems. All other processes must only call PageSet/GetLSN when holding >>> either an exclusive buffer lock or a shared lock plus buffer header lock, >>> or be writing the data block directly rather than through shared buffers >>> while holding AccessExclusiveLock on the relation. >>> >>> So I see no problem with the exiting caller. >> >> Is heap_xlog_visible only called during startup? > > Recovery is more exact. When replaying a XLOG_HEAP2_VISIBLE record. > >> If so, is there a >> general rule for which locks are required during startup and which >> aren't? > > You can surely say that a process is fine to read a variable without a > lock if it is the only one updating it, other processes would need a > lock to read a variable. In this case the startup process is the only > one updating blocks, so that's at least one code path where the is no > need to take a lock when looking at a page LSN with PageGetLSN. Is there a way to programmatically determine whether or not we're in such a situation? That could be added to the assertion. The code here is pretty inconsistent on whether or not PageGetLSN is called inside or outside a lock -- see XLogReadBufferForRedoExtended for instance. > Another example is pageinspect which works on a copy of a page and > uses PageGetLSN, so no direct locks on the buffer page is needed. And again -- this case should be correctly handled by the new assertion. Copies of pages are not checked for locks; we can't recover a buffer header from a page that isn't part of the BufferBlocks array. > In short, it seems to me that this patch should be rejected in its > current shape. Is the half of the patch that switches PageGetLSN to BufferGetLSNAtomic correct, at least? > There is no need to put more constraints on a API which > does not need more constraints and which is used widely by extensions. The goal here is not to add more constraints, but to verify that the existing constraints are followed. Perhaps this patch doesn't do that well/correctly, but I want to make sure we're on the same page regarding the end goal. --Jacob -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Wed, Sep 6, 2017 at 1:15 AM, Jacob Champion wrote: > On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier > wrote: >> +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND) >> +void >> +AssertPageIsLockedForLSN(Page page) >> +{ >> From a design point of view, bufmgr.c is an API interface for buffer >> management on the backend-size, so just using FRONTEND there looks >> wrong to me (bufmgr.h is already bad enough IMO now). > > The comments suggested that bufmgr.h could be incidentally pulled into > frontend code through other headers. Or do you mean that I don't need > to check for FRONTEND in the C source file (since, I assume, it's only > ever being compiled for the backend)? I'm not sure what you mean by > "bufmgr.h is already bad enough". I mean that this should not become worse without a better reason. This patch makes it so. >> This bit in src/backend/access/transam/README may interest you: >> Note that we must only use PageSetLSN/PageGetLSN() when we know the action >> is serialised. Only Startup process may modify data blocks during recovery, >> so Startup process may execute PageGetLSN() without fear of serialisation >> problems. All other processes must only call PageSet/GetLSN when holding >> either an exclusive buffer lock or a shared lock plus buffer header lock, >> or be writing the data block directly rather than through shared buffers >> while holding AccessExclusiveLock on the relation. >> >> So I see no problem with the exiting caller. > > Is heap_xlog_visible only called during startup? Recovery is more exact. When replaying a XLOG_HEAP2_VISIBLE record. > If so, is there a > general rule for which locks are required during startup and which > aren't? You can surely say that a process is fine to read a variable without a lock if it is the only one updating it, other processes would need a lock to read a variable. In this case the startup process is the only one updating blocks, so that's at least one code path where the is no need to take a lock when looking at a page LSN with PageGetLSN. Another example is pageinspect which works on a copy of a page and uses PageGetLSN, so no direct locks on the buffer page is needed. In short, it seems to me that this patch should be rejected in its current shape. There is no need to put more constraints on a API which does not need more constraints and which is used widely by extensions. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Hi Michael, thanks for the review! On Sun, Sep 3, 2017 at 11:14 PM, Michael Paquier wrote: >> While working on checksum support for GPDB, we noticed that several >> callers of PageGetLSN() didn't follow the correct locking procedure. > > Well, PageGetLSN can be used in some hot code paths, xloginsert.c > being one, so it does not seem wise to me to switch it to something > more complicated than a macro, If raw perf is an issue -- now that I've refactored the assertion into its own function, I could probably push the implementation back into a macro. Something like #define PageGetLSN(page) (AssertPageIsLockedForLSN((Page) page), PageXLogRecPtrGet(((PageHeader) page)->pd_lsn)) Callers would need to watch out for the double-evaluation, but that's already documented with this group of macros. > and also it is not bounded to any > locking contracts now. For example, pageinspect works on a copy of the > buffer, so that's one case where we just don't care. So we should it > be tightened in Postgres? The proposed patch doesn't tighten the contract in this case -- if the buffer that's passed to PageGetLSN() isn't a shared buffer (i.e. if it's not part of the BufferBlocks array), the assertion passes unconditionally. > That's the portion of the proposal I don't > get. You could introduce a custom API for this purpose, isn't the > origin of your problems with GPDB for checksums that data is > replicated across many nodes so things need to be locked uniquely? GPDB introduces additional problems, but I think this issue is independent of anything GPDB-specific. One way to look at it: are the changes in the proposed patch, from PageGetLSN to BufferGetLSNAtomic, correct? If not, well, that's a good discussion to have. But if they are correct, then why were they missed? and is there a way to help future contributors out in the future? That's the goal of this patch; I don't think it's something that a custom API solves. >> - This change introduces a subtle source incompatibility: whereas >> previously both Pages and PageHeaders could be passed to PageGetLSN(), >> now callers must explicitly cast to Page if they don't already have >> one. > > That would produce warnings for extensions, so people would likely > catch that when compiling with a newer version, if they use -Werror... > >> - Is my use of FRONTEND here correct? The documentation made it seem >> like some compilers could try to link against the >> AssertPageIsLockedForLSN function, even if PageGetLSN was never >> called. > > +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND) > +void > +AssertPageIsLockedForLSN(Page page) > +{ > From a design point of view, bufmgr.c is an API interface for buffer > management on the backend-size, so just using FRONTEND there looks > wrong to me (bufmgr.h is already bad enough IMO now). The comments suggested that bufmgr.h could be incidentally pulled into frontend code through other headers. Or do you mean that I don't need to check for FRONTEND in the C source file (since, I assume, it's only ever being compiled for the backend)? I'm not sure what you mean by "bufmgr.h is already bad enough". >> - Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't >> really know what the best way is to fix it. It explicitly unlocks the >> buffer, with the comment that visibilitymap_set() needs to handle >> locking itself, but then calls PageGetLSN() to determine whether >> visibilitymap_set() needs to be called. > > This bit in src/backend/access/transam/README may interest you: > Note that we must only use PageSetLSN/PageGetLSN() when we know the action > is serialised. Only Startup process may modify data blocks during recovery, > so Startup process may execute PageGetLSN() without fear of serialisation > problems. All other processes must only call PageSet/GetLSN when holding > either an exclusive buffer lock or a shared lock plus buffer header lock, > or be writing the data block directly rather than through shared buffers > while holding AccessExclusiveLock on the relation. > > So I see no problem with the exiting caller. Is heap_xlog_visible only called during startup? If so, is there a general rule for which locks are required during startup and which aren't? Currently there is no exception in the assertion made for startup processes, so that's something that would need to be changed. Is there a way to determine whether the current process considers itself to be a startup process? Thanks again! --Jacob -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Fri, Sep 1, 2017 at 3:53 AM, Jacob Champion wrote: > The patch is really two pieces: add the assertion, and fix the callers > that would trip it. The latter part is still in progress, because I'm > running into some places where I'm not sure what the correct way > forward is. I looked at your patch. > While working on checksum support for GPDB, we noticed that several > callers of PageGetLSN() didn't follow the correct locking procedure. Well, PageGetLSN can be used in some hot code paths, xloginsert.c being one, so it does not seem wise to me to switch it to something more complicated than a macro, and also it is not bounded to any locking contracts now. For example, pageinspect works on a copy of the buffer, so that's one case where we just don't care. So we should it be tightened in Postgres? That's the portion of the proposal I don't get. You could introduce a custom API for this purpose, isn't the origin of your problems with GPDB for checksums that data is replicated across many nodes so things need to be locked uniquely? > - This change introduces a subtle source incompatibility: whereas > previously both Pages and PageHeaders could be passed to PageGetLSN(), > now callers must explicitly cast to Page if they don't already have > one. That would produce warnings for extensions, so people would likely catch that when compiling with a newer version, if they use -Werror... > - Is my use of FRONTEND here correct? The documentation made it seem > like some compilers could try to link against the > AssertPageIsLockedForLSN function, even if PageGetLSN was never > called. +#if defined(USE_ASSERT_CHECKING) && !defined(FRONTEND) +void +AssertPageIsLockedForLSN(Page page) +{ >From a design point of view, bufmgr.c is an API interface for buffer management on the backend-size, so just using FRONTEND there looks wrong to me (bufmgr.h is already bad enough IMO now). > - Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't > really know what the best way is to fix it. It explicitly unlocks the > buffer, with the comment that visibilitymap_set() needs to handle > locking itself, but then calls PageGetLSN() to determine whether > visibilitymap_set() needs to be called. This bit in src/backend/access/transam/README may interest you: Note that we must only use PageSetLSN/PageGetLSN() when we know the action is serialised. Only Startup process may modify data blocks during recovery, so Startup process may execute PageGetLSN() without fear of serialisation problems. All other processes must only call PageSet/GetLSN when holding either an exclusive buffer lock or a shared lock plus buffer header lock, or be writing the data block directly rather than through shared buffers while holding AccessExclusiveLock on the relation. So I see no problem with the exiting caller. -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Fri, Sep 1, 2017 at 12:24 PM, Jacob Champion wrote: > On Fri, Sep 1, 2017 at 9:12 AM, Robert Haas wrote: >> It's a good idea to add patches to commitfest.postgresql.org > > Hi Robert, I added it yesterday as > https://commitfest.postgresql.org/14/1279/ . Let me know if I need to > touch anything up there. Nah, looks fine, I just somehow missed it when I looked the first time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Fri, Sep 1, 2017 at 9:12 AM, Robert Haas wrote: > It's a good idea to add patches to commitfest.postgresql.org Hi Robert, I added it yesterday as https://commitfest.postgresql.org/14/1279/ . Let me know if I need to touch anything up there. Thanks! --Jacob -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
On Thu, Aug 31, 2017 at 2:53 PM, Jacob Champion wrote: > While working on checksum support for GPDB, we noticed that several > callers of PageGetLSN() didn't follow the correct locking procedure. > To try to help ferret out present and future mistakes, we added an > assertion to PageGetLSN() that checks whether those locks were being > held correctly. This patch is a first-draft attempt to port that > assertion back up to postgres master, based on work by Asim Praveen, > Ashwin Agrawal, and myself. > > The patch is really two pieces: add the assertion, and fix the callers > that would trip it. The latter part is still in progress, because I'm > running into some places where I'm not sure what the correct way > forward is. > > (I'm a newbie to this list and this code base, so please don't > hesitate to correct me on anything, whether that's code- or > culture-related!) It's a good idea to add patches to commitfest.postgresql.org -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] [PATCH] Assert that the correct locks are held when calling PageGetLSN()
Hello all, While working on checksum support for GPDB, we noticed that several callers of PageGetLSN() didn't follow the correct locking procedure. To try to help ferret out present and future mistakes, we added an assertion to PageGetLSN() that checks whether those locks were being held correctly. This patch is a first-draft attempt to port that assertion back up to postgres master, based on work by Asim Praveen, Ashwin Agrawal, and myself. The patch is really two pieces: add the assertion, and fix the callers that would trip it. The latter part is still in progress, because I'm running into some places where I'm not sure what the correct way forward is. (I'm a newbie to this list and this code base, so please don't hesitate to correct me on anything, whether that's code- or culture-related!) = Notes/Observations = - This change introduces a subtle source incompatibility: whereas previously both Pages and PageHeaders could be passed to PageGetLSN(), now callers must explicitly cast to Page if they don't already have one. - I originally tried to put (and the GPDB patch succeeds in putting) the entire assertion implementation into PageGetLSN in bufpage.h. That seems unworkable here -- buf_internals.h is no longer publicized, and the assertion needs those internals to perform its checks. So I moved the assertion into its own helper function in bufmgr.c. If assertions are disabled, that helper declaration gets turned into an empty macro, so there shouldn't be a performance hit. = Open Questions = - Is my use of FRONTEND here correct? The documentation made it seem like some compilers could try to link against the AssertPageIsLockedForLSN function, even if PageGetLSN was never called. - Use of PageGetLSN() in heapam.c seems to be incorrect, but I don't really know what the best way is to fix it. It explicitly unlocks the buffer, with the comment that visibilitymap_set() needs to handle locking itself, but then calls PageGetLSN() to determine whether visibilitymap_set() needs to be called. - TestForOldSnapshot is also problematic. The function is called in places where only a shared lock is held, so it needs to hold the header spinlock at least some of the time, but we only pass the Page as an argument. I could do the same "find buffer descriptor from page pointer" logic that we utilize in the assertion implementation, but is there a better way? - Should I try to add this assertion to PageSetLSN() as well? We were focused mostly on the GetLSN side of things, since that was the more risky piece of our backport. PageSetLSN is called from many more places, and the review there would be much larger. = Known Issues = - This approach doesn't seem to work when checksums are disabled. In that case, BufferGetLSNAtomic isn't actually atomic, so it seems to always trip the assertion. I'm not sure of the best way to proceed -- is it really correct not to follow the locking contract if checksums are disabled? What do you think? Is this something worth pursuing? Any and all comments welcome. Thanks! --Jacob PageGetLSN-assert-locks-held.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers