Re: [HACKERS] Reviewing freeze map code
On Thu, Aug 4, 2016 at 3:24 AM, Andres Freund wrote:
> Hi,
>
> On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
>> [Action required within 72 hours. This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
>> since you committed the patch believed to have created it, you own this open
>> item.
>
> Well kinda (it was a partial fix for something not originally by me),
> but I'll deal with it. Reading now, will commit tomorrow.

Thanks. I kept meaning to get to this one, and failing to do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Hi,

On 2016-08-02 10:55:18 -0400, Noah Misch wrote:
> [Action required within 72 hours. This is a generic notification.]
>
> The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
> since you committed the patch believed to have created it, you own this open
> item.

Well kinda (it was a partial fix for something not originally by me),
but I'll deal with it. Reading now, will commit tomorrow.

Regards,

Andres
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 02, 2016 at 02:10:29PM +0530, Amit Kapila wrote:
> On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote:
> > On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
> >> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
> >> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> >> >> Consider the below scenario.
> >> >>
> >> >> Vacuum
> >> >> a. acquires a cleanup lock for page - 10
> >> >> b. busy in checking visibility of tuples
> >> >> --assume, here it takes some time and in the meantime Session-1
> >> >> performs step (a) and (b) and starts waiting in step (c)
> >> >> c. marks the page as all-visible (PageSetAllVisible)
> >> >> d. unlocks and releases the buffer
> >> >>
> >> >> Session-1
> >> >> a. In heap_lock_tuple(), read the buffer for page-10
> >> >> b. check PageIsAllVisible(); the page is not all-visible, so don't
> >> >> acquire the visibilitymap_pin
> >> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> >> >> release the lock
> >> >> d. Got the lock, but now the page is marked as all-visible, so ideally
> >> >> we need to recheck the page and acquire the visibilitymap_pin
> >> >
> >> > So, I've tried pretty hard to reproduce that. While the theory above is
> >> > sound, I believe the relevant code-path is essentially dead for SQL
> >> > callable code, because we'll always hold a buffer pin before even
> >> > entering heap_update/heap_lock_tuple.
> >> >
> >>
> >> It is possible that we don't hold any buffer pin before entering
> >> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
> >> possible when it enters via the simple_heap_update() path. For
> >> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
> >> and maybe others as well.
> >
> > This is currently listed as a 9.6 open item. Is it indeed a regression in
> > 9.6, or do released versions have the same defect? If it is a 9.6
> > regression, do you happen to know which commit, or at least which feature,
> > caused it?
>
> Commit eca0f1db is the reason for this specific issue.

[Action required within 72 hours. This is a generic notification.]

The above-described topic is currently a PostgreSQL 9.6 open item. Andres,
since you committed the patch believed to have created it, you own this open
item. If some other commit is more relevant or if this does not belong as a
9.6 open item, please let us know. Otherwise, please observe the policy on
open item ownership[1] and send a status update within 72 hours of this
message. Include a date for your subsequent status update. Testers may
discover new open items at any time, and I want to plan to get them all
fixed in advance of shipping 9.6rc1 next week. Consequently, I will
appreciate your efforts toward speedy resolution. Thanks.

[1] http://www.postgresql.org/message-id/20160527025039.ga447...@tornado.leadboat.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Aug 2, 2016 at 11:19 AM, Noah Misch wrote:
> On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
>> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
>> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
>> >> Consider the below scenario.
>> >>
>> >> Vacuum
>> >> a. acquires a cleanup lock for page - 10
>> >> b. busy in checking visibility of tuples
>> >> --assume, here it takes some time and in the meantime Session-1
>> >> performs step (a) and (b) and starts waiting in step (c)
>> >> c. marks the page as all-visible (PageSetAllVisible)
>> >> d. unlocks and releases the buffer
>> >>
>> >> Session-1
>> >> a. In heap_lock_tuple(), read the buffer for page-10
>> >> b. check PageIsAllVisible(); the page is not all-visible, so don't
>> >> acquire the visibilitymap_pin
>> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
>> >> release the lock
>> >> d. Got the lock, but now the page is marked as all-visible, so ideally
>> >> we need to recheck the page and acquire the visibilitymap_pin
>> >
>> > So, I've tried pretty hard to reproduce that. While the theory above is
>> > sound, I believe the relevant code-path is essentially dead for SQL
>> > callable code, because we'll always hold a buffer pin before even
>> > entering heap_update/heap_lock_tuple.
>> >
>>
>> It is possible that we don't hold any buffer pin before entering
>> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
>> possible when it enters via the simple_heap_update() path. For
>> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
>> and maybe others as well.
>
> This is currently listed as a 9.6 open item. Is it indeed a regression in
> 9.6, or do released versions have the same defect? If it is a 9.6 regression,
> do you happen to know which commit, or at least which feature, caused it?
>

Commit eca0f1db is the reason for this specific issue.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 23, 2016 at 01:25:55PM +0530, Amit Kapila wrote:
> On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
> > On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> >> Consider the below scenario.
> >>
> >> Vacuum
> >> a. acquires a cleanup lock for page - 10
> >> b. busy in checking visibility of tuples
> >> --assume, here it takes some time and in the meantime Session-1
> >> performs step (a) and (b) and starts waiting in step (c)
> >> c. marks the page as all-visible (PageSetAllVisible)
> >> d. unlocks and releases the buffer
> >>
> >> Session-1
> >> a. In heap_lock_tuple(), read the buffer for page-10
> >> b. check PageIsAllVisible(); the page is not all-visible, so don't
> >> acquire the visibilitymap_pin
> >> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> >> release the lock
> >> d. Got the lock, but now the page is marked as all-visible, so ideally
> >> we need to recheck the page and acquire the visibilitymap_pin
> >
> > So, I've tried pretty hard to reproduce that. While the theory above is
> > sound, I believe the relevant code-path is essentially dead for SQL
> > callable code, because we'll always hold a buffer pin before even
> > entering heap_update/heap_lock_tuple.
> >
>
> It is possible that we don't hold any buffer pin before entering
> heap_update() and/or heap_lock_tuple(). For heap_update(), it is
> possible when it enters via the simple_heap_update() path. For
> heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
> and maybe others as well.

This is currently listed as a 9.6 open item. Is it indeed a regression in
9.6, or do released versions have the same defect? If it is a 9.6 regression,
do you happen to know which commit, or at least which feature, caused it?

Thanks,
nm
Re: [HACKERS] Reviewing freeze map code
On Wed, Jul 27, 2016 at 3:24 AM, Robert Haas wrote:
> On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila wrote:
>> Attached patch fixes the problem for me. Note, I have not tried to
>> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
>> if you are convinced with the above cases, then we should have a similar
>> check in it as well.
>
> I don't think this hunk is correct:
>
> +        /*
> +         * If we didn't pin the visibility map page and the page has become
> +         * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
> +         * for details.
> +         */
> +        if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
> +        {
> +            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
> +            visibilitymap_pin(rel, block, &vmbuffer);
> +            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +            goto l4;
> +        }
>
> The code beginning at label l4 assumes that the buffer is unlocked,
> but this code leaves the buffer locked. Also, I don't see the point
> of doing this test so far down in the function. Why not just recheck
> *immediately* after taking the buffer lock?

Right, in this case we can recheck immediately after taking the buffer lock;
updated patch attached.

In passing, I have noticed that heap_delete() doesn't do this unlocking,
pinning of the vm, and locking at the appropriate place. It just checks
immediately after taking the lock, whereas the code further down does unlock
and lock the buffer again. I think we should do it as in the attached patch
(pin_vm_heap_delete-v1.patch).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pin_vm_lock_tuple-v2.patch
Description: Binary data

pin_vm_heap_delete-v1.patch
Description: Binary data
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 23, 2016 at 3:55 AM, Amit Kapila wrote:
> Attached patch fixes the problem for me. Note, I have not tried to
> reproduce the problem for heap_lock_updated_tuple_rec(), but I think
> if you are convinced with the above cases, then we should have a similar
> check in it as well.

I don't think this hunk is correct:

+        /*
+         * If we didn't pin the visibility map page and the page has become
+         * all visible, we'll have to unlock and re-lock. See heap_lock_tuple
+         * for details.
+         */
+        if (vmbuffer == InvalidBuffer && PageIsAllVisible(BufferGetPage(buf)))
+        {
+            LockBuffer(buf, BUFFER_LOCK_UNLOCK);
+            visibilitymap_pin(rel, block, &vmbuffer);
+            LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
+            goto l4;
+        }

The code beginning at label l4 assumes that the buffer is unlocked, but this
code leaves the buffer locked. Also, I don't see the point of doing this test
so far down in the function. Why not just recheck *immediately* after taking
the buffer lock? If you find out that you need the pin after all, then

    LockBuffer(buf, BUFFER_LOCK_UNLOCK);
    visibilitymap_pin(rel, block, &vmbuffer);
    LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

but *do not* go back to l4. Unless I'm missing something, putting this block
further down, as you have it, buys nothing, because none of that intervening
code can release the buffer lock without using goto to jump back to l4.

+    /*
+     * If we didn't pin the visibility map page and the page has become all
+     * visible while we were busy locking the buffer, or during some
+     * subsequent window during which we had it unlocked, we'll have to unlock
+     * and re-lock, to avoid holding the buffer lock across an I/O. That's a
+     * bit unfortunate, especially since we'll now have to recheck whether the
+     * tuple has been locked or updated under us, but hopefully it won't
+     * happen very often.
+     */
+    if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))
+    {
+        LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
+        visibilitymap_pin(relation, block, &vmbuffer);
+        LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
+        goto l3;
+    }

In contrast, this looks correct: l3 expects the buffer to be locked already,
and the code between l3 and this point can unlock and re-lock the buffer,
potentially multiple times.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 18, 2016 at 2:03 PM, Andres Freund wrote:
> On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
>>
>> Consider the below scenario.
>>
>> Vacuum
>> a. acquires a cleanup lock for page - 10
>> b. busy in checking visibility of tuples
>> --assume, here it takes some time and in the meantime Session-1
>> performs step (a) and (b) and starts waiting in step (c)
>> c. marks the page as all-visible (PageSetAllVisible)
>> d. unlocks and releases the buffer
>>
>> Session-1
>> a. In heap_lock_tuple(), read the buffer for page-10
>> b. check PageIsAllVisible(); the page is not all-visible, so don't
>> acquire the visibilitymap_pin
>> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
>> release the lock
>> d. Got the lock, but now the page is marked as all-visible, so ideally
>> we need to recheck the page and acquire the visibilitymap_pin
>
> So, I've tried pretty hard to reproduce that. While the theory above is
> sound, I believe the relevant code-path is essentially dead for SQL
> callable code, because we'll always hold a buffer pin before even
> entering heap_update/heap_lock_tuple.
>

It is possible that we don't hold any buffer pin before entering
heap_update() and/or heap_lock_tuple(). For heap_update(), it is
possible when it enters via the simple_heap_update() path. For
heap_lock_tuple(), it is possible for an ON CONFLICT DO UPDATE statement
and maybe others as well.

Let me also try to explain with a test for both cases, if the above is not
clear enough.

Case-1 for heap_update()
---
Session-1
Create table t1(c1 int);
Alter table t1 alter column c1 set default 10;
--via debugger, stop at StoreAttrDefault()/heap_update; while you are in
heap_update(), note down the block number

Session-2
vacuum (DISABLE_PAGE_SKIPPING) pg_attribute;
--in lazy_scan_heap(), stop at line
(if (all_visible && !all_visible_according_to_vm)) for the block number
noted in Session-1

Session-1
In the debugger, proceed and let it wait at LockBuffer; note that it will
not pin the visibility map.

Session-2
Set the visibility flag and complete the operation.

Session-1
You will notice that it will attempt to unlock the buffer, pin the
visibility map, and lock the buffer again.

Case-2 for heap_lock_tuple()
---
Session-1
Create table i_conflict(c1 int, c2 char(100));
Create unique index idx_u on i_conflict(c1);
Insert into i_conflict values(1,'aaa');
Insert into i_conflict values(1,'aaa') On Conflict (c1) Do Update Set c2='bbb';
--via debugger, stop at line 385 in nodeModifyTable.c (in ExecInsert(), at
if (onconflict == ONCONFLICT_UPDATE))

Session-2
vacuum (DISABLE_PAGE_SKIPPING) i_conflict
--stop before setting the all-visible flag

Session-1
--in the debugger, proceed and let it wait at LockBuffer; note that it will
not pin the visibility map.

Session-2
--set the visibility flag and complete the operation.

Session-1
--PANIC: wrong buffer passed to visibilitymap_clear
--this is problematic.

Attached patch fixes the problem for me. Note, I have not tried to
reproduce the problem for heap_lock_updated_tuple_rec(), but I think
if you are convinced with the above cases, then we should have a similar
check in it as well.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

pin_vm_lock_tuple-v1.patch
Description: Binary data
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 16, 2016 at 10:08 AM, Andres Freund wrote:
> On 2016-07-14 20:53:07 -0700, Andres Freund wrote:
>> On 2016-07-13 23:06:07 -0700, Andres Freund wrote:
>> > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which
>> > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and
>> > standby / after crash recovery. I'm failing to see any harmful
>> > consequences right now, but differences between master and standby are
>> > a bad thing.
>>
>> I think it's actually critical, because HEAP_HOT_UPDATED /
>> HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like
>> heap_hot_search_buffer()).
>
> I've pushed a quite heavily revised version of the first patch to
> 9.1-master. I manually verified using pageinspect, gdb breakpoints and a
> standby that xmax, infomask etc are set correctly (leading to finding
> a4d357bf). As there's noticeable differences, especially 9.2->9.3,
> between versions, I'd welcome somebody having a look at the commits.

Waoh, man. Thanks! I was pinged just this week end about a setup that has
likely faced this exact problem, in the shape of "tuple concurrently
updated" with a node getting kill-9-ed by some framework because it did not
finish its shutdown checkpoint after some time in some test, which forced
it to do crash recovery. I have not been able to put my hands on the raw
data to have a look at the flags set within those tuples, but I got the
strong feeling that this is related. After a couple of rounds doing so, it
was possible to see "tuple concurrently updated" errors for a relation that
has few pages and a high update rate using 9.4.

More seriously, I have spent some time looking at what you have pushed on
each branch, and the fixes are looking correct to me.

--
Michael
Re: [HACKERS] Reviewing freeze map code
On 2016-07-18 01:33:10 -0700, Andres Freund wrote:
> On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> > On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund wrote:
> > > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
> > >> + /*
> > >> + * Before locking the buffer, pin the visibility map page if it may be
> > >> + * necessary.
> > >> + */
> > >>
> > >> + if (PageIsAllVisible(BufferGetPage(*buffer)))
> > >> + visibilitymap_pin(relation, block, &vmbuffer);
> > >> +
> > >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
> > >>
> > >> I think we need to check for PageIsAllVisible and try to pin the
> > >> visibility map after taking the lock on the buffer. I think it is quite
> > >> possible that in the time this routine tries to acquire the lock on
> > >> the buffer, the page becomes all visible.
> > >
> > > I don't see how. Without a cleanup lock it's not possible to mark a page
> > > all-visible/frozen.
> > >
> >
> > Consider the below scenario.
> >
> > Vacuum
> > a. acquires a cleanup lock for page - 10
> > b. busy in checking visibility of tuples
> > --assume, here it takes some time and in the meantime Session-1
> > performs step (a) and (b) and starts waiting in step (c)
> > c. marks the page as all-visible (PageSetAllVisible)
> > d. unlocks and releases the buffer
> >
> > Session-1
> > a. In heap_lock_tuple(), read the buffer for page-10
> > b. check PageIsAllVisible(); the page is not all-visible, so don't
> > acquire the visibilitymap_pin
> > c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> > release the lock
> > d. Got the lock, but now the page is marked as all-visible, so ideally
> > we need to recheck the page and acquire the visibilitymap_pin
>
> So, I've tried pretty hard to reproduce that. While the theory above is
> sound, I believe the relevant code-path is essentially dead for SQL
> callable code, because we'll always hold a buffer pin before even
> entering heap_update/heap_lock_tuple. It's possible that you could
> concoct a dangerous scenario with follow_updates though; but I can't
> immediately see how. Due to that, and based on the closing-in beta
> release, I'm planning to push a version of the patch that has the
> returns fixed; but not this. It seems better to have the majority of
> the fix in.

Pushed that way. Let's try to figure out a good solution to a) test this
case b) how to fix it in a reasonable way.

Note that there's also
http://archives.postgresql.org/message-id/20160718071729.tlj4upxhaylwv75n%40alap3.anarazel.de
which seems related.

Regards,

Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-18 10:02:52 +0530, Amit Kapila wrote:
> On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund wrote:
> > On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
> >> + /*
> >> + * Before locking the buffer, pin the visibility map page if it may be
> >> + * necessary.
> >> + */
> >>
> >> + if (PageIsAllVisible(BufferGetPage(*buffer)))
> >> + visibilitymap_pin(relation, block, &vmbuffer);
> >> +
> >> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
> >>
> >> I think we need to check for PageIsAllVisible and try to pin the
> >> visibility map after taking the lock on the buffer. I think it is quite
> >> possible that in the time this routine tries to acquire the lock on
> >> the buffer, the page becomes all visible.
> >
> > I don't see how. Without a cleanup lock it's not possible to mark a page
> > all-visible/frozen.
> >
>
> Consider the below scenario.
>
> Vacuum
> a. acquires a cleanup lock for page - 10
> b. busy in checking visibility of tuples
> --assume, here it takes some time and in the meantime Session-1
> performs step (a) and (b) and starts waiting in step (c)
> c. marks the page as all-visible (PageSetAllVisible)
> d. unlocks and releases the buffer
>
> Session-1
> a. In heap_lock_tuple(), read the buffer for page-10
> b. check PageIsAllVisible(); the page is not all-visible, so don't
> acquire the visibilitymap_pin
> c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
> release the lock
> d. Got the lock, but now the page is marked as all-visible, so ideally
> we need to recheck the page and acquire the visibilitymap_pin

So, I've tried pretty hard to reproduce that. While the theory above is
sound, I believe the relevant code-path is essentially dead for SQL
callable code, because we'll always hold a buffer pin before even
entering heap_update/heap_lock_tuple. It's possible that you could
concoct a dangerous scenario with follow_updates though; but I can't
immediately see how. Due to that, and based on the closing-in beta
release, I'm planning to push a version of the patch that has the
returns fixed; but not this. It seems better to have the majority of
the fix in.

Andres
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 18, 2016 at 9:13 AM, Andres Freund wrote:
> On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
>> + /*
>> + * Before locking the buffer, pin the visibility map page if it may be
>> + * necessary.
>> + */
>>
>> + if (PageIsAllVisible(BufferGetPage(*buffer)))
>> + visibilitymap_pin(relation, block, &vmbuffer);
>> +
>> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>>
>> I think we need to check for PageIsAllVisible and try to pin the
>> visibility map after taking the lock on the buffer. I think it is quite
>> possible that in the time this routine tries to acquire the lock on
>> the buffer, the page becomes all visible.
>
> I don't see how. Without a cleanup lock it's not possible to mark a page
> all-visible/frozen.
>

Consider the below scenario.

Vacuum
a. acquires a cleanup lock for page - 10
b. busy in checking visibility of tuples
--assume, here it takes some time and in the meantime Session-1
performs step (a) and (b) and starts waiting in step (c)
c. marks the page as all-visible (PageSetAllVisible)
d. unlocks and releases the buffer

Session-1
a. In heap_lock_tuple(), read the buffer for page-10
b. check PageIsAllVisible(); the page is not all-visible, so don't
acquire the visibilitymap_pin
c. LockBuffer in ExclusiveMode - here it will wait for vacuum to
release the lock
d. Got the lock, but now the page is marked as all-visible, so ideally
we need to recheck the page and acquire the visibilitymap_pin

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-07-18 09:07:19 +0530, Amit Kapila wrote:
> + /*
> + * Before locking the buffer, pin the visibility map page if it may be
> + * necessary.
> + */
>
> + if (PageIsAllVisible(BufferGetPage(*buffer)))
> + visibilitymap_pin(relation, block, &vmbuffer);
> +
> LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
>
> I think we need to check for PageIsAllVisible and try to pin the
> visibility map after taking the lock on buffer. I think it is quite
> possible that in the time this routine tries to acquire lock on
> buffer, the page becomes all visible.

I don't see how. Without a cleanup lock it's not possible to mark a page
all-visible/frozen. We might miss the bit becoming unset concurrently,
but that's ok.

Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-17 23:34:01 -0400, Robert Haas wrote:
> Thanks very much for working on this. Random suggestions after a quick look:
>
> + * Before locking the buffer, pin the visibility map page if it may be
> + * necessary.
>
> s/necessary/needed/
>
> More substantively, what happens if the situation changes before we
> obtain the buffer lock? I think you need to release the page lock,
> pin the page after all, and then relock the page.

It shouldn't be able to. Cleanup locks, which are required for vacuumlazy
to do anything relevant, aren't possible with the buffer pinned. This
pattern is used in heap_delete/heap_update, so I think we're on a
reasonably well trodden path.

> There seem to be several ways to escape from this function without
> releasing the pin on vmbuffer. From the visibilitymap_pin call here,
> search downward for "return".

Hm, that's clearly not good. The best thing to address that seems to be to
create a separate jump label, which checks vmbuffer and releases the page
lock. Unless you have a better idea.

> + * visibilitymap_clear - clear bit(s) for one page in visibility map
>
> I don't really like the parenthesized-s convention as a shorthand for
> "one or more". It tends to confuse non-native English speakers.
>
> + * any I/O. Returns whether any bits have been cleared.
>
> I suggest "Returns true if any bits have been cleared and false otherwise".

Will change.

- Andres
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 18, 2016 at 8:18 AM, Andres Freund wrote:
> On 2016-07-16 10:45:26 -0700, Andres Freund wrote:
>>
>> On July 16, 2016 8:49:06 AM PDT, Tom Lane wrote:
>> >Amit Kapila writes:
>> >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote:
>> >>> I think we have two choices how to deal with that: First, we can add a
>> >>> new flags variable to xl_heap_lock similar to
>> >>> xl_heap_insert/update/... and bump page magic,
>> >
>> >> +1 for going in this way. This will keep us consistent with how we clear
>> >> the visibility info in other places like heap_xlog_update().
>> >
>> >Yeah. We've already forced a catversion bump for beta3, and I'm about
>> >to go fix PG_CONTROL_VERSION as well, so there's basically no downside
>> >to doing an xlog version bump as well. At least, not if you can get it
>> >in before Monday.
>>
>> OK, Cool. Will do it later today.
>
> Took till today. Attached is a rather heavily revised version of
> Sawada-san's patch. Most notably the recovery routines take care to
> reset the vm in all cases, we don't perform visibilitymap_get_status
> from inside a critical section anymore, and
> heap_lock_updated_tuple_rec() also resets the vm (although I'm not
> entirely sure that can practically be hit).
>

@@ -4563,8 +4579,18 @@ heap_lock_tuple(Relation relation, HeapTuple tuple,

+    /*
+     * Before locking the buffer, pin the visibility map page if it may be
+     * necessary.
+     */
+    if (PageIsAllVisible(BufferGetPage(*buffer)))
+        visibilitymap_pin(relation, block, &vmbuffer);
+
     LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);

I think we need to check for PageIsAllVisible and try to pin the visibility
map after taking the lock on the buffer. I think it is quite possible that
in the time this routine tries to acquire the lock on the buffer, the page
becomes all visible. To avoid the similar hazard, we do check the visibility
of the page after acquiring the buffer lock in heap_update(), at the below
place:

    if (vmbuffer == InvalidBuffer && PageIsAllVisible(page))

Similarly, I think heap_lock_updated_tuple_rec() needs to take care of the
same. While I was typing this e-mail, it seems Robert has already pointed
out the same issue.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Sun, Jul 17, 2016 at 10:48 PM, Andres Freund wrote:
> Took till today. Attached is a rather heavily revised version of
> Sawada-san's patch. Most notably the recovery routines take care to
> reset the vm in all cases, we don't perform visibilitymap_get_status
> from inside a critical section anymore, and
> heap_lock_updated_tuple_rec() also resets the vm (although I'm not
> entirely sure that can practically be hit).
>
> I'm doing some more testing, and Robert said he could take a quick look
> at the patch. If somebody else... Will push sometime after dinner.

Thanks very much for working on this. Random suggestions after a quick look:

+ * Before locking the buffer, pin the visibility map page if it may be
+ * necessary.

s/necessary/needed/

More substantively, what happens if the situation changes before we
obtain the buffer lock? I think you need to release the page lock,
pin the page after all, and then relock the page.

There seem to be several ways to escape from this function without
releasing the pin on vmbuffer. From the visibilitymap_pin call here,
search downward for "return".

+ * visibilitymap_clear - clear bit(s) for one page in visibility map

I don't really like the parenthesized-s convention as a shorthand for
"one or more". It tends to confuse non-native English speakers.

+ * any I/O. Returns whether any bits have been cleared.

I suggest "Returns true if any bits have been cleared and false otherwise".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-07-16 10:45:26 -0700, Andres Freund wrote:
>
> On July 16, 2016 8:49:06 AM PDT, Tom Lane wrote:
> >Amit Kapila writes:
> >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote:
> >>> I think we have two choices how to deal with that: First, we can add a
> >>> new flags variable to xl_heap_lock similar to
> >>> xl_heap_insert/update/... and bump page magic,
> >
> >> +1 for going in this way. This will keep us consistent with how we clear
> >> the visibility info in other places like heap_xlog_update().
> >
> >Yeah. We've already forced a catversion bump for beta3, and I'm about
> >to go fix PG_CONTROL_VERSION as well, so there's basically no downside
> >to doing an xlog version bump as well. At least, not if you can get it
> >in before Monday.
>
> OK, Cool. Will do it later today.

Took till today. Attached is a rather heavily revised version of
Sawada-san's patch. Most notably the recovery routines take care to
reset the vm in all cases, we don't perform visibilitymap_get_status
from inside a critical section anymore, and
heap_lock_updated_tuple_rec() also resets the vm (although I'm not
entirely sure that can practically be hit).

I'm doing some more testing, and Robert said he could take a quick look
at the patch. If somebody else... Will push sometime after dinner.

Regards,

Andres

>From 26f6eff8cef9b436e328a7364d6e4954b702208b Mon Sep 17 00:00:00 2001
From: Andres Freund
Date: Sun, 17 Jul 2016 19:30:38 -0700
Subject: [PATCH] Clear all-frozen visibilitymap status when locking tuples.

Since a892234 & fd31cd265 the visibilitymap's freeze bit is used to
avoid vacuuming the whole relation in anti-wraparound vacuums. Doing so
correctly relies on not adding xids to the heap without also unsetting
the visibilitymap flag. Tuple locking related code has not done so.

To allow selectively resetting all-frozen - to avoid pessimizing
heap_lock_tuple - allow to selectively reset the all-frozen with
visibilitymap_clear().
To avoid having to use visibilitymap_get_status (e.g. via VM_ALL_FROZEN) inside a critical section, have visibilitymap_clear() return whether any bits have been reset. The added flags field fields to xl_heap_lock and xl_heap_lock_updated require bumping the WAL magic. Since there's already been a catversion bump since the last beta, that's not an issue. Author: Masahiko Sawada, heavily revised by Andres Freund Discussion: CAEepm=3fwabwryvw9swhylty4sxvf0xblvxqowuodincx9m...@mail.gmail.com Backpatch: - --- src/backend/access/heap/heapam.c| 126 +--- src/backend/access/heap/visibilitymap.c | 18 +++-- src/backend/access/rmgrdesc/heapdesc.c | 6 +- src/backend/commands/vacuumlazy.c | 6 +- src/include/access/heapam_xlog.h| 9 ++- src/include/access/visibilitymap.h | 4 +- src/include/access/xlog_internal.h | 2 +- 7 files changed, 145 insertions(+), 26 deletions(-) diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 2815d91..1216f3f 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -2423,7 +2423,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid, PageClearAllVisible(BufferGetPage(buffer)); visibilitymap_clear(relation, ItemPointerGetBlockNumber(&(heaptup->t_self)), - vmbuffer); + vmbuffer, VISIBILITYMAP_VALID_BITS); } /* @@ -2737,7 +2737,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples, PageClearAllVisible(page); visibilitymap_clear(relation, BufferGetBlockNumber(buffer), -vmbuffer); +vmbuffer, VISIBILITYMAP_VALID_BITS); } /* @@ -3239,7 +3239,7 @@ l1: all_visible_cleared = true; PageClearAllVisible(page); visibilitymap_clear(relation, BufferGetBlockNumber(buffer), - vmbuffer); + vmbuffer, VISIBILITYMAP_VALID_BITS); } /* store transaction information of xact deleting the tuple */ @@ -3925,6 +3925,7 @@ l2: TransactionId xmax_lock_old_tuple; uint16 infomask_lock_old_tuple, infomask2_lock_old_tuple; + bool cleared_all_frozen = false; /* * To prevent concurrent sessions 
from updating the tuple, we have to @@ -3968,6 +3969,17 @@ l2: /* temporarily make it look not-updated, but locked */ oldtup.t_data->t_ctid = oldtup.t_self; + /* + * Clear all-frozen bit on visibility map if needed. We could + * immediately reset ALL_VISIBLE, but given that the WAL logging + * overhead would be unchanged, that doesn't seem necessarily + * worthwhile. + */ + if (PageIsAllVisible(BufferGetPage(buffer)) && + visibilitymap_clear(relation, block, vmbuffer, +VISIBILITYMAP_ALL_FROZEN)) + cleared_all_frozen = true; + MarkBufferDirty(buffer); if (RelationNeedsWAL(relation)) @@ -3982,6 +3994,8 @@ l2: xlrec.locking_xid = xmax_lock_old_tuple; xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, oldtup.t_data->t_infomask2); + xl
Re: [HACKERS] Reviewing freeze map code
On July 16, 2016 8:49:06 AM PDT, Tom Lane wrote: >Amit Kapila writes: >> On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund >wrote: >>> I think we have two choices how to deal with that: First, we can add >a >>> new flags variable to xl_heap_lock similar to >>> xl_heap_insert/update/... and bump page magic, > >> +1 for going in this way. This will keep us consistent with how >clear >> the visibility info in other places like heap_xlog_update(). > >Yeah. We've already forced a catversion bump for beta3, and I'm about >to go fix PG_CONTROL_VERSION as well, so there's basically no downside >to doing an xlog version bump as well. At least, not if you can get it >in before Monday. OK, Cool. Will do it later today. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
Re: [HACKERS] Reviewing freeze map code
Amit Kapila writes: > On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote: >> I think we have two choices how to deal with that: First, we can add a >> new flags variable to xl_heap_lock similar to >> xl_heap_insert/update/... and bump page magic, > +1 for going in this way. This will keep us consistent with how clear > the visibility info in other places like heap_xlog_update(). Yeah. We've already forced a catversion bump for beta3, and I'm about to go fix PG_CONTROL_VERSION as well, so there's basically no downside to doing an xlog version bump as well. At least, not if you can get it in before Monday. regards, tom lane
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 16, 2016 at 7:02 AM, Andres Freund wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: >> > + /* Clear only the all-frozen bit on visibility map if needed */ >> > + if (PageIsAllVisible(BufferGetPage(buffer)) && >> > + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> > + { >> > + visibilitymap_clear_extended(relation, block, vmbuffer, >> > + >> > VISIBILITYMAP_ALL_FROZEN); >> > + } >> > + >> >> FWIW, I don't think it's worth introducing visibilitymap_clear_extended. >> As this is a 9.6 only patch, i think it's better to change >> visibilitymap_clear's API. > > Besides that easily fixed issue, the code also has the significant issue > that it's only performing the visibilitymap processing in the > BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED > and the BLK_DONE cases the visibilitymap isn't guaranteed (or even > likely in the former case) to have been updated. > > I think we have two choices how to deal with that: First, we can add a > new flags variable to xl_heap_lock similar to > xl_heap_insert/update/... and bump page magic, > +1 for going in this way. This will keep us consistent with how we clear the visibility info in other places like heap_xlog_update(). -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > + /* Clear only the all-frozen bit on visibility map if needed */ > > + if (PageIsAllVisible(BufferGetPage(buffer)) && > > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > > + { > > + visibilitymap_clear_extended(relation, block, vmbuffer, > > + > > VISIBILITYMAP_ALL_FROZEN); > > + } > > + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. Besides that easily fixed issue, the code also has the significant issue that it's only performing the visibilitymap processing in the BLK_NEEDS_REDO case. But that's not ok, because both in the BLK_RESTORED and the BLK_DONE cases the visibilitymap isn't guaranteed (or even likely in the former case) to have been updated. I think we have two choices how to deal with that: First, we can add a new flags variable to xl_heap_lock similar to xl_heap_insert/update/... and bump page magic, or we can squeeze the information into infobits_set. The latter seems fairly ugly, and fragile to me; so unless somebody protests I'm going with the former. I think due to padding the additional byte doesn't make any size difference anyway. Regards, Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-14 20:53:07 -0700, Andres Freund wrote: > On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > > standby / after crash recovery. I'm failing to see any harmful > > consequences right now, but differences between master and standby are a bad > > thing. > > I think it's actually critical, because HEAP_HOT_UPDATED / > HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like > heap_hot_search_buffer()). I've pushed a quite heavily revised version of the first patch to 9.1-master. I manually verified using pageinspect, gdb breakpoints and a standby that xmax, infomask etc are set correctly (leading to finding a4d357bf). As there are noticeable differences between versions, especially 9.2->9.3, I'd welcome somebody having a look at the commits. Regards, Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-13 23:06:07 -0700, Andres Freund wrote: > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. I think it's actually critical, because HEAP_HOT_UPDATED / HEAP_XMAX_LOCK_ONLY are used to terminate ctid chasing loops (like heap_hot_search_buffer()). Andres
Re: [HACKERS] Reviewing freeze map code
On 2016-07-14 18:12:42 +0530, Amit Kapila wrote: > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, > then won't it impact the return value of > HeapTupleHeaderIsOnlyLocked(). It will start returning true whereas > otherwise I think it would have returned false due to in_progress > transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many > places, it might impact those cases, I have not checked in deep > whether such an impact would cause any real issue, but it seems to me > that some analysis is needed there unless you think we are safe with > respect to that. I don't think that's an issue: Right now the row will be considered deleted in that moment, with the change it's considered locked. the latter is surely more appropriate. > > Any arguments against? > > > >> > >> + /* Clear only the all-frozen bit on visibility map if needed > >> */ > >> + if (PageIsAllVisible(BufferGetPage(buffer)) && > >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) > >> + { > >> + visibilitymap_clear_extended(relation, block, > >> vmbuffer, > >> + > >> VISIBILITYMAP_ALL_FROZEN); > >> + } > >> + > > > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > > As this is a 9.6 only patch, i think it's better to change > > visibilitymap_clear's API. > > > > Unless somebody protests I'm planning to commit with those adjustments > > tomorrow. > > > > Do you think performance tests done by Sawada-san are sufficient to > proceed here? I'm doing some more, but generally yes. I also don't think we have much of a choice anyway. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 14, 2016 at 11:36 AM, Andres Freund wrote: > Hi, > > Master does > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > here, and as is the wal record won't reflect that, because: > static void > heap_xlog_lock(XLogReaderState *record) > { > ... > /* > * Clear relevant update flags, but only if the modified > infomask says > * there's no update. > */ > if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) > { > HeapTupleHeaderClearHotUpdated(htup); > /* Make sure there is no forward chain link in t_ctid > */ > ItemPointerSet(&htup->t_ctid, > > BufferGetBlockNumber(buffer), >offnum); > } > won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which > will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and > standby / after crash recovery. I'm failing to see any harmful > consequences right now, but differences between master and standby are a bad > thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED > unconditionally there. I think I'm more comfortable with setting > HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also > coincides more closely with the actual meaning. > Just thinking out loud. If we set HEAP_XMAX_LOCK_ONLY during update, then won't it impact the return value of HeapTupleHeaderIsOnlyLocked(). It will start returning true whereas otherwise I think it would have returned false due to in_progress transaction. As HeapTupleHeaderIsOnlyLocked() is being used at many places, it might impact those cases, I have not checked in deep whether such an impact would cause any real issue, but it seems to me that some analysis is needed there unless you think we are safe with respect to that. > Any arguments against? 
> >> >> + /* Clear only the all-frozen bit on visibility map if needed */ >> + if (PageIsAllVisible(BufferGetPage(buffer)) && >> + VM_ALL_FROZEN(relation, block, &vmbuffer)) >> + { >> + visibilitymap_clear_extended(relation, block, vmbuffer, >> + >> VISIBILITYMAP_ALL_FROZEN); >> + } >> + > > FWIW, I don't think it's worth introducing visibilitymap_clear_extended. > As this is a 9.6 only patch, i think it's better to change > visibilitymap_clear's API. > > Unless somebody protests I'm planning to commit with those adjustments > tomorrow. > Do you think performance tests done by Sawada-san are sufficient to proceed here? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Hi, So I'm generally happy with 0001, baring some relatively minor adjustments. I am however wondering about one thing: On 2016-07-11 23:51:05 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/heapam.c > index 57da57a..e7cb8ca 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,16 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * For crash safety, we need to emit that xmax of old tuple is > set > + * and clear only the all-frozen bit on visibility map if needed > + * before releasing the buffer. We can reuse xl_heap_lock for > this > + * purpose. It should be fine even if we crash midway from this > + * section and the actual updating one later, since the xmax > will > + * appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3946,28 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + XLogBeginInsert(); > + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); > + > + xlrec.offnum = > ItemPointerGetOffsetNumber(&oldtup.t_self); > + xlrec.locking_xid = xmax_old_tuple; > + xlrec.infobits_set = > compute_infobits(oldtup.t_data->t_infomask, > + > oldtup.t_data->t_infomask2); > + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); > + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); > + PageSetLSN(page, recptr); > + } Master does /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; here, and as is the wal record won't reflect that, because: static void heap_xlog_lock(XLogReaderState *record) { ... 
/* * Clear relevant update flags, but only if the modified infomask says * there's no update. */ if (HEAP_XMAX_IS_LOCKED_ONLY(htup->t_infomask)) { HeapTupleHeaderClearHotUpdated(htup); /* Make sure there is no forward chain link in t_ctid */ ItemPointerSet(&htup->t_ctid, BufferGetBlockNumber(buffer), offnum); } won't enter the branch, because HEAP_XMAX_LOCK_ONLY won't be set. Which will leave t_ctid and HEAP_HOT_UPDATED set differently on the master and standby / after crash recovery. I'm failing to see any harmful consequences right now, but differences between master and standby are a bad thing. Pre 9.3 that's not a problem, we reset ctid and HOT_UPDATED unconditionally there. I think I'm more comfortable with setting HEAP_XMAX_LOCK_ONLY until the tuple is finally updated - that also coincides more closely with the actual meaning. Any arguments against? > > + /* Clear only the all-frozen bit on visibility map if needed */ > + if (PageIsAllVisible(BufferGetPage(buffer)) && > + VM_ALL_FROZEN(relation, block, &vmbuffer)) > + { > + visibilitymap_clear_extended(relation, block, vmbuffer, > + > VISIBILITYMAP_ALL_FROZEN); > + } > + FWIW, I don't think it's worth introducing visibilitymap_clear_extended. As this is a 9.6 only patch, i think it's better to change visibilitymap_clear's API. Unless somebody protests I'm planning to commit with those adjustments tomorrow. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 8, 2016 at 10:24 PM, Amit Kapila wrote: > On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada > wrote: >> Than you for reviewing! >> >> On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund wrote: >>> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 57da57a..fd66527 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3923,6 +3923,17 @@ l2: if (need_toast || newtupsize > pagefree) { + /* + * To prevent data corruption due to updating old tuple by + * other backends after released buffer >>> >>> That's not really the reason, is it? The prime problem is crash safety / >>> replication. The row-lock we're faking (by setting xmax to our xid), >>> prevents concurrent updates until this backend died. >> >> Fixed. >> , we need to emit that + * xmax of old tuple is set and clear visibility map bits if + * needed before releasing buffer. We can reuse xl_heap_lock + * for this purpose. It should be fine even if we crash midway + * from this section and the actual updating one later, since + * the xmax will appear to come from an aborted xid. + */ + START_CRIT_SECTION(); + /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; @@ -3936,6 +3947,46 @@ l2: /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; already_marked = true; + + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), + vmbuffer); + } + + MarkBufferDirty(buffer); + + if (RelationNeedsWAL(relation)) + { + xl_heap_lock xlrec; + XLogRecPtr recptr; + + /* + * For logical decoding we need combocids to properly decode the + * catalog. 
+ */ + if (RelationIsAccessibleInLogicalDecoding(relation)) + log_heap_new_cid(relation, &oldtup); >>> >>> Hm, I don't see that being necessary here. Row locks aren't logically >>> decoded, so there's no need to emit this here. >> >> Fixed. >> >>> + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(page)) + { + Buffer vmbuffer = InvalidBuffer; + BlockNumber block = BufferGetBlockNumber(*buffer); + + all_visible_cleared = true; + PageClearAllVisible(page); + visibilitymap_pin(relation, block, &vmbuffer); + visibilitymap_clear(relation, block, vmbuffer); + } + >>> >>> That clears all-visible unnecessarily, we only need to clear all-frozen. >>> >> >> Fixed. >> >>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) } HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); HeapTupleHeaderSetCmax(htup, FirstCommandId, false); + + /* The visibility map need to be cleared */ + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) + { + RelFileNode rnode; + Buffer vmbuffer = InvalidBuffer; + BlockNumber blkno; + Relationreln; + + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); + reln = CreateFakeRelcacheEntry(rnode); + + visibilitymap_pin(reln, blkno, &vmbuffer); + visibilitymap_clear(reln, blkno, vmbuffer); + PageClearAllVisible(page); + } + >>> >>> PageSetLSN(page, lsn); MarkBufferDirty(buffer); } diff --git a/src/include/access/heapam_xlog.h b/src/include/access/heapam_xlog.h index a822d0b..41b3c54 100644 --- a/src/include/access/heapam_xlog.h +++ b/src/include/access/heapam_xlog.h @@ -242,6 +242,7 @
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 12:34 PM, Masahiko Sawada wrote: > Than you for reviewing! > > On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund wrote: >> On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >>> diff --git a/src/backend/access/heap/heapam.c >>> b/src/backend/access/heap/heapam.c >>> index 57da57a..fd66527 100644 >>> --- a/src/backend/access/heap/heapam.c >>> +++ b/src/backend/access/heap/heapam.c >>> @@ -3923,6 +3923,17 @@ l2: >>> >>> if (need_toast || newtupsize > pagefree) >>> { >>> + /* >>> + * To prevent data corruption due to updating old tuple by >>> + * other backends after released buffer >> >> That's not really the reason, is it? The prime problem is crash safety / >> replication. The row-lock we're faking (by setting xmax to our xid), >> prevents concurrent updates until this backend died. > > Fixed. > >>> , we need to emit that >>> + * xmax of old tuple is set and clear visibility map bits if >>> + * needed before releasing buffer. We can reuse xl_heap_lock >>> + * for this purpose. It should be fine even if we crash midway >>> + * from this section and the actual updating one later, since >>> + * the xmax will appear to come from an aborted xid. >>> + */ >>> + START_CRIT_SECTION(); >>> + >>> /* Clear obsolete visibility flags ... 
*/ >>> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >>> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >>> @@ -3936,6 +3947,46 @@ l2: >>> /* temporarily make it look not-updated */ >>> oldtup.t_data->t_ctid = oldtup.t_self; >>> already_marked = true; >>> + >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(BufferGetPage(buffer))) >>> + { >>> + all_visible_cleared = true; >>> + PageClearAllVisible(BufferGetPage(buffer)); >>> + visibilitymap_clear(relation, >>> BufferGetBlockNumber(buffer), >>> + vmbuffer); >>> + } >>> + >>> + MarkBufferDirty(buffer); >>> + >>> + if (RelationNeedsWAL(relation)) >>> + { >>> + xl_heap_lock xlrec; >>> + XLogRecPtr recptr; >>> + >>> + /* >>> + * For logical decoding we need combocids to properly >>> decode the >>> + * catalog. >>> + */ >>> + if (RelationIsAccessibleInLogicalDecoding(relation)) >>> + log_heap_new_cid(relation, &oldtup); >> >> Hm, I don't see that being necessary here. Row locks aren't logically >> decoded, so there's no need to emit this here. > > Fixed. > >> >>> + /* Clear PD_ALL_VISIBLE flags */ >>> + if (PageIsAllVisible(page)) >>> + { >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber block = BufferGetBlockNumber(*buffer); >>> + >>> + all_visible_cleared = true; >>> + PageClearAllVisible(page); >>> + visibilitymap_pin(relation, block, &vmbuffer); >>> + visibilitymap_clear(relation, block, vmbuffer); >>> + } >>> + >> >> That clears all-visible unnecessarily, we only need to clear all-frozen. >> > > Fixed. 
> >> >>> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >>> } >>> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >>> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >>> + >>> + /* The visibility map need to be cleared */ >>> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >>> + { >>> + RelFileNode rnode; >>> + Buffer vmbuffer = InvalidBuffer; >>> + BlockNumber blkno; >>> + Relationreln; >>> + >>> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >>> + reln = CreateFakeRelcacheEntry(rnode); >>> + >>> + visibilitymap_pin(reln, blkno, &vmbuffer); >>> + visibilitymap_clear(reln, blkno, vmbuffer); >>> + PageClearAllVisible(page); >>> + } >>> + >> >> >>> PageSetLSN(page, lsn); >>> MarkBufferDirty(buffer); >>> } >>> diff --git a/src/include/access/heapam_xlog.h >>> b/src/include/access/heapam_xlog.h >>> index a822d0b..41b3c54 100644 >>> --- a/src/include/access/heapam_xlog.h >>> +++ b/src/include/access/heapam_xlog.h >>> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >>> #define XLHL_XMAX_EXCL_LOCK 0x04 >>> #define XLHL_XMAX_KEYSHR_LOCK0x08 >>> #define XLHL_KEYS_UPDATED0x10 >>> +#d
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 2:04 PM, Andres Freund wrote: >> Yeah, that's true, but I'm having a bit of trouble imagining exactly how >> we end up with corruption that actually matters. I guess a torn page >> could do it. > > I think Noah pointed out a bad scenario: If we crash after putting the > xid in the page header, but before WAL logging, the xid could get reused > after the crash. By a different transaction. And suddenly the row isn't > visible anymore, after the reused xid commits... Oh, wow. Yikes. OK, so I guess we should try to back-patch the fix, then. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-07-07 14:01:05 -0400, Robert Haas wrote: > On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund wrote: > > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > >> > Hm. We can't easily do that in the back-patched version; because a > >> > standby won't know to check for the flag. That's kinda ok, since we > >> > don't yet need to clear all-visible yet at that point of > >> > heap_update. But that better means we don't do so on the master either. > >> > >> Is there any reason to back-patch this in the first place? > > > > It seems not unlikely that this has caused corruption in the past; and > > that we chalked it up to hardware corruption or something. Both toasting > > and file extension frequently take extended amounts of time under load, > > the window for crashing in the wrong moment isn't small... > > Yeah, that's true, but I'm having a bit of trouble imagining exactly how we > end up with corruption that actually matters. I guess a torn page > could do it. I think Noah pointed out a bad scenario: If we crash after putting the xid in the page header, but before WAL logging, the xid could get reused after the crash. By a different transaction. And suddenly the row isn't visible anymore, after the reused xid commits...
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 1:58 PM, Andres Freund wrote: > On 2016-07-07 10:37:15 -0400, Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag. That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > It seems not unlikely that this has caused corruption in the past; and > that we chalked it up to hardware corruption or something. Both toasting > and file extension frequently take extended amounts of time under load, > the window for crashing in the wrong moment isn't small... Yeah, that's true, but I'm having a bit of trouble imagining exactly how we end up with corruption that actually matters. I guess a torn page could do it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-07-07 10:37:15 -0400, Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag. That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? It seems not unlikely that this has caused corruption in the past; and that we chalked it up to hardware corruption or something. Both toasting and file extension frequently take extended amounts of time under load, the window for crashing in the wrong moment isn't small... Andres
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 10:53 AM, Alvaro Herrera wrote: > Robert Haas wrote: >> On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: >> > Hm. We can't easily do that in the back-patched version; because a >> > standby won't know to check for the flag . That's kinda ok, since we >> > don't yet need to clear all-visible yet at that point of >> > heap_update. But that better means we don't do so on the master either. >> >> Is there any reason to back-patch this in the first place? > > Wasn't this determined to be a pre-existing bug? I think the > probability of occurrence has increased, but it's still possible in > earlier releases. I wonder if there are unexplained bugs that can be > traced down to this. > > I'm not really following this (sorry about that) but I wonder if (in > back branches) the failure to propagate in case the standby wasn't > updated can cause actual problems. If it does, maybe it'd be a better > idea to have a new WAL record type instead of piggybacking on lock > tuple. Then again, apparently the probability of this bug is low enough > that we shouldn't sweat over it ... Moreso considering Robert's apparent > opinion that perhaps we shouldn't backpatch at all in the first place. > > In any case I would like to see much more commentary in the patch next > to the new XLHL flag. It's sufficiently different than the rest than it > deserves so, IMO. There are two issues being discussed on this thread. One of them is a new issue in 9.6: heap_lock_tuple needs to clear the all-frozen bit in the freeze map even though it does not clear all-visible. The one that's actually a preexisting bug is that we can start to update a tuple without WAL-logging anything and then release the page lock in order to go perform TOAST insertions. At this point, other backends (on the master) will see this tuple as in the process of being updated because xmax has been set and ctid has been made to point back to the same tuple. 
I'm guessing that if the UPDATE goes on to complete, any discrepancy between the master and the standby is erased by the replay of the WAL record covering the update itself. I haven't checked that, but it seems like that WAL record must set both xmax and ctid appropriately or we'd be in big trouble. The infomask bits are in play too, but presumably the update's WAL is going to set those correctly also. So in this case I don't think there's really any issue for the standby. Or for the master, either: it may technically be true the tuple is not all-visible any more, but the only backend that could potentially fail to see it is the one performing the update, and no user code can run in the middle of toast_insert_or_update, so I think we're OK. On the other hand, if the UPDATE aborts, there's now a persistent difference between the master and standby: the infomask, xmax, and ctid of the tuple may differ. I don't know whether that could cause any problem. It's probably a very rare case, because there aren't all that many things that will cause us to abort in the middle of inserting TOAST tuples. Out of disk space comes to mind, or maybe some kind of corruption that throws an elog(). As far as back-patching goes, the question is whether it's worth the risk. Introducing new WAL logging at this point could certainly cause performance problems if nothing else, never mind the risk of garden-variety bugs. I'm not sure it's worth taking that risk in released branches for the sake of a bug which has existed for a decade without anybody finding it until now. I'm not going to argue strongly for that position, but I think it's worth thinking about. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Robert Haas wrote: > On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > > Hm. We can't easily do that in the back-patched version; because a > > standby won't know to check for the flag . That's kinda ok, since we > > don't yet need to clear all-visible yet at that point of > > heap_update. But that better means we don't do so on the master either. > > Is there any reason to back-patch this in the first place? Wasn't this determined to be a pre-existing bug? I think the probability of occurrence has increased, but it's still possible in earlier releases. I wonder if there are unexplained bugs that can be traced down to this. I'm not really following this (sorry about that) but I wonder if (in back branches) the failure to propagate in case the standby wasn't updated can cause actual problems. If it does, maybe it'd be a better idea to have a new WAL record type instead of piggybacking on lock tuple. Then again, apparently the probability of this bug is low enough that we shouldn't sweat over it ... Moreso considering Robert's apparent opinion that perhaps we shouldn't backpatch at all in the first place. In any case I would like to see much more commentary in the patch next to the new XLHL flag. It's sufficiently different than the rest than it deserves so, IMO. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Wed, Jul 6, 2016 at 6:06 PM, Andres Freund wrote: > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. Is there any reason to back-patch this in the first place? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Thank you for reviewing! On Thu, Jul 7, 2016 at 7:06 AM, Andres Freund wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: >> diff --git a/src/backend/access/heap/heapam.c >> b/src/backend/access/heap/heapam.c >> index 57da57a..fd66527 100644 >> --- a/src/backend/access/heap/heapam.c >> +++ b/src/backend/access/heap/heapam.c >> @@ -3923,6 +3923,17 @@ l2: >> >> if (need_toast || newtupsize > pagefree) >> { >> + /* >> + * To prevent data corruption due to updating old tuple by >> + * other backends after released buffer > > That's not really the reason, is it? The prime problem is crash safety / > replication. The row-lock we're faking (by setting xmax to our xid), > prevents concurrent updates until this backend died. Fixed. >> , we need to emit that >> + * xmax of old tuple is set and clear visibility map bits if >> + * needed before releasing buffer. We can reuse xl_heap_lock >> + * for this purpose. It should be fine even if we crash midway >> + * from this section and the actual updating one later, since >> + * the xmax will appear to come from an aborted xid. >> + */ >> + START_CRIT_SECTION(); >> + >> /* Clear obsolete visibility flags ... */ >> oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); >> oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; >> @@ -3936,6 +3947,46 @@ l2: >> /* temporarily make it look not-updated */ >> oldtup.t_data->t_ctid = oldtup.t_self; >> already_marked = true; >> + >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(BufferGetPage(buffer))) >> + { >> + all_visible_cleared = true; >> + PageClearAllVisible(BufferGetPage(buffer)); >> + visibilitymap_clear(relation, >> BufferGetBlockNumber(buffer), >> + vmbuffer); >> + } >> + >> + MarkBufferDirty(buffer); >> + >> + if (RelationNeedsWAL(relation)) >> + { >> + xl_heap_lock xlrec; >> + XLogRecPtr recptr; >> + >> + /* >> + * For logical decoding we need combocids to properly >> decode the >> + * catalog. 
>> + */ >> + if (RelationIsAccessibleInLogicalDecoding(relation)) >> + log_heap_new_cid(relation, &oldtup); > > Hm, I don't see that being necessary here. Row locks aren't logically > decoded, so there's no need to emit this here. Fixed. > >> + /* Clear PD_ALL_VISIBLE flags */ >> + if (PageIsAllVisible(page)) >> + { >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber block = BufferGetBlockNumber(*buffer); >> + >> + all_visible_cleared = true; >> + PageClearAllVisible(page); >> + visibilitymap_pin(relation, block, &vmbuffer); >> + visibilitymap_clear(relation, block, vmbuffer); >> + } >> + > > That clears all-visible unnecessarily, we only need to clear all-frozen. > Fixed. > >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relationreln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h >> b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK0x08 >> #define XLHL_KEYS_UPDATED0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't ye
Re: [HACKERS] Reviewing freeze map code
On Thu, Jul 7, 2016 at 3:36 AM, Andres Freund wrote: > On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > >> @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) >> } >> HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); >> HeapTupleHeaderSetCmax(htup, FirstCommandId, false); >> + >> + /* The visibility map need to be cleared */ >> + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) >> + { >> + RelFileNode rnode; >> + Buffer vmbuffer = InvalidBuffer; >> + BlockNumber blkno; >> + Relationreln; >> + >> + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); >> + reln = CreateFakeRelcacheEntry(rnode); >> + >> + visibilitymap_pin(reln, blkno, &vmbuffer); >> + visibilitymap_clear(reln, blkno, vmbuffer); >> + PageClearAllVisible(page); >> + } >> + > > >> PageSetLSN(page, lsn); >> MarkBufferDirty(buffer); >> } >> diff --git a/src/include/access/heapam_xlog.h >> b/src/include/access/heapam_xlog.h >> index a822d0b..41b3c54 100644 >> --- a/src/include/access/heapam_xlog.h >> +++ b/src/include/access/heapam_xlog.h >> @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info >> #define XLHL_XMAX_EXCL_LOCK 0x04 >> #define XLHL_XMAX_KEYSHR_LOCK0x08 >> #define XLHL_KEYS_UPDATED0x10 >> +#define XLHL_ALL_VISIBLE_CLEARED 0x20 > > Hm. We can't easily do that in the back-patched version; because a > standby won't know to check for the flag . That's kinda ok, since we > don't yet need to clear all-visible yet at that point of > heap_update. But that better means we don't do so on the master either. > To clarify, do you mean to say that lets have XLHL_ALL_FROZEN_CLEARED and do that just for master. For back-branches no need to clear visibility any flags? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-07-05 23:37:59 +0900, Masahiko Sawada wrote: > diff --git a/src/backend/access/heap/heapam.c > b/src/backend/access/heap/heapam.c > index 57da57a..fd66527 100644 > --- a/src/backend/access/heap/heapam.c > +++ b/src/backend/access/heap/heapam.c > @@ -3923,6 +3923,17 @@ l2: > > if (need_toast || newtupsize > pagefree) > { > + /* > + * To prevent data corruption due to updating old tuple by > + * other backends after released buffer That's not really the reason, is it? The prime problem is crash safety / replication. The row-lock we're faking (by setting xmax to our xid), prevents concurrent updates until this backend died. > , we need to emit that > + * xmax of old tuple is set and clear visibility map bits if > + * needed before releasing buffer. We can reuse xl_heap_lock > + * for this purpose. It should be fine even if we crash midway > + * from this section and the actual updating one later, since > + * the xmax will appear to come from an aborted xid. > + */ > + START_CRIT_SECTION(); > + > /* Clear obsolete visibility flags ... */ > oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); > oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; > @@ -3936,6 +3947,46 @@ l2: > /* temporarily make it look not-updated */ > oldtup.t_data->t_ctid = oldtup.t_self; > already_marked = true; > + > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(BufferGetPage(buffer))) > + { > + all_visible_cleared = true; > + PageClearAllVisible(BufferGetPage(buffer)); > + visibilitymap_clear(relation, > BufferGetBlockNumber(buffer), > + vmbuffer); > + } > + > + MarkBufferDirty(buffer); > + > + if (RelationNeedsWAL(relation)) > + { > + xl_heap_lock xlrec; > + XLogRecPtr recptr; > + > + /* > + * For logical decoding we need combocids to properly > decode the > + * catalog. > + */ > + if (RelationIsAccessibleInLogicalDecoding(relation)) > + log_heap_new_cid(relation, &oldtup); Hm, I don't see that being necessary here. 
Row locks aren't logically decoded, so there's no need to emit this here. > + /* Clear PD_ALL_VISIBLE flags */ > + if (PageIsAllVisible(page)) > + { > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber block = BufferGetBlockNumber(*buffer); > + > + all_visible_cleared = true; > + PageClearAllVisible(page); > + visibilitymap_pin(relation, block, &vmbuffer); > + visibilitymap_clear(relation, block, vmbuffer); > + } > + That clears all-visible unnecessarily, we only need to clear all-frozen. > @@ -8694,6 +8761,23 @@ heap_xlog_lock(XLogReaderState *record) > } > HeapTupleHeaderSetXmax(htup, xlrec->locking_xid); > HeapTupleHeaderSetCmax(htup, FirstCommandId, false); > + > + /* The visibility map need to be cleared */ > + if ((xlrec->infobits_set & XLHL_ALL_VISIBLE_CLEARED) != 0) > + { > + RelFileNode rnode; > + Buffer vmbuffer = InvalidBuffer; > + BlockNumber blkno; > + Relationreln; > + > + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); > + reln = CreateFakeRelcacheEntry(rnode); > + > + visibilitymap_pin(reln, blkno, &vmbuffer); > + visibilitymap_clear(reln, blkno, vmbuffer); > + PageClearAllVisible(page); > + } > + > PageSetLSN(page, lsn); > MarkBufferDirty(buffer); > } > diff --git a/src/include/access/heapam_xlog.h > b/src/include/access/heapam_xlog.h > index a822d0b..41b3c54 100644 > --- a/src/include/access/heapam_xlog.h > +++ b/src/include/access/heapam_xlog.h > @@ -242,6 +242,7 @@ typedef struct xl_heap_cleanup_info > #define XLHL_XMAX_EXCL_LOCK 0x04 > #define XLHL_XMAX_KEYSHR_LOCK0x08 > #define XLHL_KEYS_UPDATED0x10 > +#define XLHL_ALL_VISIBLE_CLEARED 0x20 Hm. We can't easily do that in the back-patched version; because a standby won't know to check for the flag . That's kinda ok, since we don't yet need to clear all-visible yet at that point of heap_update. But that better means we don't do so on the master either. Greetings, Andres Freund -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 4, 2016 at 5:44 PM, Masahiko Sawada wrote: > On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila wrote: >> On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada >> wrote: >>> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila >>> wrote: Why do you think IndexOnlyScan will return wrong result? If the server crash in the way as you described, the transaction that has made modifications will anyway be considered aborted, so the result of IndexOnlyScan should not be wrong. >>> >>> Ah, you're right, I misunderstood. >>> >>> Attached updated patch incorporating your comments. >>> I've changed it so that heap_xlog_lock clears vm flags if page is >>> marked all frozen. >>> >> >> I think we should make a similar change in heap_lock_tuple API as >> well. >> Also, currently by default heap_xlog_lock tuple tries to clear >> the visibility flags, isn't it better to handle it as we do at all >> other places (ex. see log_heap_update), by logging the information >> about same. > > I will deal with them. > >> Though, it is important to get the patch right, but I feel in the >> meantime, it might be better to start benchmarking. AFAIU, even if >> change some part of information while WAL logging it, the benchmark >> results won't be much different. > > Okay, I will do the benchmark test as well. > I measured the throughput and the output quantity of WAL with HEAD and HEAD+patch(attached) on my virtual environment. I used pgbench with attached custom script file which sets 3200 length string to the filler column in order to make toast data. The scale factor is 1000 and pgbench options are -c 4 -T 600 -f toast_test.sql. After changing the filler column to the text data type I ran it. * Throughput HEAD : 1833.204172 Patched : 1827.399482 * Output quantity of WAL HEAD : 7771 MB Patched : 8082 MB The throughput is almost the same, but the output quantity of WAL is slightly increased. 
(about 4%) Regards, -- Masahiko Sawada toast_test.sql Description: Binary data diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c index 57da57a..fd66527 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -3923,6 +3923,17 @@ l2: if (need_toast || newtupsize > pagefree) { + /* + * To prevent data corruption due to updating old tuple by + * other backends after released buffer, we need to emit that + * xmax of old tuple is set and clear visibility map bits if + * needed before releasing buffer. We can reuse xl_heap_lock + * for this purpose. It should be fine even if we crash midway + * from this section and the actual updating one later, since + * the xmax will appear to come from an aborted xid. + */ + START_CRIT_SECTION(); + /* Clear obsolete visibility flags ... */ oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED); oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED; @@ -3936,6 +3947,46 @@ l2: /* temporarily make it look not-updated */ oldtup.t_data->t_ctid = oldtup.t_self; already_marked = true; + + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(BufferGetPage(buffer))) + { + all_visible_cleared = true; + PageClearAllVisible(BufferGetPage(buffer)); + visibilitymap_clear(relation, BufferGetBlockNumber(buffer), +vmbuffer); + } + + MarkBufferDirty(buffer); + + if (RelationNeedsWAL(relation)) + { + xl_heap_lock xlrec; + XLogRecPtr recptr; + + /* + * For logical decoding we need combocids to properly decode the + * catalog. 
+ */ + if (RelationIsAccessibleInLogicalDecoding(relation)) +log_heap_new_cid(relation, &oldtup); + + XLogBeginInsert(); + XLogRegisterBuffer(0, buffer, REGBUF_STANDARD); + + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); + xlrec.locking_xid = xmax_old_tuple; + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, + oldtup.t_data->t_infomask2); + if (all_visible_cleared) +xlrec.infobits_set |= XLHL_ALL_VISIBLE_CLEARED; + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); + PageSetLSN(page, recptr); + } + + END_CRIT_SECTION(); + LockBuffer(buffer, BUFFER_LOCK_UNLOCK); /* @@ -4140,7 +4191,8 @@ l2: */ if (RelationIsAccessibleInLogicalDecoding(relation)) { - log_heap_new_cid(relation, &oldtup); + if (!already_marked) +log_heap_new_cid(relation, &oldtup); log_heap_new_cid(relation, heaptup); } @@ -4513,6 +4565,7 @@ heap_lock_tuple(Relation relation, HeapTuple tuple, new_infomask2; bool first_time = true; bool have_tuple_lock = false; + bool all_visible_cleared = false; *buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(tid)); LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); @@ -5034,6 +5087,18 @@ failed: if (HEAP_XMAX_IS_LOCKED_ONLY(new_infomask)) tuple->t_data->t_ctid = *tid; + /* Clear PD_ALL_VISIBLE flags */ + if (PageIsAllVisible(page)) + { + Buffer vmbuffer = InvalidBuff
Re: [HACKERS] Reviewing freeze map code
On Mon, Jul 4, 2016 at 2:31 PM, Masahiko Sawada wrote: > On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila wrote: >> On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund wrote: >>> On 2016-07-01 15:18:39 -0400, Robert Haas wrote: >>> Should we just clear all-visible and call it good enough? >>> >>> Given that we need to do that in heap_lock_tuple, which entirely >>> preserves all-visible (but shouldn't preserve all-frozen), ISTM we >>> better find something that doesn't invalidate all-visible. >>> >> >> Sounds logical, considering that we have a way to set all-frozen and >> vacuum does that as well. So probably either we need to have a new >> API or add a new parameter to visibilitymap_clear() to indicate the >> same. If we want to go that route, isn't it better to have >> PD_ALL_FROZEN as well? >> > > Can't we call visibilitymap_set with all-visible but not all-frozen > bits instead of clearing flags? > That doesn't sound to be an impressive way to deal. First, visibilitymap_set logs the action itself which will generate two WAL records (one for visibility map and another for lock tuple). Second, it doesn't look consistent to me. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 2, 2016 at 12:17 PM, Amit Kapila wrote: > On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund wrote: >> On 2016-07-01 15:18:39 -0400, Robert Haas wrote: >>> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada >>> wrote: >>> > Ah, you're right, I misunderstood. >>> > >>> > Attached updated patch incorporating your comments. >>> > I've changed it so that heap_xlog_lock clears vm flags if page is >>> > marked all frozen. >>> >>> I believe that this should be separated into two patches, since there >>> are two issues here: >>> >>> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. >>> 2. heap_update releases the buffer content lock without logging the >>> changes it has made. >>> >>> With respect to #1, there is no need to clear the all-visible bit, >>> only the all-frozen bit. However, that's a bit tricky given that we >>> removed PD_ALL_FROZEN. Should we think about putting that back again? >> >> I think it's fine to just do the vm lookup. >> >>> Should we just clear all-visible and call it good enough? >> >> Given that we need to do that in heap_lock_tuple, which entirely >> preserves all-visible (but shouldn't preserve all-frozen), ISTM we >> better find something that doesn't invalidate all-visible. >> > > Sounds logical, considering that we have a way to set all-frozen and > vacuum does that as well. So probably either we need to have a new > API or add a new parameter to visibilitymap_clear() to indicate the > same. If we want to go that route, isn't it better to have > PD_ALL_FROZEN as well? > Can't we call visibilitymap_set with all-visible but not all-frozen bits instead of clearing flags? Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 2, 2016 at 12:34 PM, Amit Kapila wrote: > On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada wrote: >> On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila wrote: >>> >>> Why do you think IndexOnlyScan will return wrong result? If the >>> server crash in the way as you described, the transaction that has >>> made modifications will anyway be considered aborted, so the result of >>> IndexOnlyScan should not be wrong. >>> >> >> Ah, you're right, I misunderstood. >> >> Attached updated patch incorporating your comments. >> I've changed it so that heap_xlog_lock clears vm flags if page is >> marked all frozen. >> > > I think we should make a similar change in heap_lock_tuple API as > well. > Also, currently by default heap_xlog_lock tuple tries to clear > the visibility flags, isn't it better to handle it as we do at all > other places (ex. see log_heap_update), by logging the information > about same. I will deal with them. > Though, it is important to get the patch right, but I feel in the > meantime, it might be better to start benchmarking. AFAIU, even if > change some part of information while WAL logging it, the benchmark > results won't be much different. Okay, I will do the benchmark test as well. Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 1, 2016 at 7:52 PM, Masahiko Sawada wrote: > On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila wrote: >> >> Why do you think IndexOnlyScan will return wrong result? If the >> server crash in the way as you described, the transaction that has >> made modifications will anyway be considered aborted, so the result of >> IndexOnlyScan should not be wrong. >> > > Ah, you're right, I misunderstood. > > Attached updated patch incorporating your comments. > I've changed it so that heap_xlog_lock clears vm flags if page is > marked all frozen. > I think we should make a similar change in heap_lock_tuple API as well. Also, currently by default heap_xlog_lock tuple tries to clear the visibility flags, isn't it better to handle it as we do at all other places (ex. see log_heap_update), by logging the information about same. I think it is always advisable to log every action we want replay to perform. That way, it is always easy to extend it based on if some change is required only in certain cases, but not in others. Though, it is important to get the patch right, but I feel in the meantime, it might be better to start benchmarking. AFAIU, even if change some part of information while WAL logging it, the benchmark results won't be much different. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Sat, Jul 2, 2016 at 12:53 AM, Andres Freund wrote: > On 2016-07-01 15:18:39 -0400, Robert Haas wrote: >> On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada >> wrote: >> > Ah, you're right, I misunderstood. >> > >> > Attached updated patch incorporating your comments. >> > I've changed it so that heap_xlog_lock clears vm flags if page is >> > marked all frozen. >> >> I believe that this should be separated into two patches, since there >> are two issues here: >> >> 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. >> 2. heap_update releases the buffer content lock without logging the >> changes it has made. >> >> With respect to #1, there is no need to clear the all-visible bit, >> only the all-frozen bit. However, that's a bit tricky given that we >> removed PD_ALL_FROZEN. Should we think about putting that back again? > > I think it's fine to just do the vm lookup. > >> Should we just clear all-visible and call it good enough? > > Given that we need to do that in heap_lock_tuple, which entirely > preserves all-visible (but shouldn't preserve all-frozen), ISTM we > better find something that doesn't invalidate all-visible. > Sounds logical, considering that we have a way to set all-frozen and vacuum does that as well. So probably either we need to have a new API or add a new parameter to visibilitymap_clear() to indicate the same. If we want to go that route, isn't it better to have PD_ALL_FROZEN as well? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 7/1/16 3:43 PM, Andres Freund wrote: On 2016-07-01 15:42:22 -0500, Jim Nasby wrote: On 7/1/16 2:23 PM, Andres Freund wrote: The only cost of that is that vacuum will come along and mark the page all-visible again instead of skipping it, but that's probably not an enormous expense in most cases. I think the main cost is not having the page marked as all-visible for index-only purposes. If it's an insert mostly table, it can be a long while till vacuum comes around. ISTM that's something that should be addressed anyway (and separately), no? Huh? That's the current behaviour in heap_lock_tuple. Oh, I was referring to autovac not being aggressive enough on insert-mostly tables. Certainly if there's a reasonable way to avoid invalidating the VM when locking a tuple that'd be good. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-07-01 15:42:22 -0500, Jim Nasby wrote: > On 7/1/16 2:23 PM, Andres Freund wrote: > > > > The only > > > > cost of that is that vacuum will come along and mark the page > > > > all-visible again instead of skipping it, but that's probably not an > > > > enormous expense in most cases. > > I think the main cost is not having the page marked as all-visible for > > index-only purposes. If it's an insert mostly table, it can be a long > > while till vacuum comes around. > > ISTM that's something that should be addressed anyway (and separately), no? Huh? That's the current behaviour in heap_lock_tuple. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 7/1/16 2:23 PM, Andres Freund wrote: > The only > cost of that is that vacuum will come along and mark the page > all-visible again instead of skipping it, but that's probably not an > enormous expense in most cases. I think the main cost is not having the page marked as all-visible for index-only purposes. If it's an insert mostly table, it can be a long while till vacuum comes around. ISTM that's something that should be addressed anyway (and separately), no? -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com 855-TREBLE2 (855-873-2532) mobile: 512-569-9461 -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-07-01 15:18:39 -0400, Robert Haas wrote: > On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada > wrote: > > Ah, you're right, I misunderstood. > > > > Attached updated patch incorporating your comments. > > I've changed it so that heap_xlog_lock clears vm flags if page is > > marked all frozen. > > I believe that this should be separated into two patches, since there > are two issues here: > > 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. > 2. heap_update releases the buffer content lock without logging the > changes it has made. > > With respect to #1, there is no need to clear the all-visible bit, > only the all-frozen bit. However, that's a bit tricky given that we > removed PD_ALL_FROZEN. Should we think about putting that back again? I think it's fine to just do the vm lookup. > Should we just clear all-visible and call it good enough? Given that we need to do that in heap_lock_tuple, which entirely preserves all-visible (but shouldn't preserve all-frozen), ISTM we better find something that doesn't invalidate all-visible. > The only > cost of that is that vacuum will come along and mark the page > all-visible again instead of skipping it, but that's probably not an > enormous expense in most cases. I think the main cost is not having the page marked as all-visible for index-only purposes. If it's an insert mostly table, it can be a long while till vacuum comes around. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 1, 2016 at 10:22 AM, Masahiko Sawada wrote: > Ah, you're right, I misunderstood. > > Attached updated patch incorporating your comments. > I've changed it so that heap_xlog_lock clears vm flags if page is > marked all frozen. I believe that this should be separated into two patches, since there are two issues here: 1. Locking a tuple doesn't clear the all-frozen bit, but needs to do so. 2. heap_update releases the buffer content lock without logging the changes it has made. With respect to #1, there is no need to clear the all-visible bit, only the all-frozen bit. However, that's a bit tricky given that we removed PD_ALL_FROZEN. Should we think about putting that back again? Should we just clear all-visible and call it good enough? The only cost of that is that vacuum will come along and mark the page all-visible again instead of skipping it, but that's probably not an enormous expense in most cases. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jul 1, 2016 at 11:12 AM, Amit Kapila wrote: > On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada > wrote: >> On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila wrote: >>> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: > On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund > wrote: > > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: > >> There is nothing in this record which recorded the information about > >> visibility clear flag. > > > > I think we can actually defer the clearing to the lock release? > > How about the case if after we release the lock on page, the heap page > gets flushed, but not vm and then server crashes? In the released branches there's no need to clear all visible at that point. Note how heap_lock_tuple doesn't clear it at all. So we should be fine there, and that's the part where reusing an existing record is important (for compatibility). >>> >>> For back branches, I agree that heap_lock_tuple is sufficient, >> >> Even if we use heap_lock_tuple, If server crashed after flushed heap >> but not vm, after crash recovery the heap is still marked all-visible >> on vm. > > So, in this case both vm and page will be marked as all_visible. > >> This case could be happen even on released branched, and could make >> IndexOnlyScan returns wrong result? >> > > Why do you think IndexOnlyScan will return wrong result? If the > server crash in the way as you described, the transaction that has > made modifications will anyway be considered aborted, so the result of > IndexOnlyScan should not be wrong. > Ah, you're right, I misunderstood. Attached updated patch incorporating your comments. I've changed it so that heap_xlog_lock clears vm flags if page is marked all frozen. 
Regards, -- Masahiko Sawada emit_wal_already_marked_true_case_v2.patch Description: Binary data
Re: [HACKERS] Reviewing freeze map code
On Thu, Jun 30, 2016 at 8:10 PM, Masahiko Sawada wrote: > On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila wrote: >> On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: >>> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >> There is nothing in this record which recorded the information about >> visibility clear flag. > > I think we can actually defer the clearing to the lock release? How about the case if after we release the lock on page, the heap page gets flushed, but not vm and then server crashes? >>> >>> In the released branches there's no need to clear all visible at that >>> point. Note how heap_lock_tuple doesn't clear it at all. So we should be >>> fine there, and that's the part where reusing an existing record is >>> important (for compatibility). >>> >> >> For back branches, I agree that heap_lock_tuple is sufficient, > > Even if we use heap_lock_tuple, If server crashed after flushed heap > but not vm, after crash recovery the heap is still marked all-visible > on vm. So, in this case both vm and page will be marked as all_visible. > This case could be happen even on released branched, and could make > IndexOnlyScan returns wrong result? > Why do you think IndexOnlyScan will return wrong result? If the server crash in the way as you described, the transaction that has made modifications will anyway be considered aborted, so the result of IndexOnlyScan should not be wrong. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Thu, Jun 30, 2016 at 3:12 PM, Amit Kapila wrote: > On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: >> On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: >>> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: >>> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >>> >> There is nothing in this record which recorded the information about >>> >> visibility clear flag. >>> > >>> > I think we can actually defer the clearing to the lock release? >>> >>> How about the case if after we release the lock on page, the heap page >>> gets flushed, but not vm and then server crashes? >> >> In the released branches there's no need to clear all visible at that >> point. Note how heap_lock_tuple doesn't clear it at all. So we should be >> fine there, and that's the part where reusing an existing record is >> important (for compatibility). >> > > For back branches, I agree that heap_lock_tuple is sufficient, Even if we use heap_lock_tuple, If server crashed after flushed heap but not vm, after crash recovery the heap is still marked all-visible on vm. This case could be happen even on released branched, and could make IndexOnlyScan returns wrong result? Regards, -- Masahiko Sawada -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Thu, Jun 30, 2016 at 9:13 AM, Andres Freund wrote: > On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: >> On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: >> > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >> >> There is nothing in this record which recorded the information about >> >> visibility clear flag. >> > >> > I think we can actually defer the clearing to the lock release? >> >> How about the case if after we release the lock on page, the heap page >> gets flushed, but not vm and then server crashes? > > In the released branches there's no need to clear all visible at that > point. Note how heap_lock_tuple doesn't clear it at all. So we should be > fine there, and that's the part where reusing an existing record is > important (for compatibility). > For back branches, I agree that heap_lock_tuple is sufficient, but in that case we should not clear the vm or page bit at all as done in proposed patch. > But your question made me realize that we desperately *do* need to > clear the frozen bit in heap_lock_tuple in 9.6... > Right. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-06-30 08:59:16 +0530, Amit Kapila wrote: > On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: > > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: > >> There is nothing in this record which recorded the information about > >> visibility clear flag. > > > > I think we can actually defer the clearing to the lock release? > > How about the case if after we release the lock on page, the heap page > gets flushed, but not vm and then server crashes? In the released branches there's no need to clear all visible at that point. Note how heap_lock_tuple doesn't clear it at all. So we should be fine there, and that's the part where reusing an existing record is important (for compatibility). But your question made me realize that we desperately *do* need to clear the frozen bit in heap_lock_tuple in 9.6... Greetings, Andres Freund
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 29, 2016 at 10:30 PM, Andres Freund wrote: > On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: >> There is nothing in this record which recorded the information about >> visibility clear flag. > > I think we can actually defer the clearing to the lock release? How about the case if after we release the lock on page, the heap page gets flushed, but not vm and then server crashes? After recovery, vacuum will never consider such a page for freezing as the vm bit still says all_frozen. Another possibility could be that WAL for xl_heap_lock got flushed, but not the heap page before crash, now after recovery, it will set the tuple with appropriate infomask and other flags, but the heap page will still be marked as ALL_VISIBLE. I think that can lead to a situation which Thomas Munro has reported upthread. All other cases in heapam.c, after clearing vm and corresponding flag in heap page, we are recording the same in WAL. Why to make this a different case and how is it safe to do it here and not at other places. > A tuple > being locked doesn't require the vm being cleared. > > >> I think in this approach, it is important to measure the performance >> of update, may be you can use simple-update option of pgbench for >> various workloads. Try it with different fill factors (-F fillfactor >> option in pgbench). > > Probably not sufficient, also needs toast activity, to show the really > bad case of many lock releases. Okay, makes sense. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
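The two crash scenarios Amit describes reduce to one rule: the redo routine for the tuple-lock record must also clear the page and visibility-map bits, because either the heap page or the VM page can reach disk without the other before a crash. The following deliberately simplified model (not real PostgreSQL code; the struct and fields are invented for illustration) shows why a WAL record whose redo omits the visibility clearing leaves stale bits behind.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy crash-recovery model: a tuple-lock change, the PD_ALL_VISIBLE-style
 * page flag, and a VM bit. Purely illustrative. */
struct pagestate
{
    bool tuple_locked;      /* xmax/infomask change on the heap page */
    bool page_all_visible;  /* all-visible flag on the heap page */
    bool vm_all_frozen;     /* bit in the visibility map page */
};

/* After a crash, only what was flushed survives; then WAL redo runs.
 * If the lock record's redo also clears visibility state, the outcome
 * is consistent for every flush ordering; otherwise stale bits remain. */
static struct pagestate recover(struct pagestate on_disk, bool wal_flushed,
                                bool redo_clears_visibility)
{
    struct pagestate s = on_disk;
    if (wal_flushed)
    {
        s.tuple_locked = true;              /* redo the tuple-lock change */
        if (redo_clears_visibility)
        {
            s.page_all_visible = false;     /* redo the page-bit clear */
            s.vm_all_frozen = false;        /* redo the VM-bit clear */
        }
    }
    return s;
}
```

With `redo_clears_visibility` false, the xl_heap_lock record is replayed but the page is still marked all-visible/all-frozen, which is exactly the inconsistency described above.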
Re: [HACKERS] Reviewing freeze map code
On 2016-06-29 19:04:31 +0530, Amit Kapila wrote: > There is nothing in this record which recorded the information about > visibility clear flag. I think we can actually defer the clearing to the lock release? A tuple being locked doesn't require the vm being cleared. > I think in this approach, it is important to measure the performance > of update, may be you can use simple-update option of pgbench for > various workloads. Try it with different fill factors (-F fillfactor > option in pgbench). Probably not sufficient, also needs toast activity, to show the really bad case of many lock releases.
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 29, 2016 at 11:14 AM, Masahiko Sawada wrote: > On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila wrote: >> On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund wrote: >>> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: Andres Freund wrote: > I'm looking into three approaches right now: > > 3) Use WAL logging for the already_marked = true case. > 3) This approach so far seems the best. It's possible to reuse the > xl_heap_lock record (in an afaics backwards compatible manner), and in > most cases the overhead isn't that large. It's of course annoying to > emit more WAL, but it's not that big an overhead compared to extending a > file, or to toasting. It's also by far the simplest fix. >> >>> >> >> You are right, I think we can try such an optimization in Head and >> that too if we see a performance hit with adding this new WAL in >> heap_update. >> >> > > +1 for #3 approach, and attached draft patch for that. > I think attached patch would fix this problem but please let me know > if this patch is not what you're thinking. > Review comments: + if (RelationNeedsWAL(relation)) + { + xl_heap_lock xlrec; + XLogRecPtr recptr; + .. + xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self); + xlrec.locking_xid = xid; + xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask, + oldtup.t_data->t_infomask2); + XLogRegisterData((char *) &xlrec, SizeOfHeapLock); + recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK); + PageSetLSN(page, recptr); + } There is nothing in this record which recorded the information about visibility clear flag. How will you ensure to clear the flag after crash? Have you considered to log cid using log_heap_new_cid() for logical decoding? It seems to me that the value of locking_xid should be xmax_old_tuple, why you have chosen xid? 
+ /* Celar PD_ALL_VISIBLE flags */
+ if (PageIsAllVisible(BufferGetPage(buffer)))
+ {
+ all_visible_cleared = true;
+ PageClearAllVisible(BufferGetPage(buffer));
+ visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+ vmbuffer);
+ }
+
+ MarkBufferDirty(buffer);
+
  /* Clear obsolete visibility flags ... */
  oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);

I think it is better to first update the tuple-related info and then clear the PD_ALL_VISIBLE flags (for the order, refer to how we have done it in heap_update in the code below where you are trying to add new code).

A couple of typos:
/relasing/releasing
/Celar/Clear

I think in this approach, it is important to measure the performance of update; maybe you can use the simple-update option of pgbench for various workloads. Try it with different fill factors (-F fillfactor option in pgbench).

-- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Fri, Jun 24, 2016 at 11:04 AM, Amit Kapila wrote: > On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund wrote: >> On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: >>> Andres Freund wrote: >>> >>> > I'm looking into three approaches right now: >>> > >>> > 3) Use WAL logging for the already_marked = true case. >>> >>> >>> > 3) This approach so far seems the best. It's possible to reuse the >>> > xl_heap_lock record (in an afaics backwards compatible manner), and in >>> > most cases the overhead isn't that large. It's of course annoying to >>> > emit more WAL, but it's not that big an overhead compared to extending a >>> > file, or to toasting. It's also by far the simplest fix. >>> > > +1 for proceeding with Approach-3. > >>> I suppose it's fine if we crash midway from emitting this wal record and >>> the actual heap_update one, since the xmax will appear to come from an >>> aborted xid, right? >> >> Yea, that should be fine. >> >> >>> I agree that the overhead is probably negligible, considering that this >>> only happens when toast is invoked. It's probably not as great when the >>> new tuple goes to another page, though. >> >> I think it has to happen in both cases unfortunately. We could try to >> add some optimizations (e.g. only release lock & WAL log if the target >> page, via fsm, is before the current one), but I don't really want to go >> there in the back branches. >> > > You are right, I think we can try such an optimization in Head and > that too if we see a performance hit with adding this new WAL in > heap_update. > > +1 for #3 approach, and attached draft patch for that. I think attached patch would fix this problem but please let me know if this patch is not what you're thinking. 
Regards, -- Masahiko Sawada

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 57da57a..2f3fd83 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -3923,6 +3923,28 @@ l2:
 
 	if (need_toast || newtupsize > pagefree)
 	{
+		/*
+		 * To prevent data corruption due to updating old tuple by
+		 * other backends after released buffer, we need to emit that
+		 * xmax of old tuple is set and clear visibility map bits if
+		 * needed before relasing buffer. We can reuse xl_heap_lock
+		 * for this pupose. It should be fine even if we crash midway
+		 * from this section and the actual updating one later, since
+		 * the xmax will appear to come from an aborted xid.
+		 */
+		START_CRIT_SECTION();
+
+		/* Celar PD_ALL_VISIBLE flags */
+		if (PageIsAllVisible(BufferGetPage(buffer)))
+		{
+			all_visible_cleared = true;
+			PageClearAllVisible(BufferGetPage(buffer));
+			visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
+								vmbuffer);
+		}
+
+		MarkBufferDirty(buffer);
+
 		/* Clear obsolete visibility flags ... */
 		oldtup.t_data->t_infomask &= ~(HEAP_XMAX_BITS | HEAP_MOVED);
 		oldtup.t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
@@ -3936,6 +3958,26 @@ l2:
 		/* temporarily make it look not-updated */
 		oldtup.t_data->t_ctid = oldtup.t_self;
 		already_marked = true;
+
+		if (RelationNeedsWAL(relation))
+		{
+			xl_heap_lock xlrec;
+			XLogRecPtr	recptr;
+
+			XLogBeginInsert();
+			XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
+
+			xlrec.offnum = ItemPointerGetOffsetNumber(&oldtup.t_self);
+			xlrec.locking_xid = xid;
+			xlrec.infobits_set = compute_infobits(oldtup.t_data->t_infomask,
+												  oldtup.t_data->t_infomask2);
+			XLogRegisterData((char *) &xlrec, SizeOfHeapLock);
+			recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_LOCK);
+			PageSetLSN(page, recptr);
+		}
+
+		END_CRIT_SECTION();
+
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 
 /*
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 28, 2016 at 8:06 PM, Masahiko Sawada wrote: > On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund wrote: >> On 2016-06-20 17:55:19 -0400, Robert Haas wrote: >>> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: >>> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >>> >> What exactly is the point of all of that already_marked stuff? >>> > >>> > Preventing the old tuple from being locked/updated by another backend, >>> > while unlocking the buffer. >>> > >>> >> I >>> >> mean, suppose we just don't do any of that before we go off to do >>> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >>> >> when we reacquire the page lock, we might find that somebody else has >>> >> already updated the tuple, but couldn't that be handled by >>> >> (approximately) looping back up to l2 just as we do in several other >>> >> cases? >>> > >>> > We'd potentially have to undo a fair amount more work: the toasted data >>> > would have to be deleted and such, just to retry. Which isn't going to >>> > super easy, because all of it will be happening with the current cid (we >>> > can't just increase CommandCounterIncrement() for correctness reasons). >>> >>> Why would we have to delete the TOAST data? AFAIUI, the tuple points >>> to the TOAST data, but not the other way around. So if we change our >>> mind about where to put the tuple, I don't think that requires >>> re-TOASTing. >> >> Consider what happens if we, after restarting at l2, notice that we >> can't actually insert, but return in the !HeapTupleMayBeUpdated >> branch. If the caller doesn't error out - and there's certainly callers >> doing that - we'd "leak" a toasted datum. > > Sorry for interrupt you, but I have a question about this case. > Is there case where we back to l2 after we created toasted > datum(called toast_insert_or_update)? > IIUC, after we stored toast datum we just insert heap tuple and log > WAL (or error out for some reasons). > I understood now, sorry for the noise. 
Regards, -- Masahiko Sawada
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 6:59 AM, Andres Freund wrote: > On 2016-06-20 17:55:19 -0400, Robert Haas wrote: >> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: >> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> >> What exactly is the point of all of that already_marked stuff? >> > >> > Preventing the old tuple from being locked/updated by another backend, >> > while unlocking the buffer. >> > >> >> I >> >> mean, suppose we just don't do any of that before we go off to do >> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> >> when we reacquire the page lock, we might find that somebody else has >> >> already updated the tuple, but couldn't that be handled by >> >> (approximately) looping back up to l2 just as we do in several other >> >> cases? >> > >> > We'd potentially have to undo a fair amount more work: the toasted data >> > would have to be deleted and such, just to retry. Which isn't going to >> > super easy, because all of it will be happening with the current cid (we >> > can't just increase CommandCounterIncrement() for correctness reasons). >> >> Why would we have to delete the TOAST data? AFAIUI, the tuple points >> to the TOAST data, but not the other way around. So if we change our >> mind about where to put the tuple, I don't think that requires >> re-TOASTing. > > Consider what happens if we, after restarting at l2, notice that we > can't actually insert, but return in the !HeapTupleMayBeUpdated > branch. If the caller doesn't error out - and there's certainly callers > doing that - we'd "leak" a toasted datum. Sorry for interrupt you, but I have a question about this case. Is there case where we back to l2 after we created toasted datum(called toast_insert_or_update)? IIUC, after we stored toast datum we just insert heap tuple and log WAL (or error out for some reasons). 
Regards, -- Masahiko Sawada
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 10:59:25AM +1200, Thomas Munro wrote: > On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch wrote: > > I agree the non-atomic, unlogged change is a problem. A related threat > > doesn't require a torn page: > > > > AssignTransactionId() xid=123 > > heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123); > > some ERROR before heap_update() finishes > > rollback; -- xid=123 > > some backend flushes the modified page > > immediate shutdown > > AssignTransactionId() xid=123 > > commit; -- xid=123 > > > > If nothing wrote an xlog record that witnesses xid 123, the cluster can > > reassign it after recovery. The failed update is now considered a > > successful > > update, and the row improperly becomes dead. That's important. > > I wonder if that was originally supposed to be handled with the > HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in > the heap WAL logging commit f2bfe8a2 said that tqual routines would > see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging > (though I'm not sure if the tqual routines ever actually did that). HEAP_XMAX_UNLOGGED does appear to have originated in contemplation of this same hazard. Looking at the three commits in "git log -S HEAP_XMAX_UNLOGGED" (f2bfe8a b58c041 1191916), nothing ever completed the implementation by testing for that flag. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Fri, Jun 24, 2016 at 4:33 AM, Andres Freund wrote: > On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: >> Andres Freund wrote: >> >> > I'm looking into three approaches right now: >> > >> > 3) Use WAL logging for the already_marked = true case. >> >> >> > 3) This approach so far seems the best. It's possible to reuse the >> > xl_heap_lock record (in an afaics backwards compatible manner), and in >> > most cases the overhead isn't that large. It's of course annoying to >> > emit more WAL, but it's not that big an overhead compared to extending a >> > file, or to toasting. It's also by far the simplest fix. >> +1 for proceeding with Approach-3. >> I suppose it's fine if we crash midway from emitting this wal record and >> the actual heap_update one, since the xmax will appear to come from an >> aborted xid, right? > > Yea, that should be fine. > > >> I agree that the overhead is probably negligible, considering that this >> only happens when toast is invoked. It's probably not as great when the >> new tuple goes to another page, though. > > I think it has to happen in both cases unfortunately. We could try to > add some optimizations (e.g. only release lock & WAL log if the target > page, via fsm, is before the current one), but I don't really want to go > there in the back branches. > You are right, I think we can try such an optimization in Head and that too if we see a performance hit with adding this new WAL in heap_update. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-23 18:59:57 -0400, Alvaro Herrera wrote: > Andres Freund wrote: > > > I'm looking into three approaches right now: > > > > 3) Use WAL logging for the already_marked = true case. > > > > 3) This approach so far seems the best. It's possible to reuse the > > xl_heap_lock record (in an afaics backwards compatible manner), and in > > most cases the overhead isn't that large. It's of course annoying to > > emit more WAL, but it's not that big an overhead compared to extending a > > file, or to toasting. It's also by far the simplest fix. > > I suppose it's fine if we crash midway from emitting this wal record and > the actual heap_update one, since the xmax will appear to come from an > aborted xid, right? Yea, that should be fine. > I agree that the overhead is probably negligible, considering that this > only happens when toast is invoked. It's probably not as great when the > new tuple goes to another page, though. I think it has to happen in both cases unfortunately. We could try to add some optimizations (e.g. only release lock & WAL log if the target page, via fsm, is before the current one), but I don't really want to go there in the back branches. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Andres Freund wrote: > I'm looking into three approaches right now: > > 3) Use WAL logging for the already_marked = true case. > 3) This approach so far seems the best. It's possible to reuse the > xl_heap_lock record (in an afaics backwards compatible manner), and in > most cases the overhead isn't that large. It's of course annoying to > emit more WAL, but it's not that big an overhead compared to extending a > file, or to toasting. It's also by far the simplest fix. I suppose it's fine if we crash midway from emitting this wal record and the actual heap_update one, since the xmax will appear to come from an aborted xid, right? I agree that the overhead is probably negligible, considering that this only happens when toast is invoked. It's probably not as great when the new tuple goes to another page, though. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
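Álvaro's "aborted xid" point is just ordinary tuple visibility: an xmax only makes a tuple dead if the transaction that set it went on to commit. A crash between the xl_heap_lock-style record and the heap_update record means the updating xid never commits, so recovery leaves behind a harmless xmax. A one-function toy of that rule (not the real HeapTupleSatisfies* logic, which also has to consider hint bits, multixacts, and lock-only xmax values):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified visibility rule: an xmax kills a tuple only if it is set
 * AND the setting transaction committed. Illustrative only. */
static bool tuple_is_dead(uint32_t xmax, bool xmax_committed)
{
    return xmax != 0 && xmax_committed;
}
```

Crashing between the two WAL records corresponds to `tuple_is_dead(some_xid, false)`: the xmax is present on the page after replay, but since the xid is treated as aborted, every later visibility check still sees the old tuple as live.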
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 16:32:03 -0400, Robert Haas wrote: > On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund wrote: > > On 2016-06-21 15:38:25 -0400, Robert Haas wrote: > >> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: > >> >> I'm also a bit dubious that LockAcquire is safe to call in general > >> >> with interrupts held. > >> > > >> > Looks like we could just acquire the tuple-lock *before* doing the > >> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing > >> > the buffer lock. That'd allow us to do avoid doing the nested locking, > >> > should make the recovery just a goto l2;, ... > >> > >> Why isn't that racey? Somebody else can grab the tuple lock after we > >> release the buffer content lock and before we acquire the tuple lock. > > > > Sure, but by the time the tuple lock is released, they'd have updated > > xmax. So once we acquired that we can just do > > if (xmax_infomask_changed(oldtup.t_data->t_infomask, > > infomask) > > || > > > > !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), > > xwait)) > > goto l2; > > which is fine, because we've not yet done the toasting. > > I see. > > > I'm not sure wether this approach is better than deleting potentially > > toasted data though. It's probably faster, but will likely touch more > > places in the code, and eat up a infomask bit (infomask & HEAP_MOVED > > == HEAP_MOVED in my prototype). > > Ugh. That's not very desirable at all. I'm looking into three approaches right now: 1) Flag approach from above 2) Undo toasting on concurrent activity, retry 3) Use WAL logging for the already_marked = true case. 1) primarily suffers from a significant amount of complexity. I still have a bug in there that sometimes triggers "attempted to update invisible tuple" ERRORs. Otherwise it seems to perform decently performancewise - even on workloads with many backends hitting the same tuple, the retry-rate is low. 
2) Seems to work too, but due to the amount of time the tuple is not locked, the retry rate can be really high. As we perform a significant amount of work (toast insertion & index manipulation or extending a file) while the tuple is not locked, it's quite likely that another session tries to modify the tuple in between. I think it's possible to essentially livelock.

3) This approach so far seems the best. It's possible to reuse the xl_heap_lock record (in an afaics backwards-compatible manner), and in most cases the overhead isn't that large. It's of course annoying to emit more WAL, but it's not that big an overhead compared to extending a file, or to toasting. It's also by far the simplest fix.

Comments?
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 3:46 PM, Andres Freund wrote: > On 2016-06-21 15:38:25 -0400, Robert Haas wrote: >> On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: >> >> I'm also a bit dubious that LockAcquire is safe to call in general >> >> with interrupts held. >> > >> > Looks like we could just acquire the tuple-lock *before* doing the >> > toast_insert_or_update/RelationGetBufferForTuple, but after releasing >> > the buffer lock. That'd allow us to do avoid doing the nested locking, >> > should make the recovery just a goto l2;, ... >> >> Why isn't that racey? Somebody else can grab the tuple lock after we >> release the buffer content lock and before we acquire the tuple lock. > > Sure, but by the time the tuple lock is released, they'd have updated > xmax. So once we acquired that we can just do > if (xmax_infomask_changed(oldtup.t_data->t_infomask, > infomask) || > > !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), > xwait)) > goto l2; > which is fine, because we've not yet done the toasting. I see. > I'm not sure wether this approach is better than deleting potentially > toasted data though. It's probably faster, but will likely touch more > places in the code, and eat up a infomask bit (infomask & HEAP_MOVED > == HEAP_MOVED in my prototype). Ugh. That's not very desirable at all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 15:38:25 -0400, Robert Haas wrote: > On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: > >> I'm also a bit dubious that LockAcquire is safe to call in general > >> with interrupts held. > > > > Looks like we could just acquire the tuple-lock *before* doing the > > toast_insert_or_update/RelationGetBufferForTuple, but after releasing > > the buffer lock. That'd allow us to do avoid doing the nested locking, > > should make the recovery just a goto l2;, ... > > Why isn't that racey? Somebody else can grab the tuple lock after we > release the buffer content lock and before we acquire the tuple lock. Sure, but by the time the tuple lock is released, they'd have updated xmax. So once we acquired that we can just do if (xmax_infomask_changed(oldtup.t_data->t_infomask, infomask) || !TransactionIdEquals(HeapTupleHeaderGetRawXmax(oldtup.t_data), xwait)) goto l2; which is fine, because we've not yet done the toasting. I'm not sure wether this approach is better than deleting potentially toasted data though. It's probably faster, but will likely touch more places in the code, and eat up a infomask bit (infomask & HEAP_MOVED == HEAP_MOVED in my prototype). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
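The recheck Andres quotes is an optimistic-concurrency pattern: snapshot xmax and the infomask while holding the lock, drop the lock for the expensive work (toasting, finding a buffer), and on reacquire restart if anything changed. The sketch below models that shape only; the struct and the single simulated "interfering" locker are invented for illustration and are not PostgreSQL code.

```c
#include <assert.h>
#include <stdint.h>

/* Toy tuple header: just the fields the recheck compares. */
struct toytuple { uint32_t xmax; uint16_t infomask; };

/* Generic shape of the "goto l2" recovery: snapshot, unlock, work,
 * relock, and restart if xmax/infomask changed in the meantime.
 * interfering_xid != 0 simulates one concurrent locker slipping in
 * while the lock is released on the first pass. */
static int lock_tuple_with_retry(struct toytuple *t, uint32_t my_xid,
                                 uint32_t interfering_xid, int *retries)
{
    int first_pass = 1;
    for (;;)
    {
        uint32_t xwait = t->xmax;       /* snapshot while "locked" */
        uint16_t mask  = t->infomask;

        /* buffer lock released here for toast/buffer work */
        if (first_pass && interfering_xid != 0)
            t->xmax = interfering_xid;  /* concurrent session intervenes */
        first_pass = 0;

        /* lock reacquired: the equivalent of "goto l2" on any change */
        if (t->xmax != xwait || t->infomask != mask)
        {
            (*retries)++;
            continue;
        }
        t->xmax = my_xid;               /* nothing changed; proceed */
        return 0;
    }
}
```

The point Andres makes holds here too: once the snapshot matches on reacquire, the expensive work has not yet been committed to, so restarting is cheap; it only becomes painful when the restart has to undo toasted data.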
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 1:49 PM, Andres Freund wrote: >> I'm also a bit dubious that LockAcquire is safe to call in general >> with interrupts held. > > Looks like we could just acquire the tuple-lock *before* doing the > toast_insert_or_update/RelationGetBufferForTuple, but after releasing > the buffer lock. That'd allow us to do avoid doing the nested locking, > should make the recovery just a goto l2;, ... Why isn't that racey? Somebody else can grab the tuple lock after we release the buffer content lock and before we acquire the tuple lock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 13:03:24 -0400, Robert Haas wrote: > On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund wrote: > >> I don't quite understand the intended semantics of this proposed flag. > > > > Whenever the flag is set, we have to acquire the heavyweight tuple lock > > before continuing. That guarantees nobody else can modify the tuple, > > while the lock is released, without requiring to modify more than one > > hint bit. That should fix the torn page issue, no? > > Yeah, I guess that would work. > > >> If we don't already have the tuple lock at that point, we can't go > >> acquire it before releasing the content lock, can we? > > > > Why not? Afaics the way that tuple locks are used, the nesting should > > be fine. > > Well, the existing places where we acquire the tuple lock within > heap_update() are all careful to release the page lock first, so I'm > skeptical that doing it the other order is safe. Certainly, if we've > got some code that grabs the page lock and then the tuple lock and > other code that grabs the tuple lock and then the page lock, that's a > deadlock waiting to happen. Just noticed this piece of code while looking into this: UnlockReleaseBuffer(buffer); if (have_tuple_lock) UnlockTupleTuplock(relation, &(tp.t_self), LockTupleExclusive); if (vmbuffer != InvalidBuffer) ReleaseBuffer(vmbuffer); return result; seems weird to release the vmbuffer after the tuplelock... > I'm also a bit dubious that LockAcquire is safe to call in general > with interrupts held. Looks like we could just acquire the tuple-lock *before* doing the toast_insert_or_update/RelationGetBufferForTuple, but after releasing the buffer lock. That'd allow us to do avoid doing the nested locking, should make the recovery just a goto l2;, ... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
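The deadlock Robert warns about in the quoted text is the classic lock-ordering hazard: if one code path takes the page lock then the tuple lock while another takes them in the opposite order, two backends can each hold one lock and wait forever for the other. A common discipline is to give every lock a rank and acquire only in increasing rank order; the minimal checker below (not PostgreSQL code; ranks chosen to match Andres's suggestion of tuple lock before buffer content lock) flags the mixed ordering.

```c
#include <assert.h>

/* Minimal lock-ordering checker: acquiring a lock whose rank is <= the
 * highest rank already held flags a potential deadlock. Illustrative. */
enum { RANK_TUPLE_LOCK = 1, RANK_BUFFER_LOCK = 2 };

static int highest_held = 0;

static int acquire_ranked(int rank)
{
    if (rank <= highest_held)
        return -1;              /* order violation: deadlock risk */
    highest_held = rank;
    return 0;
}

static void release_all_ranked(void)
{
    highest_held = 0;
}
```

This is why the existing heap_update code releases the page lock before taking the tuple lock, and why Andres's alternative keeps a single order (tuple lock first, then buffer lock) rather than nesting them both ways.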
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 12:54 PM, Andres Freund wrote: >> I don't quite understand the intended semantics of this proposed flag. > > Whenever the flag is set, we have to acquire the heavyweight tuple lock > before continuing. That guarantees nobody else can modify the tuple, > while the lock is released, without requiring to modify more than one > hint bit. That should fix the torn page issue, no? Yeah, I guess that would work. >> If we don't already have the tuple lock at that point, we can't go >> acquire it before releasing the content lock, can we? > > Why not? Afaics the way that tuple locks are used, the nesting should > be fine. Well, the existing places where we acquire the tuple lock within heap_update() are all careful to release the page lock first, so I'm skeptical that doing it the other order is safe. Certainly, if we've got some code that grabs the page lock and then the tuple lock and other code that grabs the tuple lock and then the page lock, that's a deadlock waiting to happen. I'm also a bit dubious that LockAcquire is safe to call in general with interrupts held. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 10:50:36 -0400, Robert Haas wrote: > On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund wrote: > >> > So far the best idea I have - and it's really not a good one - is to > >> > invent a new hint-bit that tells concurrent updates to acquire a > >> > heavyweight tuple lock, while releasing the page-level lock. If that > >> > hint bit does not require any other modifications - and we don't need an > >> > xid in xmax for this use case - that'll avoid doing all the other > >> > `already_marked` stuff early, which should address the correctness > >> > issue. > >> > > >> > >> Don't we need to clear such a flag in case of error? Also don't we need to > >> reset it later, like when modifying the old page later before WAL. > > > > If the flag just says "acquire a heavyweight lock", then there's no need > > for that. That's cheap enough to just do if it's errorneously set. At > > least I can't see any reason. > > I don't quite understand the intended semantics of this proposed flag. Whenever the flag is set, we have to acquire the heavyweight tuple lock before continuing. That guarantees nobody else can modify the tuple, while the lock is released, without requiring to modify more than one hint bit. That should fix the torn page issue, no? > If we don't already have the tuple lock at that point, we can't go > acquire it before releasing the content lock, can we? Why not? Afaics the way that tuple locks are used, the nesting should be fine. Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 10:47 AM, Tom Lane wrote: > Robert Haas writes: >> On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund wrote: >>> Consider what happens if we, after restarting at l2, notice that we >>> can't actually insert, but return in the !HeapTupleMayBeUpdated >>> branch. > >> OK, I see what you mean. Still, that doesn't seem like such a >> terrible cost. If you try to update a tuple and if it looks like you >> can update it but then after TOASTing you find that the status of the >> tuple has changed such that you can't update it after all, then you >> might need to go set xmax = MyTxid() on all of the TOAST tuples you >> created (whose CTIDs we could save someplace, so that it's just a >> matter of finding them by CTID to kill them). > > ... and if you get an error or crash partway through that, what happens? Then the transaction is aborted anyway, and we haven't leaked anything because VACUUM will clean it up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 11:51 PM, Andres Freund wrote: >> > So far the best idea I have - and it's really not a good one - is to >> > invent a new hint-bit that tells concurrent updates to acquire a >> > heavyweight tuple lock, while releasing the page-level lock. If that >> > hint bit does not require any other modifications - and we don't need an >> > xid in xmax for this use case - that'll avoid doing all the other >> > `already_marked` stuff early, which should address the correctness >> > issue. >> > >> >> Don't we need to clear such a flag in case of error? Also don't we need to >> reset it later, like when modifying the old page later before WAL. > > If the flag just says "acquire a heavyweight lock", then there's no need > for that. That's cheap enough to just do if it's errorneously set. At > least I can't see any reason. I don't quite understand the intended semantics of this proposed flag. If we don't already have the tuple lock at that point, we can't go acquire it before releasing the content lock, can we? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
Robert Haas writes: > On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund wrote: >> Consider what happens if we, after restarting at l2, notice that we >> can't actually insert, but return in the !HeapTupleMayBeUpdated >> branch. > OK, I see what you mean. Still, that doesn't seem like such a > terrible cost. If you try to update a tuple and if it looks like you > can update it but then after TOASTing you find that the status of the > tuple has changed such that you can't update it after all, then you > might need to go set xmax = MyTxid() on all of the TOAST tuples you > created (whose CTIDs we could save someplace, so that it's just a > matter of finding them by CTID to kill them). ... and if you get an error or crash partway through that, what happens? regards, tom lane
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 5:59 PM, Andres Freund wrote: > On 2016-06-20 17:55:19 -0400, Robert Haas wrote: >> On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: >> > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> >> I >> >> mean, suppose we just don't do any of that before we go off to do >> >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> >> when we reacquire the page lock, we might find that somebody else has >> >> already updated the tuple, but couldn't that be handled by >> >> (approximately) looping back up to l2 just as we do in several other >> >> cases? >> > >> > We'd potentially have to undo a fair amount more work: the toasted data >> > would have to be deleted and such, just to retry. Which isn't going to >> > super easy, because all of it will be happening with the current cid (we >> > can't just increase CommandCounterIncrement() for correctness reasons). >> >> Why would we have to delete the TOAST data? AFAIUI, the tuple points >> to the TOAST data, but not the other way around. So if we change our >> mind about where to put the tuple, I don't think that requires >> re-TOASTing. > > Consider what happens if we, after restarting at l2, notice that we > can't actually insert, but return in the !HeapTupleMayBeUpdated > branch. If the caller doesn't error out - and there's certainly callers > doing that - we'd "leak" a toasted datum. Unless the transaction aborts, > the toasted datum would never be cleaned up, because there's no datum > pointing to it, so no heap_delete will ever recurse into the toast > datum (via toast_delete()). OK, I see what you mean. Still, that doesn't seem like such a terrible cost. 
If you try to update a tuple and if it looks like you can update it but then after TOASTing you find that the status of the tuple has changed such that you can't update it after all, then you might need to go set xmax = MyTxid() on all of the TOAST tuples you created (whose CTIDs we could save someplace, so that it's just a matter of finding them by CTID to kill them). That's not likely to happen particularly often, though, and when it does happen it's not insanely expensive. We could also reduce the cost by letting the caller of heap_update() decide whether to back out the work; if the caller intends to throw an error anyway, then there's no point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 9:21 AM, Andres Freund wrote: > > On 2016-06-21 08:59:13 +0530, Amit Kapila wrote: > > Can we consider to use some strategy to avoid deadlocks without releasing > > the lock on old page? Consider if we could have a mechanism such that > > RelationGetBufferForTuple() will ensure that it always returns a new buffer > > which has targetblock greater than the old block (on which we already held > > a lock). I think here tricky part is whether we can get anything like that > > from FSM. Also, there could be cases where we need to extend the heap when > > there were pages in heap with space available, but we have ignored them > > because there block number is smaller than the block number on which we > > have lock. > > I can't see that being acceptable, from a space-usage POV. > > > > So far the best idea I have - and it's really not a good one - is to > > > invent a new hint-bit that tells concurrent updates to acquire a > > > heavyweight tuple lock, while releasing the page-level lock. If that > > > hint bit does not require any other modifications - and we don't need an > > > xid in xmax for this use case - that'll avoid doing all the other > > > `already_marked` stuff early, which should address the correctness > > > issue. > > > > > > > Don't we need to clear such a flag in case of error? Also don't we need to > > reset it later, like when modifying the old page later before WAL. > > If the flag just says "acquire a heavyweight lock", then there's no need > for that. That's cheap enough to just do if it's errorneously set. At > least I can't see any reason. > I think it will just increase the chances of other backends to acquire a heavy weight lock. > > > It's kinda invasive though, and probably has performance > > > implications. > > > > > > > Do you see performance implication due to requirement of heavywieht tuple > > lock in more cases than now or something else? > > Because of that, yes. 
> > > > Some others ways could be: > > > > Before releasing the lock on buffer containing old tuple, clear the VM and > > visibility info from page and WAL log it. I think this could impact > > performance depending on how frequently we need to perform this action. > > Doubling the number of xlog inserts in heap_update would certainly be > measurable :(. My guess is that the heavyweight tuple lock approach will > be less expensive. > Probably, but I think a heavyweight tuple lock is more invasive. I think increasing the number of xlog inserts could surely impact performance, depending upon how frequently we need to call it. I think we might want to combine it with the idea of having RelationGetBufferForTuple() return a higher block number, such that if we don't find a higher block number from the FSM, then we can release the lock on the old page and try to get the locks on old and new buffers as we do now. This will further reduce the chances of increasing xlog insert calls and address the issue of space wastage. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 9:16 AM, Thomas Munro wrote: > > On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > > Some other ways could be: > > > > Before releasing the lock on buffer containing old tuple, clear the VM and > > visibility info from page and WAL log it. I think this could impact > > performance depending on how frequently we need to perform this action. > > > > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was > > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the > > same in old tuple header before releasing lock on buffer and teach tqual.c > > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as > > an indication of an aborted transaction unless it is currently in-progress. > > Also, I think we need to clear this flag before WAL logging in heap_update. > > I also noticed that and wondered whether it was a mistake to take that > out. It appears to have been removed as part of the logic to clear > away UNDO log support in 11919160, but it may have been an important > part of the heap_update protocol. Though (as I mentioned nearby in a > reply to Noah) I'm not sure if the tqual.c side which would ignore > the unlogged xmax in the event of a badly timed crash was ever > implemented. > Right, my observation is similar to yours and that's what I am suggesting as one alternative to fix this issue. I think making this approach work (even if this doesn't have any problems) might turn out to be tricky. However, the plus point of this approach seems to be that it shouldn't impact performance in most of the cases. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 9:08 AM, Thomas Munro wrote: > > On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund wrote: > >> Well, I think generally nobody seriously looked at actually refactoring > >> heap_update(), even though that'd be a good idea. But in this instance, > >> the problem seems relatively fundamental: > >> > >> We need to lock the origin page, to do visibility checks, etc. Then we > >> need to figure out the target page. Even disregarding toasting - which > >> we could be doing earlier with some refactoring - we're going to have to > >> release the page level lock, to lock them in ascending order. Otherwise > >> we'll risk kinda likely deadlocks. > > > > Can we consider to use some strategy to avoid deadlocks without releasing > > the lock on old page? Consider if we could have a mechanism such that > > RelationGetBufferForTuple() will ensure that it always returns a new buffer > > which has targetblock greater than the old block (on which we already held a > > lock). I think here tricky part is whether we can get anything like that > > from FSM. Also, there could be cases where we need to extend the heap when > > there were pages in heap with space available, but we have ignored them > > because their block number is smaller than the block number on which we have > > lock. > > Doesn't that mean that over time, given a workload that does only or > mostly updates, your records tend to migrate further and further away > from the start of the file, leaving a growing unusable space at the > beginning, until you eventually need to CLUSTER/VACUUM FULL? > Updates should ideally fit in the same page as the old tuple in many of the cases, if fillfactor is properly configured, considering update-mostly loads. Why would the records always migrate further away? They should get the space freed by other updates in intermediate pages. 
I think there could be some impact space-wise, but freed-up space will be eventually used. > I was wondering about speculatively asking for a free page with a > lower block number than the origin page, if one is available, before > locking the origin page. Do you want to lock it as well? In any case, I think adding the code without deciding whether the update can be accommodated in the current page can prove to be costly. > Then after locking the origin page, if it > turns out you need a page but didn't get it earlier, asking for a free > page with a higher block number than the origin page. > Something like that might work out if it is feasible and people agree on pursuing such an approach. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-06-21 08:59:13 +0530, Amit Kapila wrote: > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held > a lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because their block number is smaller than the block number on which we > have lock. I can't see that being acceptable, from a space-usage POV. > > So far the best idea I have - and it's really not a good one - is to > > invent a new hint-bit that tells concurrent updates to acquire a > > heavyweight tuple lock, while releasing the page-level lock. If that > > hint bit does not require any other modifications - and we don't need an > > xid in xmax for this use case - that'll avoid doing all the other > > `already_marked` stuff early, which should address the correctness > > issue. > > > > Don't we need to clear such a flag in case of error? Also don't we need to > reset it later, like when modifying the old page later before WAL. If the flag just says "acquire a heavyweight lock", then there's no need for that. That's cheap enough to just do if it's erroneously set. At least I can't see any reason. > > It's kinda invasive though, and probably has performance > > implications. > > > > Do you see performance implication due to requirement of heavyweight tuple > lock in more cases than now or something else? Because of that, yes. 
Doubling the number of xlog inserts in heap_update would certainly be measurable :(. My guess is that the heavyweight tuple lock approach will be less expensive. Andres
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > Some other ways could be: > > Before releasing the lock on buffer containing old tuple, clear the VM and > visibility info from page and WAL log it. I think this could impact > performance depending on how frequently we need to perform this action. > > Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was > introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the > same in old tuple header before releasing lock on buffer and teach tqual.c > to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as > an indication of an aborted transaction unless it is currently in-progress. > Also, I think we need to clear this flag before WAL logging in heap_update. I also noticed that and wondered whether it was a mistake to take that out. It appears to have been removed as part of the logic to clear away UNDO log support in 11919160, but it may have been an important part of the heap_update protocol. Though (as I mentioned nearby in a reply to Noah) I'm not sure if the tqual.c side which would ignore the unlogged xmax in the event of a badly timed crash was ever implemented. -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 3:29 PM, Amit Kapila wrote: > On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund wrote: >> Well, I think generally nobody seriously looked at actually refactoring >> heap_update(), even though that'd be a good idea. But in this instance, >> the problem seems relatively fundamental: >> >> We need to lock the origin page, to do visibility checks, etc. Then we >> need to figure out the target page. Even disregarding toasting - which >> we could be doing earlier with some refactoring - we're going to have to >> release the page level lock, to lock them in ascending order. Otherwise >> we'll risk kinda likely deadlocks. > > Can we consider to use some strategy to avoid deadlocks without releasing > the lock on old page? Consider if we could have a mechanism such that > RelationGetBufferForTuple() will ensure that it always returns a new buffer > which has targetblock greater than the old block (on which we already held a > lock). I think here tricky part is whether we can get anything like that > from FSM. Also, there could be cases where we need to extend the heap when > there were pages in heap with space available, but we have ignored them > because there block number is smaller than the block number on which we have > lock. Doesn't that mean that over time, given a workload that does only or mostly updates, your records tend to migrate further and further away from the start of the file, leaving a growing unusable space at the beginning, until you eventually need to CLUSTER/VACUUM FULL? I was wondering about speculatively asking for a free page with a lower block number than the origin page, if one is available, before locking the origin page. Then after locking the origin page, if it turns out you need a page but didn't get it earlier, asking for a free page with a higher block number than the origin page. 
-- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 21, 2016 at 1:03 AM, Andres Freund wrote: > > On 2016-06-15 08:56:52 -0400, Robert Haas wrote: > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > > and CTID without logging anything or clearing the all-visible flag and > > then releases the lock on the heap page to go do some more work that > > might even ERROR out. Only if that other work all goes OK do we > > relock the page and perform the WAL-logged actions. > > > > That doesn't seem like a good idea even in existing releases, because > > you've taken a tuple on an all-visible page and made it not > > all-visible, and you've made a page modification that is not > > necessarily atomic without logging it. > > Right, that's broken. > > > > I'm not sure what to do about this: this part of the heap_update() > > logic has been like this forever, and I assume that if it were easy to > > refactor this away, somebody would have done it by now. > > Well, I think generally nobody seriously looked at actually refactoring > heap_update(), even though that'd be a good idea. But in this instance, > the problem seems relatively fundamental: > > We need to lock the origin page, to do visibility checks, etc. Then we > need to figure out the target page. Even disregarding toasting - which > we could be doing earlier with some refactoring - we're going to have to > release the page level lock, to lock them in ascending order. Otherwise > we'll risk kinda likely deadlocks. > Can we consider to use some strategy to avoid deadlocks without releasing the lock on old page? Consider if we could have a mechanism such that RelationGetBufferForTuple() will ensure that it always returns a new buffer which has targetblock greater than the old block (on which we already held a lock). I think here tricky part is whether we can get anything like that from FSM. 
Also, there could be cases where we need to extend the heap when there were pages in heap with space available, but we have ignored them because their block number is smaller than the block number on which we have lock. > We also certainly don't want to nest > the lwlocks for the toast stuff, inside a content lock for the old > tuple's page. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. > Don't we need to clear such a flag in case of error? Also don't we need to reset it later, like when modifying the old page later before WAL. > It's kinda invasive though, and probably has performance > implications. > Do you see performance implication due to requirement of heavyweight tuple lock in more cases than now or something else? Some other ways could be: Before releasing the lock on buffer containing old tuple, clear the VM and visibility info from page and WAL log it. I think this could impact performance depending on how frequently we need to perform this action. Have a new flag like HEAP_XMAX_UNLOGGED (as it was there when this logic was introduced in commit f2bfe8a24c46133f81e188653a127f939eb33c4a ) and set the same in old tuple header before releasing lock on buffer and teach tqual.c to honor the flag. I think tqual.c should consider HEAP_XMAX_UNLOGGED as an indication of an aborted transaction unless it is currently in-progress. Also, I think we need to clear this flag before WAL logging in heap_update. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Fri, Jun 17, 2016 at 3:36 PM, Noah Misch wrote: > I agree the non-atomic, unlogged change is a problem. A related threat > doesn't require a torn page:
>
>   AssignTransactionId() xid=123
>   heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
>   some ERROR before heap_update() finishes
>   rollback;  -- xid=123
>   some backend flushes the modified page
>   immediate shutdown
>   AssignTransactionId() xid=123
>   commit;  -- xid=123
>
> If nothing wrote an xlog record that witnesses xid 123, the cluster can > reassign it after recovery. The failed update is now considered a successful > update, and the row improperly becomes dead. That's important. I wonder if that was originally supposed to be handled with the HEAP_XMAX_UNLOGGED flag which was removed in 11919160. A comment in the heap WAL logging commit f2bfe8a2 said that tqual routines would see the HEAP_XMAX_UNLOGGED flag in the event of a crash before logging (though I'm not sure if the tqual routines ever actually did that). -- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On 2016-06-20 17:55:19 -0400, Robert Haas wrote: > On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: > > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > >> What exactly is the point of all of that already_marked stuff? > > > > Preventing the old tuple from being locked/updated by another backend, > > while unlocking the buffer. > > > >> I > >> mean, suppose we just don't do any of that before we go off to do > >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, > >> when we reacquire the page lock, we might find that somebody else has > >> already updated the tuple, but couldn't that be handled by > >> (approximately) looping back up to l2 just as we do in several other > >> cases? > > > > We'd potentially have to undo a fair amount more work: the toasted data > > would have to be deleted and such, just to retry. Which isn't going to > > super easy, because all of it will be happening with the current cid (we > > can't just increase CommandCounterIncrement() for correctness reasons). > > Why would we have to delete the TOAST data? AFAIUI, the tuple points > to the TOAST data, but not the other way around. So if we change our > mind about where to put the tuple, I don't think that requires > re-TOASTing. Consider what happens if we, after restarting at l2, notice that we can't actually insert, but return in the !HeapTupleMayBeUpdated branch. If the caller doesn't error out - and there's certainly callers doing that - we'd "leak" a toasted datum. Unless the transaction aborts, the toasted datum would never be cleaned up, because there's no datum pointing to it, so no heap_delete will ever recurse into the toast datum (via toast_delete()). Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 4:24 PM, Andres Freund wrote: > On 2016-06-20 16:10:23 -0400, Robert Haas wrote: >> What exactly is the point of all of that already_marked stuff? > > Preventing the old tuple from being locked/updated by another backend, > while unlocking the buffer. > >> I >> mean, suppose we just don't do any of that before we go off to do >> toast_insert_or_update and RelationGetBufferForTuple. Eventually, >> when we reacquire the page lock, we might find that somebody else has >> already updated the tuple, but couldn't that be handled by >> (approximately) looping back up to l2 just as we do in several other >> cases? > > We'd potentially have to undo a fair amount more work: the toasted data > would have to be deleted and such, just to retry. Which isn't going to > super easy, because all of it will be happening with the current cid (we > can't just increase CommandCounterIncrement() for correctness reasons). Why would we have to delete the TOAST data? AFAIUI, the tuple points to the TOAST data, but not the other way around. So if we change our mind about where to put the tuple, I don't think that requires re-TOASTing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Reviewing freeze map code
On 2016-06-20 16:10:23 -0400, Robert Haas wrote: > What exactly is the point of all of that already_marked stuff? Preventing the old tuple from being locked/updated by another backend, while unlocking the buffer. > I > mean, suppose we just don't do any of that before we go off to do > toast_insert_or_update and RelationGetBufferForTuple. Eventually, > when we reacquire the page lock, we might find that somebody else has > already updated the tuple, but couldn't that be handled by > (approximately) looping back up to l2 just as we do in several other > cases? We'd potentially have to undo a fair amount more work: the toasted data would have to be deleted and such, just to retry. Which isn't going to be super easy, because all of it will be happening with the current cid (we can't just increase CommandCounterIncrement() for correctness reasons). Andres
Re: [HACKERS] Reviewing freeze map code
On Mon, Jun 20, 2016 at 3:33 PM, Andres Freund wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > Well, I think generally nobody seriously looked at actually refactoring > heap_update(), even though that'd be a good idea. But in this instance, > the problem seems relatively fundamental: > > We need to lock the origin page, to do visibility checks, etc. Then we > need to figure out the target page. Even disregarding toasting - which > we could be doing earlier with some refactoring - we're going to have to > release the page level lock, to lock them in ascending order. Otherwise > we'll risk kinda likely deadlocks. We also certainly don't want to nest > the lwlocks for the toast stuff, inside a content lock for the old > tupe's page. > > So far the best idea I have - and it's really not a good one - is to > invent a new hint-bit that tells concurrent updates to acquire a > heavyweight tuple lock, while releasing the page-level lock. If that > hint bit does not require any other modifications - and we don't need an > xid in xmax for this use case - that'll avoid doing all the other > `already_marked` stuff early, which should address the correctness > issue. It's kinda invasive though, and probably has performance > implications. > > Does anybody have a better idea? What exactly is the point of all of that already_marked stuff? I mean, suppose we just don't do any of that before we go off to do toast_insert_or_update and RelationGetBufferForTuple. Eventually, when we reacquire the page lock, we might find that somebody else has already updated the tuple, but couldn't that be handled by (approximately) looping back up to l2 just as we do in several other cases? 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On 2016-06-15 08:56:52 -0400, Robert Haas wrote:
> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2,
> and CTID without logging anything or clearing the all-visible flag and
> then releases the lock on the heap page to go do some more work that
> might even ERROR out. Only if that other work all goes OK do we
> relock the page and perform the WAL-logged actions.
>
> That doesn't seem like a good idea even in existing releases, because
> you've taken a tuple on an all-visible page and made it not
> all-visible, and you've made a page modification that is not
> necessarily atomic without logging it.

Right, that's broken.

> I'm not sure what to do about this: this part of the heap_update()
> logic has been like this forever, and I assume that if it were easy to
> refactor this away, somebody would have done it by now.

Well, I think generally nobody seriously looked at actually refactoring heap_update(), even though that'd be a good idea. But in this instance, the problem seems relatively fundamental:

We need to lock the origin page, to do visibility checks, etc. Then we need to figure out the target page. Even disregarding toasting - which we could be doing earlier with some refactoring - we're going to have to release the page-level lock, to lock the pages in ascending order. Otherwise we'll risk fairly likely deadlocks. We also certainly don't want to nest the lwlocks for the toast stuff inside a content lock for the old tuple's page.

So far the best idea I have - and it's really not a good one - is to invent a new hint bit that tells concurrent updates to acquire a heavyweight tuple lock, while releasing the page-level lock. If that hint bit does not require any other modifications - and we don't need an xid in xmax for this use case - that'll avoid doing all the other `already_marked` stuff early, which should address the correctness issue. It's kinda invasive though, and probably has performance implications.

Does anybody have a better idea?
Regards, Andres
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 08:56:52AM -0400, Robert Haas wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > wrote: > > I spent some time chasing down the exact circumstances. I suspect > > that there may be an interlocking problem in heap_update. Using the > > line numbers from cae1c788 [1], I see the following interaction > > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > > in reference to the same block number: > > > > [VACUUM] sets all visible bit > > > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, > > xmax_old_tuple); > > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > > [SELECT] observes VM_ALL_VISIBLE as true > > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > > [SELECT] barfs > > > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. 
I agree the non-atomic, unlogged change is a problem. A related threat doesn't require a torn page:

    AssignTransactionId() xid=123
    heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, 123);
    some ERROR before heap_update() finishes
    rollback;  -- xid=123
    some backend flushes the modified page
    immediate shutdown
    AssignTransactionId() xid=123
    commit;  -- xid=123

If nothing wrote an xlog record that witnesses xid 123, the cluster can reassign it after recovery. The failed update is now considered a successful update, and the row improperly becomes dead. That's important. I don't know whether the 9.6 all-frozen mechanism materially amplifies the consequences of this bug. The interaction with the visibility map and freeze map is not all bad; indeed, it can reduce the risk of experiencing consequences from the non-atomic, unlogged change bug. If the row is all-visible when heap_update() starts, every transaction should continue to consider the row visible until heap_update() finishes successfully. If an ERROR interrupts heap_update(), visibility verdicts should be as though the heap_update() never happened. If one of the previously-described mechanisms would make an xmax visibility test give the wrong answer, an all-visible bit could mask the problem for a while. Having said that, the freeze map hurts in scenarios involving toast_insert_or_update() failures and no crash recovery. Instead of VACUUM cleaning up the aborted xmax, that xmax could persist long enough for its xid to be reused in a successful transaction. When some other modification finally clears all-frozen and all-visible, the row improperly becomes dead. Both scenarios are fairly rare; I don't know which is more rare. [Disclaimer: I have not built test cases to verify those alleged failure mechanisms.] If we made this pre-9.6 bug a 9.6 open item, would anyone volunteer to own it? Then we wouldn't need to guess whether 9.6 will be safer with the freeze map or safer without the freeze map.
Thanks, nm
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 9:59 AM, Amit Kapila wrote:
>> That just kicks the can down the road. Then you have PD_ALL_VISIBLE
>> clear but the VM bit is still set.
>
> I mean to say clear both as we are doing currently in code:
> if (PageIsAllVisible(BufferGetPage(buffer)))
> {
>     all_visible_cleared = true;
>     PageClearAllVisible(BufferGetPage(buffer));
>     visibilitymap_clear(relation, BufferGetBlockNumber(buffer),
>                         vmbuffer);
> }

Sure, but without emitting a WAL record, that's just broken. You could have the heap page get flushed to disk and the VM page not get flushed to disk, and then crash, and now you have the classic VM corruption scenario.

>> And you still haven't WAL-logged
>> anything.
>
> Yeah, I think WAL requirement is more difficult to meet and I think
> releasing the lock on buffer before writing WAL could lead to flush of such
> a buffer before WAL.
>
> I feel this is an existing-bug and should go to Older Bugs Section in open
> items page.

It does seem to be an existing bug, but the freeze map makes the problem more serious, I think.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 10:03 AM, Masahiko Sawada wrote: >> I'm not sure what to do about this: this part of the heap_update() >> logic has been like this forever, and I assume that if it were easy to >> refactor this away, somebody would have done it by now. > > How about changing collect_corrupt_items to acquire > AccessExclusiveLock for safely checking? Well, that would make it a lot less likely for pg_check_{visible,frozen} to detect the bug in heap_update(), but it wouldn't fix the bug in heap_update(). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 9:56 PM, Robert Haas wrote: > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > wrote: >> I spent some time chasing down the exact circumstances. I suspect >> that there may be an interlocking problem in heap_update. Using the >> line numbers from cae1c788 [1], I see the following interaction >> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> in reference to the same block number: >> >> [VACUUM] sets all visible bit >> >> [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, >> xmax_old_tuple); >> [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> >> [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> [SELECT] observes VM_ALL_VISIBLE as true >> [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> [SELECT] barfs >> >> [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. Only if that other work all goes OK do we > relock the page and perform the WAL-logged actions. > > That doesn't seem like a good idea even in existing releases, because > you've taken a tuple on an all-visible page and made it not > all-visible, and you've made a page modification that is not > necessarily atomic without logging it. This is is particularly bad in > 9.6, because if that page is also all-frozen then XMAX will eventually > be pointing into space and VACUUM will never visit the page to > re-freeze it the way it would have done in earlier releases. However, > even in older releases, I think there's a remote possibility of data > corruption. Backend #1 makes these changes to the page, releases the > lock, and errors out. Backend #2 writes the page to the OS. DBA > takes a hot backup, tearing the page in the middle of XMAX. Oops. 
> > I'm not sure what to do about this: this part of the heap_update() > logic has been like this forever, and I assume that if it were easy to > refactor this away, somebody would have done it by now. > How about changing collect_corrupt_items to acquire AccessExclusiveLock for safely checking? Regards, -- Masahiko Sawada
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 7:13 PM, Robert Haas wrote: > > On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila wrote: > > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas wrote: > >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > >> wrote: > >> > I spent some time chasing down the exact circumstances. I suspect > >> > that there may be an interlocking problem in heap_update. Using the > >> > line numbers from cae1c788 [1], I see the following interaction > >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > >> > in reference to the same block number: > >> > > >> > [VACUUM] sets all visible bit > >> > > >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, > >> > xmax_old_tuple); > >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > >> > > >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > >> > [SELECT] observes VM_ALL_VISIBLE as true > >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > >> > [SELECT] barfs > >> > > >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > >> > >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > >> and CTID without logging anything or clearing the all-visible flag and > >> then releases the lock on the heap page to go do some more work that > >> might even ERROR out. > > > > Can't we clear the all-visible flag before releasing the lock? We can use > > logic of already_marked as it is currently used in code to clear it just > > once. > > That just kicks the can down the road. Then you have PD_ALL_VISIBLE > clear but the VM bit is still set. I mean to say clear both as we are doing currently in code: if (PageIsAllVisible(BufferGetPage(buffer))) { all_visible_cleared = true; PageClearAllVisible(BufferGetPage(buffer)); visibilitymap_clear(relation, BufferGetBlockNumber(buffer), vmbuffer); } > > And you still haven't WAL-logged > anything. 
> 

Yeah, I think the WAL requirement is more difficult to meet, and I think releasing the lock on the buffer before writing WAL could lead to a flush of such a buffer before the WAL. I feel this is an existing bug and should go to the Older Bugs section of the open items page.

With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 9:43 AM, Amit Kapila wrote: > On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas wrote: >> On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro >> wrote: >> > I spent some time chasing down the exact circumstances. I suspect >> > that there may be an interlocking problem in heap_update. Using the >> > line numbers from cae1c788 [1], I see the following interaction >> > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all >> > in reference to the same block number: >> > >> > [VACUUM] sets all visible bit >> > >> > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, >> > xmax_old_tuple); >> > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); >> > >> > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); >> > [SELECT] observes VM_ALL_VISIBLE as true >> > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state >> > [SELECT] barfs >> > >> > [UPDATE] heapam.c:4116 visibilitymap_clear(...) >> >> Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, >> and CTID without logging anything or clearing the all-visible flag and >> then releases the lock on the heap page to go do some more work that >> might even ERROR out. > > Can't we clear the all-visible flag before releasing the lock? We can use > logic of already_marked as it is currently used in code to clear it just > once. That just kicks the can down the road. Then you have PD_ALL_VISIBLE clear but the VM bit is still set. And you still haven't WAL-logged anything. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 6:26 PM, Robert Haas wrote: > > On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro > wrote: > > I spent some time chasing down the exact circumstances. I suspect > > that there may be an interlocking problem in heap_update. Using the > > line numbers from cae1c788 [1], I see the following interaction > > between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all > > in reference to the same block number: > > > > [VACUUM] sets all visible bit > > > > [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple); > > [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK); > > > > [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE); > > [SELECT] observes VM_ALL_VISIBLE as true > > [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state > > [SELECT] barfs > > > > [UPDATE] heapam.c:4116 visibilitymap_clear(...) > > Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, > and CTID without logging anything or clearing the all-visible flag and > then releases the lock on the heap page to go do some more work that > might even ERROR out. > Can't we clear the all-visible flag before releasing the lock? We can use logic of already_marked as it is currently used in code to clear it just once. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 2:41 AM, Thomas Munro wrote:
> I spent some time chasing down the exact circumstances. I suspect
> that there may be an interlocking problem in heap_update. Using the
> line numbers from cae1c788 [1], I see the following interaction
> between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all
> in reference to the same block number:
>
> [VACUUM] sets all visible bit
>
> [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
> [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
>
> [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
> [SELECT] observes VM_ALL_VISIBLE as true
> [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
> [SELECT] barfs
>
> [UPDATE] heapam.c:4116 visibilitymap_clear(...)

Yikes: heap_update() sets the tuple's XMAX, CMAX, infomask, infomask2, and CTID without logging anything or clearing the all-visible flag and then releases the lock on the heap page to go do some more work that might even ERROR out. Only if that other work all goes OK do we relock the page and perform the WAL-logged actions.

That doesn't seem like a good idea even in existing releases, because you've taken a tuple on an all-visible page and made it not all-visible, and you've made a page modification that is not necessarily atomic without logging it. This is particularly bad in 9.6, because if that page is also all-frozen then XMAX will eventually be pointing into space and VACUUM will never visit the page to re-freeze it the way it would have done in earlier releases. However, even in older releases, I think there's a remote possibility of data corruption. Backend #1 makes these changes to the page, releases the lock, and errors out. Backend #2 writes the page to the OS. The DBA takes a hot backup, tearing the page in the middle of XMAX. Oops.
I'm not sure what to do about this: this part of the heap_update() logic has been like this forever, and I assume that if it were easy to refactor this away, somebody would have done it by now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 11:43 AM, Thomas Munro wrote: > On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas wrote: >> On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas wrote: > I noticed that the tuples that it reported were always offset 1 in a > page, and that the page always had a maxoff over a couple of hundred, > and that we called record_corrupt_item because VM_ALL_VISIBLE returned > true but HeapTupleSatisfiesVacuum on the first tuple returned > HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. > It did that because HEAP_XMAX_COMMITTED was not set and > TransactionIdIsInProgress returned true for xmax. So this seems like it might be a visibility map bug rather than a bug in the test code, but I'm not completely sure of that. How was it legitimate to mark the page as all-visible if a tuple on the page still had a live xmax? If xmax is live and not just a locker then the tuple is not visible to the transaction that wrote xmax, at least. >>> >>> Ah, wait a minute. I see how this could happen. Hang on, let me >>> update the pg_visibility patch. >> >> The problem should be fixed in the attached revision of >> pg_check_visible. I think what happened is: >> >> 1. pg_check_visible computed an OldestXmin. >> 2. Some transaction committed. >> 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. >> 4. pg_check_visible then used its older OldestXmin to check the >> visibility of tuples on that page, and saw delete-in-progress as a >> result. >> >> I added a guard against a similar scenario involving xmin in the last >> version of this patch, but forgot that we need to protect xmax in the >> same way. With this version of the patch, I can no longer get any >> TIDs to pop up out of pg_check_visible in my testing. (I haven't run >> your test script for lack of the proper Python environment...) > > I can still reproduce the problem with this new patch. 
What I see is > that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax > are all the same.

I spent some time chasing down the exact circumstances. I suspect that there may be an interlocking problem in heap_update. Using the line numbers from cae1c788 [1], I see the following interaction between the VACUUM, UPDATE and SELECT (pg_check_visible) backends, all in reference to the same block number:

    [VACUUM] sets all visible bit

    [UPDATE] heapam.c:3931 HeapTupleHeaderSetXmax(oldtup.t_data, xmax_old_tuple);
    [UPDATE] heapam.c:3938 LockBuffer(buffer, BUFFER_LOCK_UNLOCK);

    [SELECT] LockBuffer(buffer, BUFFER_LOCK_SHARE);
    [SELECT] observes VM_ALL_VISIBLE as true
    [SELECT] observes tuple in HEAPTUPLE_DELETE_IN_PROGRESS state
    [SELECT] barfs

    [UPDATE] heapam.c:4116 visibilitymap_clear(...)

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/heap/heapam.c;hb=cae1c788b9b43887e4a4fa51a11c3a8ffa334939

-- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Wed, Jun 15, 2016 at 12:44 AM, Robert Haas wrote: > On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas wrote: I noticed that the tuples that it reported were always offset 1 in a page, and that the page always had a maxoff over a couple of hundred, and that we called record_corrupt_item because VM_ALL_VISIBLE returned true but HeapTupleSatisfiesVacuum on the first tuple returned HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. It did that because HEAP_XMAX_COMMITTED was not set and TransactionIdIsInProgress returned true for xmax. >>> >>> So this seems like it might be a visibility map bug rather than a bug >>> in the test code, but I'm not completely sure of that. How was it >>> legitimate to mark the page as all-visible if a tuple on the page >>> still had a live xmax? If xmax is live and not just a locker then the >>> tuple is not visible to the transaction that wrote xmax, at least. >> >> Ah, wait a minute. I see how this could happen. Hang on, let me >> update the pg_visibility patch. > > The problem should be fixed in the attached revision of > pg_check_visible. I think what happened is: > > 1. pg_check_visible computed an OldestXmin. > 2. Some transaction committed. > 3. VACUUM computed a newer OldestXmin and marked a page all-visible with it. > 4. pg_check_visible then used its older OldestXmin to check the > visibility of tuples on that page, and saw delete-in-progress as a > result. > > I added a guard against a similar scenario involving xmin in the last > version of this patch, but forgot that we need to protect xmax in the > same way. With this version of the patch, I can no longer get any > TIDs to pop up out of pg_check_visible in my testing. (I haven't run > your test script for lack of the proper Python environment...) I can still reproduce the problem with this new patch. What I see is that the OldestXmin, the new RecomputedOldestXmin and the tuple's xmax are all the same. 
-- Thomas Munro http://www.enterprisedb.com
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 14, 2016 at 8:11 AM, Robert Haas wrote:
>>> I noticed that the tuples that it reported were always offset 1 in a
>>> page, and that the page always had a maxoff over a couple of hundred,
>>> and that we called record_corrupt_item because VM_ALL_VISIBLE returned
>>> true but HeapTupleSatisfiesVacuum on the first tuple returned
>>> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE.
>>> It did that because HEAP_XMAX_COMMITTED was not set and
>>> TransactionIdIsInProgress returned true for xmax.
>>
>> So this seems like it might be a visibility map bug rather than a bug
>> in the test code, but I'm not completely sure of that. How was it
>> legitimate to mark the page as all-visible if a tuple on the page
>> still had a live xmax? If xmax is live and not just a locker then the
>> tuple is not visible to the transaction that wrote xmax, at least.
>
> Ah, wait a minute. I see how this could happen. Hang on, let me
> update the pg_visibility patch.

The problem should be fixed in the attached revision of pg_check_visible. I think what happened is:

1. pg_check_visible computed an OldestXmin.
2. Some transaction committed.
3. VACUUM computed a newer OldestXmin and marked a page all-visible with it.
4. pg_check_visible then used its older OldestXmin to check the visibility of tuples on that page, and saw delete-in-progress as a result.

I added a guard against a similar scenario involving xmin in the last version of this patch, but forgot that we need to protect xmax in the same way. With this version of the patch, I can no longer get any TIDs to pop up out of pg_check_visible in my testing. (I haven't run your test script for lack of the proper Python environment...)
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company From 18815b0d6fcfc2048e47f104ef85ee981687d4de Mon Sep 17 00:00:00 2001 From: Robert Haas Date: Fri, 10 Jun 2016 14:42:46 -0400 Subject: [PATCH 1/2] Add integrity-checking functions to pg_visibility. The new pg_check_visible() and pg_check_frozen() functions can be used to verify that the visibility map bits for a relation's data pages match the actual state of the tuples on those pages. Amit Kapila and Robert Haas, reviewed by Andres Freund. Additional testing help by Thomas Munro. --- contrib/pg_visibility/Makefile| 2 +- contrib/pg_visibility/pg_visibility--1.0--1.1.sql | 17 ++ contrib/pg_visibility/pg_visibility--1.0.sql | 52 contrib/pg_visibility/pg_visibility--1.1.sql | 67 + contrib/pg_visibility/pg_visibility.c | 313 ++ contrib/pg_visibility/pg_visibility.control | 2 +- doc/src/sgml/pgvisibility.sgml| 28 +- src/tools/pgindent/typedefs.list | 1 + 8 files changed, 427 insertions(+), 55 deletions(-) create mode 100644 contrib/pg_visibility/pg_visibility--1.0--1.1.sql delete mode 100644 contrib/pg_visibility/pg_visibility--1.0.sql create mode 100644 contrib/pg_visibility/pg_visibility--1.1.sql diff --git a/contrib/pg_visibility/Makefile b/contrib/pg_visibility/Makefile index fbbaa2e..379591a 100644 --- a/contrib/pg_visibility/Makefile +++ b/contrib/pg_visibility/Makefile @@ -4,7 +4,7 @@ MODULE_big = pg_visibility OBJS = pg_visibility.o $(WIN32RES) EXTENSION = pg_visibility -DATA = pg_visibility--1.0.sql +DATA = pg_visibility--1.1.sql pg_visibility--1.0--1.1.sql PGFILEDESC = "pg_visibility - page visibility information" ifdef USE_PGXS diff --git a/contrib/pg_visibility/pg_visibility--1.0--1.1.sql b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql new file mode 100644 index 000..2c97dfd --- /dev/null +++ b/contrib/pg_visibility/pg_visibility--1.0--1.1.sql @@ -0,0 +1,17 @@ +/* contrib/pg_visibility/pg_visibility--1.0--1.1.sql */ + +-- complain if script is sourced in psql, 
rather than via ALTER EXTENSION +\echo Use "ALTER EXTENSION pg_visibility UPDATE TO '1.1'" to load this file. \quit + +CREATE FUNCTION pg_check_frozen(regclass, t_ctid OUT tid) +RETURNS SETOF tid +AS 'MODULE_PATHNAME', 'pg_check_frozen' +LANGUAGE C STRICT; + +CREATE FUNCTION pg_check_visible(regclass, t_ctid OUT tid) +RETURNS SETOF tid +AS 'MODULE_PATHNAME', 'pg_check_visible' +LANGUAGE C STRICT; + +REVOKE ALL ON FUNCTION pg_check_frozen(regclass) FROM PUBLIC; +REVOKE ALL ON FUNCTION pg_check_visible(regclass) FROM PUBLIC; diff --git a/contrib/pg_visibility/pg_visibility--1.0.sql b/contrib/pg_visibility/pg_visibility--1.0.sql deleted file mode 100644 index da511e5..000 --- a/contrib/pg_visibility/pg_visibility--1.0.sql +++ /dev/null @@ -1,52 +0,0 @@ -/* contrib/pg_visibility/pg_visibility--1.0.sql */ - --- complain if script is sourced in psql, rather than via CREATE EXTENSION -\echo Use "CREATE EXTENSION pg_visibility" to load this file. \quit - --- Show visibility map information. -CREATE FUNCTION pg_visibility_map(regclass, blkno bigint, - all_visible OUT boolean, - all_frozen OUT
Re: [HACKERS] Reviewing freeze map code
On Tue, Jun 14, 2016 at 8:08 AM, Robert Haas wrote: > On Tue, Jun 14, 2016 at 2:53 AM, Thomas Munro > wrote: >> On Tue, Jun 14, 2016 at 4:02 AM, Robert Haas wrote: >>> On Sat, Jun 11, 2016 at 5:00 PM, Robert Haas wrote: > How about changing the return tuple of heap_prepare_freeze_tuple to > a bitmap? Two flags: "Freeze [not] done" and "[No] more freezing > needed" Yes, I think something like that sounds about right. >>> >>> Here's a patch. I took the approach of adding a separate bool out >>> parameter instead. I am also attaching an update of the >>> check-visibility patch which responds to assorted review comments and >>> adjusting it for the problems found on Friday which could otherwise >>> lead to false positives. I'm still getting occasional TIDs from the >>> pg_check_visible() function during pgbench runs, though, so evidently >>> not all is well with the world. >> >> I'm still working out how half this stuff works, but I managed to get >> pg_check_visible() to spit out a row every few seconds with the >> following brute force approach: >> >> CREATE TABLE foo (n int); >> INSERT INTO foo SELECT generate_series(1, 10); >> >> Three client threads (see attached script): >> 1. Run VACUUM in a tight loop. >> 2. Run UPDATE foo SET n = n + 1 in a tight loop. >> 3. Run SELECT pg_check_visible('foo'::regclass) in a tight loop, and >> print out any rows it produces. >> >> I noticed that the tuples that it reported were always offset 1 in a >> page, and that the page always had a maxoff over a couple of hundred, >> and that we called record_corrupt_item because VM_ALL_VISIBLE returned >> true but HeapTupleSatisfiesVacuum on the first tuple returned >> HEAPTUPLE_DELETE_IN_PROGRESS instead of the expected HEAPTUPLE_LIVE. >> It did that because HEAP_XMAX_COMMITTED was not set and >> TransactionIdIsInProgress returned true for xmax. > > So this seems like it might be a visibility map bug rather than a bug > in the test code, but I'm not completely sure of that. 
How was it > legitimate to mark the page as all-visible if a tuple on the page > still had a live xmax? If xmax is live and not just a locker then the > tuple is not visible to the transaction that wrote xmax, at least. Ah, wait a minute. I see how this could happen. Hang on, let me update the pg_visibility patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company