Re: [HACKERS] crash-safe visibility map, take five
On Fri, Jun 24, 2011 at 1:43 PM, Jeff Davis wrote: >> And anything that is WAL-logged must obey the WAL-before-data rule. >> We have a system already that ensures that when >> synchronous_commit=off, CLOG pages can't be flushed before the >> corresponding WAL record makes it to disk. > > In this case, how do you prevent the PD_ALL_VISIBLE from making it to > disk if you never bumped the LSN when it was set? It seems like you just > don't have the information to do so, and it seems like the information > required would be variable in size. Well, I think that would be a problem for the hypothetical implementer of the persistent snapshot feature. :-) More seriously, Heikki and I previously discussed creating some systematic method for suppressing FPIs when they are not needed, perhaps by using a bit in the page header to indicate whether an FPI has been generated since the last checkpoint. I think it would be nice to have such a system, but since we don't have a clear agreement either that it's a good idea or what we'd do after that, I'm not inclined to invest time in it. To really get any benefit out of a change in that area, we'd need probably need to (a) remove the LSN interlocks that prevent changes from being replayed if the LSN of the page has already advanced beyond the record LSN and (b) change at least some of XLOG_HEAP_{INSERT,UPDATE,DELETE} to be idempotent. But if we went in that direction then that might help to regularize some of this and make it a bit less ad-hoc. > I didn't mean to make this conversation quite so hypothetical. My > primary points are: > > 1. Sometimes it makes sense to break the typical WAL conventions for > performance reasons. But when we do so, we have to be quite careful, > because things get complicated quickly. Yes. > 2. PD_ALL_VISIBLE is a little bit more complex than other hint bits, > because the conditions under which it may be set are more complex > (having to do with both snapshots and cleanup actions). Other hint bits > are based only on transaction status: either the WAL for that > transaction completion got flushed (and is therefore permanent), and we > set the hint bit; or it didn't get flushed and we don't. I think the term "hint bits" really shouldn't be applied to anything other than HEAP_{XMIN,XMAX}_{COMMITTED,INVALID}. Otherwise, we get into confusing territory pretty quickly. Our algorithm for opportunistically killing index entries pointing to dead tuples is not WAL-logged, but it involves more than a single bit. OTOH, clearing of the PD_ALL_VISIBLE bit has always been WAL-logged, so lumping that in with HEAP_XMIN_COMMITTED is pretty misleading. > Just having this discussion has been good enough for me to get a better > idea what's going on, so if you think the comments are sufficient that's > OK with me. I'm not 100% certain they are, but let's wait and see if anyone else wants to weigh in... please do understand I'm not trying to be a pain in the neck. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Thu, 2011-06-23 at 22:02 -0400, Robert Haas wrote: > > 1. INSERT to a new page, marking it with LSN X > > 2. WAL flushed to LSN Y (Y > X) > > 2. Some persistent snapshot (that doesn't see the INSERT) is released, > > and generates WAL recording that fact with LSN Z (Z > Y) > > 3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE > > 4. page is written out because LSN is still X > > 5. crash > I don't really think that's a separate set of assumptions - if we had > some system whereby snapshots could survive a crash, then they'd have > to be WAL-logged (because that's how we make things survive crashes). In the above example, it is WAL-logged (with LSN Z). > And anything that is WAL-logged must obey the WAL-before-data rule. > We have a system already that ensures that when > synchronous_commit=off, CLOG pages can't be flushed before the > corresponding WAL record makes it to disk. In this case, how do you prevent the PD_ALL_VISIBLE from making it to disk if you never bumped the LSN when it was set? It seems like you just don't have the information to do so, and it seems like the information required would be variable in size. > I guess the point you are driving at here is that a page can only go > from being all-visible to not-all-visible by virtue of being modified. > There's no other piece of state (like a persistent snapshot) that can > be lost as part of a crash that would make us need change our mind and > decide that an all-visible XID is really not all-visible after all. > (The reverse is not true: since snapshots are ephemeral, a crash will > render every row either all-visible or dead.) I guess I never thought > about documenting that particular aspect of it because (to me) it > seems fairly self-evident. Maybe I'm wrong... I didn't mean to make this conversation quite so hypothetical. My primary points are: 1. Sometimes it makes sense to break the typical WAL conventions for performance reasons. But when we do so, we have to be quite careful, because things get complicated quickly. 2. PD_ALL_VISIBLE is a little bit more complex than other hint bits, because the conditions under which it may be set are more complex (having to do with both snapshots and cleanup actions). Other hint bits are based only on transaction status: either the WAL for that transaction completion got flushed (and is therefore permanent), and we set the hint bit; or it didn't get flushed and we don't. Just having this discussion has been good enough for me to get a better idea what's going on, so if you think the comments are sufficient that's OK with me. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Thu, Jun 23, 2011 at 6:40 PM, Jeff Davis wrote: > On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote: >> Lazy VACUUM is the only thing that makes a page all visible. I don't >> understand the part about snapshots. > > Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. > > After an INSERT to a new page, and after all snapshots are released, the > page becomes all-visible; and thus subject to being marked with > PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there > is no cleanup action that takes place here, so nothing else will bump > the LSN either. > > So, let's say that we hypothetically had persistent snapshots, then > you'd have the following problem: > > 1. INSERT to a new page, marking it with LSN X > 2. WAL flushed to LSN Y (Y > X) > 2. Some persistent snapshot (that doesn't see the INSERT) is released, > and generates WAL recording that fact with LSN Z (Z > Y) > 3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE > 4. page is written out because LSN is still X > 5. crash > > Now, the persistent snapshot is still present because LSN Z never made > it to disk; but the page is marked with PD_ALL_VISIBLE. > > Sure, if these hypothetical persistent snapshots were transactional, and > if synchronous_commit is on, then LSN Z would be flushed before step 3; > but that's another set of assumptions. That's why I left it simple and > said that the assumption was "snapshots are released if there's a > crash". I don't really think that's a separate set of assumptions - if we had some system whereby snapshots could survive a crash, then they'd have to be WAL-logged (because that's how we make things survive crashes). And anything that is WAL-logged must obey the WAL-before-data rule. We have a system already that ensures that when synchronous_commit=off, CLOG pages can't be flushed before the corresponding WAL record makes it to disk. For a system like what you're describing, you'd need something similar - these crash-surviving snapshots would have to make sure that no action which depended on their state hit the disk before the WAL record marking the state change hit the disk. I guess the point you are driving at here is that a page can only go from being all-visible to not-all-visible by virtue of being modified. There's no other piece of state (like a persistent snapshot) that can be lost as part of a crash that would make us need change our mind and decide that an all-visible XID is really not all-visible after all. (The reverse is not true: since snapshots are ephemeral, a crash will render every row either all-visible or dead.) I guess I never thought about documenting that particular aspect of it because (to me) it seems fairly self-evident. Maybe I'm wrong... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Thu, 2011-06-23 at 18:18 -0400, Robert Haas wrote: > Lazy VACUUM is the only thing that makes a page all visible. I don't > understand the part about snapshots. Lazy VACUUM is the only thing that _marks_ a page with PD_ALL_VISIBLE. After an INSERT to a new page, and after all snapshots are released, the page becomes all-visible; and thus subject to being marked with PD_ALL_VISIBLE by lazy vacuum without bumping the LSN. Note that there is no cleanup action that takes place here, so nothing else will bump the LSN either. So, let's say that we hypothetically had persistent snapshots, then you'd have the following problem: 1. INSERT to a new page, marking it with LSN X 2. WAL flushed to LSN Y (Y > X) 2. Some persistent snapshot (that doesn't see the INSERT) is released, and generates WAL recording that fact with LSN Z (Z > Y) 3. Lazy VACUUM marks the newly all-visible page with PD_ALL_VISIBLE 4. page is written out because LSN is still X 5. crash Now, the persistent snapshot is still present because LSN Z never made it to disk; but the page is marked with PD_ALL_VISIBLE. Sure, if these hypothetical persistent snapshots were transactional, and if synchronous_commit is on, then LSN Z would be flushed before step 3; but that's another set of assumptions. That's why I left it simple and said that the assumption was "snapshots are released if there's a crash". Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Wed, Jun 22, 2011 at 10:40 PM, Jeff Davis wrote: > 1. Torn pages -- not a problem because it's a single bit and idempotent. Right. > 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing > an action that makes the page all-visible. Depending on what action > makes a page all-visible: > a. An old snapshot is released -- not a problem, because if there is a > crash all snapshots are released. > b. Cleanup action on the page -- not a problem, because that will > create a WAL record and update the page's LSN before setting the > PD_ALL_VISIBLE. Lazy VACUUM is the only thing that makes a page all visible. I don't understand the part about snapshots. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Wed, 2011-06-22 at 21:53 -0400, Robert Haas wrote: > On Wed, Jun 22, 2011 at 8:53 PM, Jeff Davis wrote: > > On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote: > >> 2. In the words of a comment added by the patch: > >> * The critical integrity requirement here is that we must never end up > >> with > >> * a situation where the visibility map bit is set, and the page-level > >> * PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent > >> * page modification would fail to clear the visibility map bit. > >> It does this by WAL-logging the operation of setting a vm bit. This also > >> has > >> the benefit of getting vm bits set correctly on standbys. > > > > In the same function, there is also the comment: > > > > "We don't bump the LSN of the heap page when setting the visibility > > map bit, because that would generate an unworkable volume of > > full-page writes. This exposes us to torn page hazards, but since > > we're not inspecting the existing page contents in any way, we > > don't care." > > > > It would be nice to have a comment explaining why that is safe with > > respect to the WAL-before-data rule. Obviously torn pages aren't much of > > a problem, because it's a single bit and completely idempotent. > > That's exactly what I was trying to explain in the comment you cite. > Feel free to propose a specific change... Well, I was a little unsure, but here's my attempt: The potential problems are: 1. Torn pages -- not a problem because it's a single bit and idempotent. 2. PD_ALL_VISIBLE bit makes it to disk before a WAL record representing an action that makes the page all-visible. Depending on what action makes a page all-visible: a. An old snapshot is released -- not a problem, because if there is a crash all snapshots are released. b. Cleanup action on the page -- not a problem, because that will create a WAL record and update the page's LSN before setting the PD_ALL_VISIBLE. First of all, is that correct? Second, are there other cases to consider? Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Wed, Jun 22, 2011 at 8:53 PM, Jeff Davis wrote: > On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote: >> 2. In the words of a comment added by the patch: >> * The critical integrity requirement here is that we must never end up with >> * a situation where the visibility map bit is set, and the page-level >> * PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent >> * page modification would fail to clear the visibility map bit. >> It does this by WAL-logging the operation of setting a vm bit. This also has >> the benefit of getting vm bits set correctly on standbys. > > In the same function, there is also the comment: > > "We don't bump the LSN of the heap page when setting the visibility > map bit, because that would generate an unworkable volume of > full-page writes. This exposes us to torn page hazards, but since > we're not inspecting the existing page contents in any way, we > don't care." > > It would be nice to have a comment explaining why that is safe with > respect to the WAL-before-data rule. Obviously torn pages aren't much of > a problem, because it's a single bit and completely idempotent. That's exactly what I was trying to explain in the comment you cite. Feel free to propose a specific change... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Thu, 2011-06-16 at 23:17 -0400, Noah Misch wrote: > 2. In the words of a comment added by the patch: > * The critical integrity requirement here is that we must never end up with > * a situation where the visibility map bit is set, and the page-level > * PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent > * page modification would fail to clear the visibility map bit. > It does this by WAL-logging the operation of setting a vm bit. This also has > the benefit of getting vm bits set correctly on standbys. In the same function, there is also the comment: "We don't bump the LSN of the heap page when setting the visibility map bit, because that would generate an unworkable volume of full-page writes. This exposes us to torn page hazards, but since we're not inspecting the existing page contents in any way, we don't care." It would be nice to have a comment explaining why that is safe with respect to the WAL-before-data rule. Obviously torn pages aren't much of a problem, because it's a single bit and completely idempotent. Regards, Jeff Davis -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Sat, Jun 18, 2011 at 10:41 PM, Noah Misch wrote: > This version of the patch has some bogus hunks in int.c, int8.c, and heap.c > reverting a few of your other recent commits. Woops, good catch. Corrected version attached, in case anyone else wants to take a look at it. > Looks good otherwise. I've > marked the patch "Ready for Committer". Thanks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company visibility-map-v7.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Sat, Jun 18, 2011 at 10:04:52PM -0400, Robert Haas wrote: > On Sat, Jun 18, 2011 at 9:37 AM, Noah Misch wrote: > > Indeed. ?If we conflicted on the xmin of the most-recent all-visible tuple > > when setting the bit, all-visible on the standby would become a superset of > > all-visible on the master, and we could cease to ignore the bits. ?This is > > an > > excellent trade-off for masters that do UPDATE and DELETE, because they're > > already conflicting (or using avoidance measures) on similar xids at VACUUM > > time. ?(Some INSERT-only masters will disfavor the trade-off, but we could > > placate them with a GUC. ?On the other hand, hot_standby_feedback works and > > costs an INSERT-only master so little.) > > I hadn't really considered the possibility that making more things > conflict on the standby server could work out to a win. I think that > would need some careful testing, and I don't want to get tied up in > that for purposes of this patch. There is no law against a WAL format > change, so we can always do it later if someone wants to do the > legwork. Fair enough. > > I ran one repetition of my benchmark from last report, and the amount of WAL > > has not changed. ?Given that, I think the previous runs remain valid. > > As far as I can see, the upshot of those benchmarks was that more WAL > was generated, but the time to execute vacuum didn't really change. I > think that's going to be pretty typical. In your test, the additional > WAL amounted to about 0.6% of the amount of data VACUUM is dirtying, > which I think is pretty much in the noise, assuming wal_buffers and > checkpoint_segments are tuned to suitable values. I think that's about right. The timings were too unstable to conclude much, so I've focused on the WAL volume, which isn't worrisome. > The other thing to worry about from a performance view is whether the > push-ups required to relocate the clear-the-vm-bits logic to within > the critical sections is going to have a significant impact on > ordinary insert/update/delete operations. I was a bit unsure at first > about Heikki's contention that it wouldn't matter, but after going > through the exercise of writing the code I think I see now why he > wasn't concerned. I agree; that part of the cost looks unmeasurable. > The only possibly-noticeable overhead comes from > the possibility that we might decide not to pin the visibility map > buffer before locking the page(s) we're operating on, and need to > unlock-pin-and-relock. But for that to happen, VACUUM's got to buzz > through and mark the page all-visible just around the time that we > examine bit. And I have a tough time imagining that happening very > often. You have to have a page that has been modified since the last > VACUUM, but not too recently (otherwise it's not actually all-visible) > and then VACUUM has to get there and process it at almost the same > moment that someone decides to make another modification. It's hard > even to think about an artificial test case that would hit that > situation with any regularity. I can't see it happening very often, either. This version of the patch has some bogus hunks in int.c, int8.c, and heap.c reverting a few of your other recent commits. Looks good otherwise. I've marked the patch "Ready for Committer". Thanks, nm -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Sat, Jun 18, 2011 at 9:37 AM, Noah Misch wrote: > Indeed. If we conflicted on the xmin of the most-recent all-visible tuple > when setting the bit, all-visible on the standby would become a superset of > all-visible on the master, and we could cease to ignore the bits. This is an > excellent trade-off for masters that do UPDATE and DELETE, because they're > already conflicting (or using avoidance measures) on similar xids at VACUUM > time. (Some INSERT-only masters will disfavor the trade-off, but we could > placate them with a GUC. On the other hand, hot_standby_feedback works and > costs an INSERT-only master so little.) I hadn't really considered the possibility that making more things conflict on the standby server could work out to a win. I think that would need some careful testing, and I don't want to get tied up in that for purposes of this patch. There is no law against a WAL format change, so we can always do it later if someone wants to do the legwork. > No problem. I ran these test cases with the new patch: > > -Description- -Expected bits- > -Result- > set bit, VACUUM commit, crash 1,5 > 1,5 > set bit, crash 0,1 > 0,1 > set bit, flush heap page, crash 0,5 > 0,5 > set bit, flush vm page, crash 1,5 > 1,5 > > xlog flush request locations look correct, too. Overall, looking good. Do > you know of other notable cases to check? The last two were somewhat tricky > to inject; if anyone wishes to reproduce them, I'm happy to share my recipe. Those sound like interesting recipes... > One more thing came to mind: don't we need a critical section inside the "if" > block in visibilitymap_set()? I can't see anything in there that could > elog(ERROR, ...), but my general impression is that we use one anyway. Seems like a good idea. > I ran one repetition of my benchmark from last report, and the amount of WAL > has not changed. Given that, I think the previous runs remain valid. As far as I can see, the upshot of those benchmarks was that more WAL was generated, but the time to execute vacuum didn't really change. I think that's going to be pretty typical. In your test, the additional WAL amounted to about 0.6% of the amount of data VACUUM is dirtying, which I think is pretty much in the noise, assuming wal_buffers and checkpoint_segments are tuned to suitable values. The other thing to worry about from a performance view is whether the push-ups required to relocate the clear-the-vm-bits logic to within the critical sections is going to have a significant impact on ordinary insert/update/delete operations. I was a bit unsure at first about Heikki's contention that it wouldn't matter, but after going through the exercise of writing the code I think I see now why he wasn't concerned. The only possibly-noticeable overhead comes from the possibility that we might decide not to pin the visibility map buffer before locking the page(s) we're operating on, and need to unlock-pin-and-relock. But for that to happen, VACUUM's got to buzz through and mark the page all-visible just around the time that we examine bit. And I have a tough time imagining that happening very often. You have to have a page that has been modified since the last VACUUM, but not too recently (otherwise it's not actually all-visible) and then VACUUM has to get there and process it at almost the same moment that someone decides to make another modification. It's hard even to think about an artificial test case that would hit that situation with any regularity. >> Hmm, now that I look at it, I think this test is backwards too. I >> think you might be right that it wouldn't hurt anything to go ahead >> and set it, but I'm inclined to leave it in for now. > > Okay. Consider expanding the comment to note that this is out of conservatism > rather than known necessity. Done. > Probably not worth mentioning, but there remains one space instead of one tab > after "visibilitymap_clear" in this line: > * visibilitymap_clear - clear a bit in the visibility map Gee, I'm not sure whether there's a one true way of doing that. I know we have a no-spaces-before-tabs rule but I'm not sure what the guidelines are for indenting elsewhere in the line. > FWIW, I just do C-s TAB in emacs and then scan for places where the boundaries > of the highlighted regions fail to line up vertically. vim. >> > This fills RelationGetBufferForTuple()'s needs, but the name doesn't seem >> > to >> > fit the task too well. ?This function doesn't check the pin or that the >> > buffer >> > applies to the correct relation. ?Consider naming it something like >> > `visibilitymap_buffer_covers_block'. >> >> I think that this may be related to the fact that visibility
Re: [HACKERS] crash-safe visibility map, take five
On Fri, Jun 17, 2011 at 01:21:17PM -0400, Robert Haas wrote: > On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch wrote: > > I suggest revisiting the suggestion in > > http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us and > > including a latestRemovedXid in xl_heap_visible. ?The range of risky xids is > > the same for setting a visibility map bit as for deleting a dead tuple, and > > the same operation (VACUUM) produces both conflicts. > > See Heikki's follow-up email. The standby has to ignore all-visible > bits anyway, because the fact that a transaction is all-visible on the > master does not imply that it is all-visible on the standby. So I > don't think there's a problem here. Indeed. If we conflicted on the xmin of the most-recent all-visible tuple when setting the bit, all-visible on the standby would become a superset of all-visible on the master, and we could cease to ignore the bits. This is an excellent trade-off for masters that do UPDATE and DELETE, because they're already conflicting (or using avoidance measures) on similar xids at VACUUM time. (Some INSERT-only masters will disfavor the trade-off, but we could placate them with a GUC. On the other hand, hot_standby_feedback works and costs an INSERT-only master so little.) > > I proceeded to do some immediate-shutdown tests to see if we get the bits > > set > > as expected. ?I set up a database like this: > > ? ? ? ?CREATE EXTENSION pageinspect; > > ? ? ? ?CREATE TABLE t (c int); > > ? ? ? ?INSERT INTO t VALUES (2); > > ? ? ? ?CHECKPOINT; > > I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them > > with "VACUUM t". ?I checked bits with this query: > > ? ? ? ?SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)), > > ? ? ? ? ? ? ? to_hex(flags::int) FROM page_header(get_raw_page('t', 0)); > > The row from that query should generally be 0,1 (bits unset) or 1,5 (bits > > set). ?0,5 is fine after a crash. ?1,1 means we've broken our contract: the > > VM > > bit is set while PD_ALL_VISIBLE is not set. > > > > First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the > > postmaster. ?The system came back up with 1/1 bits. ?I poked around enough > > to > > see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any > > deeper toward a root cause. ?If you have trouble reproducing this, let me > > know > > so I can assemble a complete, self-contained test case. > > Thank you for putting these test cases together. As you can probably > tell, I was having difficulty figuring out exactly how to test this. No problem. I ran these test cases with the new patch: -Description- -Expected bits- -Result- set bit, VACUUM commit, crash 1,5 1,5 set bit, crash 0,1 0,1 set bit, flush heap page, crash 0,5 0,5 set bit, flush vm page, crash 1,5 1,5 xlog flush request locations look correct, too. Overall, looking good. Do you know of other notable cases to check? The last two were somewhat tricky to inject; if anyone wishes to reproduce them, I'm happy to share my recipe. One more thing came to mind: don't we need a critical section inside the "if" block in visibilitymap_set()? I can't see anything in there that could elog(ERROR, ...), but my general impression is that we use one anyway. I ran one repetition of my benchmark from last report, and the amount of WAL has not changed. Given that, I think the previous runs remain valid. > I think that the problem here is that the sense of that test is > exactly backwards from what it should be. IIUC, the word "precedes" > in the previous comment should in fact say "follows". Ah; quite so. > >> + ? ? ? ? ? ? /* Don't set the bit if replay has already passed this > >> point. */ > >> + ? ? ? ? ? ? if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer > >> + ? ? ? ? ? ? ? ? ? ? visibilitymap_set(reln, xlrec->block, lsn, > >> &vmbuffer); > > > > ... wouldn't it be better to do this unconditionally? ?Some later record > > will > > unset it if needed, because all replay functions that clear the bit do so > > without consulting the vm page LSN. ?On the other hand, the worst > > consequence > > of the way you've done it is VACUUM visiting the page one more time to set > > the > > bit. > > Hmm, now that I look at it, I think this test is backwards too. I > think you might be right that it wouldn't hurt anything to go ahead > and set it, but I'm inclined to leave it in for now. Okay. Consider expanding the comment to note that this is out of conservatism rather than known necessity. > > I think it's worth noting, perhaps based on your explanation in the > > second-to-last paragraph of > > http://archives.postgresql.org/message-id/BANLkTi=b7jvmq6fa_exlcygzuyv1u9a
Re: [HACKERS] crash-safe visibility map, take five
On Thu, Jun 16, 2011 at 11:17 PM, Noah Misch wrote: > I took a look at this patch. No kidding! Thanks for the very detailed review. > +1 for Buffer over Buffer * when the value isn't mutated for the caller. I changed this. > I suggest revisiting the suggestion in > http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us and > including a latestRemovedXid in xl_heap_visible. The range of risky xids is > the same for setting a visibility map bit as for deleting a dead tuple, and > the same operation (VACUUM) produces both conflicts. See Heikki's follow-up email. The standby has to ignore all-visible bits anyway, because the fact that a transaction is all-visible on the master does not imply that it is all-visible on the standby. So I don't think there's a problem here. > lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only > changed the recptr argument for one of them. This has the effect that we only > emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such > as those not modified since the transaction creating the table. I fixed this > before testing further. Good catch, thanks. I also added the Assert() that you recommended further down. > This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite > having taken no buffer content lock. I added > LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); > before the "if" to get past this. Fixed, thanks. > I proceeded to do some immediate-shutdown tests to see if we get the bits set > as expected. I set up a database like this: > CREATE EXTENSION pageinspect; > CREATE TABLE t (c int); > INSERT INTO t VALUES (2); > CHECKPOINT; > I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them > with "VACUUM t". I checked bits with this query: > SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)), > to_hex(flags::int) FROM page_header(get_raw_page('t', 0)); > The row from that query should generally be 0,1 (bits unset) or 1,5 (bits > set). 0,5 is fine after a crash. 1,1 means we've broken our contract: the VM > bit is set while PD_ALL_VISIBLE is not set. > > First test was to clear bits, checkpoint, then VACUUM and SIGQUIT the > postmaster. The system came back up with 1/1 bits. I poked around enough to > see that XLByteLE(lsn, PageGetLSN(page)) was failing, but I did not dig any > deeper toward a root cause. If you have trouble reproducing this, let me know > so I can assemble a complete, self-contained test case. Thank you for putting these test cases together. As you can probably tell, I was having difficulty figuring out exactly how to test this. I think that the problem here is that the sense of that test is exactly backwards from what it should be. IIUC, the word "precedes" in the previous comment should in fact say "follows". > I would delete this comment. We were done earlier, but we needed to finish > the > critical section. Done. > Concerning the optimization in xlog_heap_delete() et al. of not changing the > page when its LSN is newer -- am I correct that it only ever triggers when > full_page_writes = off? Assuming yes ... I believe that's right. >> + /* >> + * Even we skipped the heap page update due to the LSN interlock, it's >> + * still safe to update the visibility map. Any WAL record that clears >> + * the visibility map bit does so before checking the page LSN, so any >> + * bits that need to be cleared will still be cleared. >> + */ >> + if (record->xl_info & XLR_BKP_BLOCK_1) >> + RestoreBkpBlocks(lsn, record, false); >> + else >> + { >> + Relation reln; >> + Buffer vmbuffer = InvalidBuffer; >> + >> + reln = CreateFakeRelcacheEntry(xlrec->node); >> + visibilitymap_pin(reln, xlrec->block, &vmbuffer); >> + /* Don't set the bit if replay has already passed this point. >> */ >> + if (XLByteLE(lsn, PageGetLSN(BufferGetPage(vmbuffer >> + visibilitymap_set(reln, xlrec->block, lsn, &vmbuffer); > > ... wouldn't it be better to do this unconditionally? Some later record will > unset it if needed, because all replay functions that clear the bit do so > without consulting the vm page LSN. On the other hand, the worst consequence > of the way you've done it is VACUUM visiting the page one more time to set the > bit. Hmm, now that I look at it, I think this test is backwards too. I think you might be right that it wouldn't hurt anything to go ahead and set it, but I'm inclined to leave it in for now. > I think it's worth noting, perhaps based on your explanation in the > second-to-last paragraph of > http://archives.postgresql.org/message-id/BANLkTi=b7jvmq6fa_exlcygzuyv1u9a...@mail.gmail.com > that the answer may be incorrect again after the recheck. We don't care: > redundant clearings of the visibility bit are no
Re: [HACKERS] crash-safe visibility map, take five
Robert, I took a look at this patch. To summarize its purpose as I understand it, it does two independent but synergistic things: 1. Move the INSERT/UPDATE/DELETE clearing of visibility map bits from after the main operation to before it. This has no affect on crash recovery. It closes a race condition described in the last paragraphs of visibilitymap.c's header comment. 2. In the words of a comment added by the patch: * The critical integrity requirement here is that we must never end up with * a situation where the visibility map bit is set, and the page-level * PD_ALL_VISIBLE bit is clear. If that were to occur, then a subsequent * page modification would fail to clear the visibility map bit. It does this by WAL-logging the operation of setting a vm bit. This also has the benefit of getting vm bits set correctly on standbys. On Mon, May 23, 2011 at 03:54:44PM -0400, Robert Haas wrote: > On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee > wrote: > > Why do the set and clear functions need pass-by-reference (Buffer *) > > argument ? I don't see them modifying the argument at all. This patch adds > > the clear function, but the existing set function also suffers from that. > > Yep, I just copied the style of the existing function, in the interest > of changing no more than necessary. But maybe it'd be better to > change them both to take just the Buffer, instead of a pointer. > Anyone else want to weigh in? +1 for Buffer over Buffer * when the value isn't mutated for the caller. I suggest revisiting the suggestion in http://archives.postgresql.org/message-id/27743.1291135...@sss.pgh.pa.us and including a latestRemovedXid in xl_heap_visible. The range of risky xids is the same for setting a visibility map bit as for deleting a dead tuple, and the same operation (VACUUM) produces both conflicts. The system that has fueled my reports related to standby conflict would hit a snapshot conflict more or less instantly without vacuum_defer_cleanup_age. (With 9.1, we'll use hot_standby_feedback and never look back.) Adding visibility map bits as a source of conflict costs nothing to a system like that. You might not like it on an INSERT-only system whose VACUUMs only update visibility. Such systems might like a GUC on the standby that disables both index-only scans and conflict on xl_heap_visible.lastestRemovedXid. For normal read/write systems, there will be no advantage. The patch follows a design hashed in Nov-Dec 2010. Currently, it adds overhead to VACUUM with thin practical benefit. However, there's broad community acceptance that this is a step on a deterministic path to success at implementing index-only scans, for worthy benefit. lazy_scan_heap() has two calls to visibilitymap_set(), but the patch only changed the recptr argument for one of them. This has the effect that we only emit WAL for empty pages and pages that happened to have pd_lsn == {0,0}, such as those not modified since the transaction creating the table. I fixed this before testing further. Next, for no especially good reason, I set a breakpoint on the closing brace of visibilitymap_set, ran a VACUUM that called it, and instigated a PANIC from gdb. Recovery failed like this: 3614 2011-06-16 20:05:18.118 EDT LOG: 0: redo starts at 0/172ABA0 3614 2011-06-16 20:05:18.118 EDT LOCATION: StartupXLOG, xlog.c:6494 3614 2011-06-16 20:05:18.119 EDT FATAL: XX000: lock 148 is not held 3614 2011-06-16 20:05:18.119 EDT CONTEXT: xlog redo visible: rel 1663/16384/24588; blk 0 3614 2011-06-16 20:05:18.119 EDT LOCATION: LWLockRelease, lwlock.c:587 TRAP: FailedAssertion("!(PrivateRefCount[i] == 0)", File: "bufmgr.c", Line: 1695) 3613 2011-06-16 20:05:18.119 EDT LOG: 0: startup process (PID 3614) was terminated by signal 6: Aborted 3613 2011-06-16 20:05:18.119 EDT LOCATION: LogChildExit, postmaster.c:2881 3613 2011-06-16 20:05:18.119 EDT LOG: 0: aborting startup due to startup process failure 3613 2011-06-16 20:05:18.119 EDT LOCATION: reaper, postmaster.c:2376 This happens due to heap_xlog_redo() calling UnlockReleaseBuffer() despite having taken no buffer content lock. I added LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE); before the "if" to get past this. I proceeded to do some immediate-shutdown tests to see if we get the bits set as expected. I set up a database like this: CREATE EXTENSION pageinspect; CREATE TABLE t (c int); INSERT INTO t VALUES (2); CHECKPOINT; I normally cleared bits with "UPDATE t SET c = 1; CHECKPOINT;" and set them with "VACUUM t". I checked bits with this query: SELECT to_hex(get_byte(get_raw_page('t', 'vm', 0), 24)), to_hex(flags::int) FROM page_header(get_raw_page('t', 0)); The row from that query should generally be 0,1 (bits unset) or 1,5 (bits set). 0,5 is fine after a crash. 1,1 means we've broken our contract: the VM bit is set while PD_ALL_VISIBLE is not set. First test was to clear bits, checkpoint, th
Re: [HACKERS] crash-safe visibility map, take five
On Wed, May 11, 2011 at 8:46 AM, Pavan Deolasee wrote: > Why do the set and clear functions need pass-by-reference (Buffer *) > argument ? I don't see them modifying the argument at all. This patch adds > the clear function, but the existing set function also suffers from that. Yep, I just copied the style of the existing function, in the interest of changing no more than necessary. But maybe it'd be better to change them both to take just the Buffer, instead of a pointer. Anyone else want to weigh in? > There are several invocations of pin/clear/release combos. May be you would > want a convenience routine for doing that in a single step or just pass > InvalidBuffer to clear() in which case, it would assume that the vm buffer > is not pinned and do the needful. I don't think there are enough copies of this logic to worry about it; and I think that keeping those routines separate is more clear. > The comment at the top of visibilitymap_pin_ok says "On entry, *buf", but > the function really takes just a buf. Good catch.Very slightly updated patch attached. > You can possibly fold > visibilitymap_pin_ok() into a macro (and also name it slightly differently > like visibilitymap_is_pinned ?). Well, that name would be kind of misleading, because it's not so much checking whether we have a pin, but rather whether any pin we have is the right one. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company visibility-map-v4.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Tue, May 10, 2011 at 7:38 PM, Robert Haas wrote: > On Tue, May 10, 2011 at 9:45 AM, Merlin Moncure > wrote: > > I see: here's a comment that was throwing me off: > > + /* > > +* If we didn't get the lock and it turns out we need it, we'll > have to > > +* unlock and re-lock, to avoid holding the buffer lock across an > I/O. > > +* That's a bit unfortunate, but hopefully shouldn't happen > often. > > +*/ > > > > I think that might be phrased as "didn't get the pin and it turns out > > we need it because the bit can change after inspection". The visible > > bit isn't 'wrong' as suggested in the comments, it just can change so > > that it becomes wrong. Maybe a note of why it could change would be > > helpful. > > Oh, I see. I did write "lock" when I meant "pin", and your other > point is well-taken as well. Here's a revised version with some > additional wordsmithing. > > Some trivial comments. Why do the set and clear functions need pass-by-reference (Buffer *) argument ? I don't see them modifying the argument at all. This patch adds the clear function, but the existing set function also suffers from that. There are several invocations of pin/clear/release combos. May be you would want a convenience routine for doing that in a single step or just pass InvalidBuffer to clear() in which case, it would assume that the vm buffer is not pinned and do the needful. The comment at the top of visibilitymap_pin_ok says "On entry, *buf", but the function really takes just a buf. You can possibly fold visibilitymap_pin_ok() into a macro (and also name it slightly differently like visibilitymap_is_pinned ?). Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] crash-safe visibility map, take five
On Tue, May 10, 2011 at 1:00 PM, Tom Lane wrote: > Robert Haas writes: >> On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh wrote: >>> Or what is the downside for keeping it across IO? Will it block other >>> processes trying to read it? > >> Heikki might be in a better position to comment on that than I am, >> since he wrote the existing code. But I think that's basically the >> issue. When one process has an exclusive buffer lock, nobody else can >> scan the tuples in that block - so a sequential scan, for example, >> that reached that block, or an index scan that needed to probe into >> it, would pile up behind the read of the visibility map page. > > Right, it's the loss of potential concurrency that's annoying here. > > On the other hand, the concurrency loss might be entirely theoretical > --- in particular, if other potential readers of the heap page would > probably also need to wait for the visibility page to come in, then > nothing is gained by letting them acquire the heap page lock sooner. > > I've not read this thread in any detail yet, but if we're going to be > jumping through extremely complex hoops to avoid that scenario, it might > be better to KISS ... especially in the first iteration. I wouldn't describe the hoops as extremely complex; I don't feel any inclination to simplify the patch beyond what it is right now. Of course, we'll see what the feedback looks like after more people have read the patch, but my feeling is that the patch strikes a reasonable balance between performance and keeping it simple. There are some more complicated shenanigans that I started to experiment with and ripped out as premature optimization, but this part I think is OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
Robert Haas writes: > On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh wrote: >> Or what is the downside for keeping it across IO? Will it block other >> processes trying to read it? > Heikki might be in a better position to comment on that than I am, > since he wrote the existing code. But I think that's basically the > issue. When one process has an exclusive buffer lock, nobody else can > scan the tuples in that block - so a sequential scan, for example, > that reached that block, or an index scan that needed to probe into > it, would pile up behind the read of the visibility map page. Right, it's the loss of potential concurrency that's annoying here. On the other hand, the concurrency loss might be entirely theoretical --- in particular, if other potential readers of the heap page would probably also need to wait for the visibility page to come in, then nothing is gained by letting them acquire the heap page lock sooner. I've not read this thread in any detail yet, but if we're going to be jumping through extremely complex hoops to avoid that scenario, it might be better to KISS ... especially in the first iteration. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Tue, May 10, 2011 at 11:34 AM, Jesper Krogh wrote: > On 2011-05-10 14:48, Robert Haas wrote: >> >> We could avoid all of this complexity - and the possibility of pinning >> the visibility map page needlessly - by locking the heap buffer first >> and then pinning the visibility map page if the heap page is >> all-visible. However, that would involve holding the lock on the heap >> buffer across a possible disk I/O to bring the visibility map page >> into memory, which is something the existing code tries pretty hard to >> avoid. > > Assuming that the visibillity map would be used for visibillity testing, > just picking the lock would effectively mean "we want it in the buffers", > which would not be that bad? > > Or what is the downside for keeping it across IO? Will it block other > processes trying to read it? Heikki might be in a better position to comment on that than I am, since he wrote the existing code. But I think that's basically the issue. When one process has an exclusive buffer lock, nobody else can scan the tuples in that block - so a sequential scan, for example, that reached that block, or an index scan that needed to probe into it, would pile up behind the read of the visibility map page. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On 2011-05-10 14:48, Robert Haas wrote: We could avoid all of this complexity - and the possibility of pinning the visibility map page needlessly - by locking the heap buffer first and then pinning the visibility map page if the heap page is all-visible. However, that would involve holding the lock on the heap buffer across a possible disk I/O to bring the visibility map page into memory, which is something the existing code tries pretty hard to avoid. Assuming that the visibillity map would be used for visibillity testing, just picking the lock would effectively mean "we want it in the buffers", which would not be that bad? Or what is the downside for keeping it across IO? Will it block other processes trying to read it? -- Jesper -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Tue, May 10, 2011 at 9:45 AM, Merlin Moncure wrote: > I see: here's a comment that was throwing me off: > + /* > + * If we didn't get the lock and it turns out we need it, we'll have > to > + * unlock and re-lock, to avoid holding the buffer lock across an I/O. > + * That's a bit unfortunate, but hopefully shouldn't happen often. > + */ > > I think that might be phrased as "didn't get the pin and it turns out > we need it because the bit can change after inspection". The visible > bit isn't 'wrong' as suggested in the comments, it just can change so > that it becomes wrong. Maybe a note of why it could change would be > helpful. Oh, I see. I did write "lock" when I meant "pin", and your other point is well-taken as well. Here's a revised version with some additional wordsmithing. > Other than that, it looks pretty good...ISTM an awfully small amount > of code to provide what it's doing (that's a good thing!). Thanks. It's definitely not big in terms of code footprint; it's mostly a matter of making sure we've dotted all the "i"s and crossed all the "t"s. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company visibility-map-v3.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Tue, May 10, 2011 at 7:48 AM, Robert Haas wrote: > I thought I'd explained it fairly thoroughly in the comments, but > evidently not. Suggestions for improvement are welcome. ok. that clears it up nicely. > Here goes in more detail: Every time we insert, update, or delete a > tuple in a particular heap page, we must check whether the page is > marked all-visible. If it is, then we need to clear the page-level > bit marking it as all-visible, and also the corresponding page in the > visibility map. On the other hand, if the page isn't marked > all-visible, then we needn't touch the visibility map at all. So, > there are either one or two buffers involved: the buffer containing > the heap page ("buffer") and possibly also a buffer containing the > visibility map page in which the bit for the heap page is to be found > ("vmbuffer"). Before taking an exclusive content-lock on the heap > buffer, we check whether the page appears to be all-visible. If it > does, then we pin the visibility map page and then lock the buffer. > If not, we just lock the buffer. I see: here's a comment that was throwing me off: + /* +* If we didn't get the lock and it turns out we need it, we'll have to +* unlock and re-lock, to avoid holding the buffer lock across an I/O. +* That's a bit unfortunate, but hopefully shouldn't happen often. +*/ I think that might be phrased as "didn't get the pin and it turns out we need it because the bit can change after inspection". The visible bit isn't 'wrong' as suggested in the comments, it just can change so that it becomes wrong. Maybe a note of why it could change would be helpful. Other than that, it looks pretty good...ISTM an awfully small amount of code to provide what it's doing (that's a good thing!). merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Mon, May 9, 2011 at 10:25 PM, Merlin Moncure wrote: > On Fri, May 6, 2011 at 5:47 PM, Robert Haas wrote: >> On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas >> wrote: Another question: To address the problem in http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php , should we just clear the vm before the log of insert/update/delete? This may reduce the performance, is there another solution? >>> >>> Yeah, that's a straightforward way to fix it. I don't think the performance >>> hit will be too bad. But we need to be careful not to hold locks while doing >>> I/O, which might require some rearrangement of the code. We might want to do >>> a similar dance that we do in vacuum, and call visibilitymap_pin first, then >>> lock and update the heap page, and then set the VM bit while holding the >>> lock on the heap page. >> >> Here's an attempt at implementing the necessary gymnastics. > > Is there a quick synopsis of why you have to do (sometimes) the > pin->lock->unlock->pin->lock mechanic? How come you only can fail to > get the pin at most once? I thought I'd explained it fairly thoroughly in the comments, but evidently not. Suggestions for improvement are welcome. Here goes in more detail: Every time we insert, update, or delete a tuple in a particular heap page, we must check whether the page is marked all-visible. If it is, then we need to clear the page-level bit marking it as all-visible, and also the corresponding page in the visibility map. On the other hand, if the page isn't marked all-visible, then we needn't touch the visibility map at all. So, there are either one or two buffers involved: the buffer containing the heap page ("buffer") and possibly also a buffer containing the visibility map page in which the bit for the heap page is to be found ("vmbuffer"). Before taking an exclusive content-lock on the heap buffer, we check whether the page appears to be all-visible. If it does, then we pin the visibility map page and then lock the buffer. If not, we just lock the buffer. However, since we weren't holding any lock, it's possible that between the time when we checked the visibility map bit and the time when we obtained the exclusive buffer-lock, the visibility map bit might have changed from clear to set (because someone is concurrently running VACUUM on the table; or on platforms with weak memory-ordering, someone was running VACUUM "almost" concurrently). If that happens, we give up our buffer lock, go pin the visibility map page, and reacquire the buffer lock. At this point in the process, we know that *if* the page is marked all-visible, *then* we have the appropriate visibility map page pinned. There are three possible pathways: (1) If the buffer initially appeared to be all-visible, we will have pinned the visibility map page before acquiring the exclusive lock; (2) If the buffer initially appeared NOT to be all-visible, but by the time we obtained the exclusive lock it now appeared to be all-visible, then we will have done the unfortunate unlock-pin-relock dance, and the visibility map page will now be pinned; (3) if the buffer initially appeared NOT to be all-visible, and by the time we obtained the exclusive lock it STILL appeared NOT to be all-visible, then we don't have the visibility map page pinned - but that's OK, because in this case no operation on the visibility map needs to be performed. Now it is very possible that in case (1) or (2) the visibility map bit, though we saw it set at some point, will actually have been cleared in the meantime. In case (1), this could happen before we obtain the exclusive lock; while in case (2), it could happen after we give up the lock to go pin the visibility map page and before we reacquire it. This will typically happen when a buffer has been sitting around for a while in an all-visible state and suddenly two different backends both try to update or delete tuples in that buffer at almost exactly the same time. But it causes no great harm - both backends will pin the visibility map page, whichever one gets the exclusive lock on the heap page first will clear it, and when the other backend gets the heap page afterwards, it will see that the bit has already been cleared and do nothing further. We've wasted the effort of pinning and unpinning the visibility map page when it wasn't really necessary, but that's not the end of the world. We could avoid all of this complexity - and the possibility of pinning the visibility map page needlessly - by locking the heap buffer first and then pinning the visibility map page if the heap page is all-visible. However, that would involve holding the lock on the heap buffer across a possible disk I/O to bring the visibility map page into memory, which is something the existing code tries pretty hard to avoid. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-
Re: [HACKERS] crash-safe visibility map, take five
On Fri, May 6, 2011 at 5:47 PM, Robert Haas wrote: > On Wed, Mar 30, 2011 at 8:52 AM, Heikki Linnakangas > wrote: >>> Another question: >>> To address the problem in >>> http://archives.postgresql.org/pgsql-hackers/2010-02/msg02097.php >>> , should we just clear the vm before the log of insert/update/delete? >>> This may reduce the performance, is there another solution? >> >> Yeah, that's a straightforward way to fix it. I don't think the performance >> hit will be too bad. But we need to be careful not to hold locks while doing >> I/O, which might require some rearrangement of the code. We might want to do >> a similar dance that we do in vacuum, and call visibilitymap_pin first, then >> lock and update the heap page, and then set the VM bit while holding the >> lock on the heap page. > > Here's an attempt at implementing the necessary gymnastics. Is there a quick synopsis of why you have to do (sometimes) the pin->lock->unlock->pin->lock mechanic? How come you only can fail to get the pin at most once? merlin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] crash-safe visibility map, take five
On Fri, May 6, 2011 at 2:47 PM, Robert Haas wrote: > Comments? At my day job there is saying: "Silence is consent". I am surprised there has not been more discussion of this change, considering the magnitude of the possibilities it unlocks. -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers