Re: [HACKERS] FPI
On Tue, Feb 1, 2011 at 12:41 PM, Tom Lane wrote: > Robert Haas writes: >> So I'm back to proposing that we just apply FPI-free WAL records >> unconditionally, without regard to the LSN. This could potentially >> corrupt the page, of course. > > Yes. So you're still assuming that there will be a later FPI-containing > WAL record to fix up the mess you created. What if there isn't? In that case, the page shouldn't be corrupted. The possibility of corruption comes from the fact that a future WAL record might rearrange the page contents so that the current WAL record is no longer applying to the set of tuples it expects to be seeing. But any such action would necessarily induce an FPI. If there is no such action, then how can the page get into a state where replaying a heap delete will corrupt it? For that to happen, the item pointer list has to have changed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
Robert Haas writes: > So I'm back to proposing that we just apply FPI-free WAL records > unconditionally, without regard to the LSN. This could potentially > corrupt the page, of course. Yes. So you're still assuming that there will be a later FPI-containing WAL record to fix up the mess you created. What if there isn't? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
On Mon, Jan 31, 2011 at 10:01 AM, Robert Haas wrote: > On Fri, Jan 28, 2011 at 3:39 PM, Robert Haas wrote: >> What happens if we (a) keep the current rule after reaching >> consistency and (b) apply any such updates *unconditionally* - that >> is, without reference to the LSN - prior to reaching consistency? >> Under that rule, if we encounter an FPI before reaching consistency, >> we're OK. So let's suppose we don't. No matter how many times we >> replay any initial prefix of any such updates between the redo pointer >> and the point at which we reach consistency, the state of the page >> when we finally reach consistency will be identical. But we could get >> hosed if replay progressed *past* the minimum recovery point and then >> started over at the previous redo pointer. If we forced an immediate >> restartpoint on reaching consistency, that seems like it might prevent >> that scenario. > > Actually, I'm wrong, and this doesn't work at all. At the time of the > crash, there could already be pages on disk with LSNs greater than the > minimum recovery point. Duh. > > It was such a good idea in my head... I should mention that most of this idea was Heikki's, originally. Except for the crappy parts that don't work - those are all me. But I'm back to thinking this can work. Heikki pointed out to me on IM today that in crash recovery, we always replay to end-of-WAL before opening for connections, and for Hot Standby every block we write advances the minimum recovery point to its LSN. This implies that if we're accepting connections (either regular or Hot Standby) or at a valid stopping point for PITR, there are no unreplayed WAL records whose changes are reflected in blocks on disk. So I'm back to proposing that we just apply FPI-free WAL records unconditionally, without regard to the LSN. This could potentially corrupt the page, of course. Consider delete (no FPI) - vacuum (with FPI) - crash, leaving the vacuum page half on disk. Now the replay of the delete is probably going to do the wrong thing, because the page is torn. But it doesn't matter, because the vacuum's FPI will overwrite the page anyway, and whatever stupid thing the delete replay did will become irrelevant - BEFORE we can begin processing any queries. On the other hand, if the delete record *isn't* followed by an FPI, but just, by, say, a bunch more deletes, then it should all Just Work (TM). As long as the page header (excluding LSN and TLI, which we're ignoring by stipulation) and item pointer list are intact, we can redo those deletes and clean things up. And if they're not intact, then we must've done something that emits an FPI, and so any temporary page corruption will get overwritten when we get to that point in the WAL stream... That is a bit ugly, though, because it means the XLOG replay of FPI-free records would have to be prepared to just punt if they encounter any sort of corruption, in the sure hope that any such corruption must imply the presence of a future FPI that will be replayed - since if there is no such future FPI, it should be impossible for the page to be corrupted in the first place. But that might reduce our chances of being able to detect real corruption. Heikki also came up with another idea that might be worth exploring: at the point when we currently emit FPIs, emit an image of just the part of the page that precedes pd_lower - the page header and item IDs. To make this work, we'd have to make a rule that redo isn't allowed to rely for correctness on any bits following the pd_lower boundary - it can write those bits, but it can't read them. But most of the XLOG_HEAP records follow that rule already - we look at the item pointers to figure out where we're putting a new tuple or to locate an existing tuple and unconditionally overwrite some of its bits. The obvious exception is XLOG_HEAP2_CLEAN, emitted by VACUUM, which would probably need to just log the entire page. Also, we'd again need to apply records unconditionally, without reference to the page LSN, until we reached the minimum recovery point. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
On Fri, Jan 28, 2011 at 8:08 PM, Tom Lane wrote: > 3. Page LSN > WAL location: do NOT apply field update or change LSN. > I don't think this works. There could be multiple writes to a page for different records before the crash occurs. The LSN could be far in the future and yet still have a torn page missing the update. Another thing to consider is that there's still a plan on the table to implement block checksums. Does any of this block that? Or do checksums make it *easier* to implement any of this? You can check the checksum and if it's valid assume there isn't a torn page. If the LSN >= current you can skip the log record. If the checksum is invalid you could try replaying the log entry and if it makes it valid then you're golden. If not then you could continue and hopefully there will be more unconditional records and eventually the block will become consistent or a FP write will come along later. Just for reference the Oracle solution is to ignore the problem but provide recovery tools to recover an individual block. You go to the last consistent backup, pull the old version of the block from there, then apply the logs from that point forward replaying any records for that block. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
On Fri, Jan 28, 2011 at 3:39 PM, Robert Haas wrote: > What happens if we (a) keep the current rule after reaching > consistency and (b) apply any such updates *unconditionally* - that > is, without reference to the LSN - prior to reaching consistency? > Under that rule, if we encounter an FPI before reaching consistency, > we're OK. So let's suppose we don't. No matter how many times we > replay any initial prefix of any such updates between the redo pointer > and the point at which we reach consistency, the state of the page > when we finally reach consistency will be identical. But we could get > hosed if replay progressed *past* the minimum recovery point and then > started over at the previous redo pointer. If we forced an immediate > restartpoint on reaching consistency, that seems like it might prevent > that scenario. Actually, I'm wrong, and this doesn't work at all. At the time of the crash, there could already be pages on disk with LSNs greater than the minimum recovery point. Duh. It was such a good idea in my head... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
On Fri, Jan 28, 2011 at 3:08 PM, Tom Lane wrote: > Robert Haas writes: >> Any substantive comments, besides the obvious "this is not 9.1 material"? > > Now that I've absorbed a bit more caffeine, let's see if I can think > straight this time. > > General principle you want to assert: any WAL entry that merely results > in setting a deterministic field to a deterministic value shouldn't need > a FPI, since it is easy to check whether the field has that value and > re-apply the update if needed. The way this would have to work is: > > 1. Page LSN < WAL location: apply field update, set page LSN = WAL location. > > 2. Page LSN = WAL location: check if field matches, apply update if not. > > 3. Page LSN > WAL location: do NOT apply field update or change LSN. > > Now the issue is what happens if a torn-page event causes the LSN to be > out of sync with the page contents. If the LSN is too small (ie, the > actual page contents come from some later WAL entry) we may mistakenly > apply action 1 or 2 to bytes that don't actually represent the field we > think they do. Now, that would be all right once we replay the later > WAL entry and replace the page data from its FPI. But there are two > huge risks here: one being that in a PITR operation the user might tell > us to stop short of applying the later WAL entry, and the other being > that in any case we'll have an interval where the page is corrupt, which > is a problem if any hot-standby queries try to look at it. > > I think we might be all right with that if we can guarantee that any such > inconsistencies only exist before the system believes that it's reached > a consistent database state, but I'm not quite sure if that's true or > not. I wasn't too sure either, at first, but I think it must be true. If it were possible to reach consistency while there were still torn pages on disk, then we could enter normal running with those pages still on disk by stopping recovery at that point. And clearly that's not going to fly unless we're talking about something like hint bits, where nothing's actually busted if we only get half the update. > Hmm, no, it doesn't work, because the above argument assumes there is a > later FPI-containing WAL entry at all. Suppose we have a sequence of > several single-field-setting WAL entries, and there is no FPI-containing > WAL entry for the page before end of WAL. We could have a torn page > such that the LSN comes from one of the later entries but the field that > should be set from an earlier entry is old. When we replay the earlier > entry, case 3 will apply, so we do nothing ... incorrectly. And there > will be no FPI to fix it. What happens if we (a) keep the current rule after reaching consistency and (b) apply any such updates *unconditionally* - that is, without reference to the LSN - prior to reaching consistency? Under that rule, if we encounter an FPI before reaching consistency, we're OK. So let's suppose we don't. No matter how many times we replay any initial prefix of any such updates between the redo pointer and the point at which we reach consistency, the state of the page when we finally reach consistency will be identical. But we could get hosed if replay progressed *past* the minimum recovery point and then started over at the previous redo pointer. If we forced an immediate restartpoint on reaching consistency, that seems like it might prevent that scenario. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
Robert Haas writes: > Any substantive comments, besides the obvious "this is not 9.1 material"? Now that I've absorbed a bit more caffeine, let's see if I can think straight this time. General principle you want to assert: any WAL entry that merely results in setting a deterministic field to a deterministic value shouldn't need a FPI, since it is easy to check whether the field has that value and re-apply the update if needed. The way this would have to work is: 1. Page LSN < WAL location: apply field update, set page LSN = WAL location. 2. Page LSN = WAL location: check if field matches, apply update if not. 3. Page LSN > WAL location: do NOT apply field update or change LSN. Now the issue is what happens if a torn-page event causes the LSN to be out of sync with the page contents. If the LSN is too small (ie, the actual page contents come from some later WAL entry) we may mistakenly apply action 1 or 2 to bytes that don't actually represent the field we think they do. Now, that would be all right once we replay the later WAL entry and replace the page data from its FPI. But there are two huge risks here: one being that in a PITR operation the user might tell us to stop short of applying the later WAL entry, and the other being that in any case we'll have an interval where the page is corrupt, which is a problem if any hot-standby queries try to look at it. I think we might be all right with that if we can guarantee that any such inconsistencies only exist before the system believes that it's reached a consistent database state, but I'm not quite sure if that's true or not. Hmm, no, it doesn't work, because the above argument assumes there is a later FPI-containing WAL entry at all. Suppose we have a sequence of several single-field-setting WAL entries, and there is no FPI-containing WAL entry for the page before end of WAL. We could have a torn page such that the LSN comes from one of the later entries but the field that should be set from an earlier entry is old. When we replay the earlier entry, case 3 will apply, so we do nothing ... incorrectly. And there will be no FPI to fix it. It doesn't work. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
On Fri, Jan 28, 2011 at 1:42 PM, Tom Lane wrote: > Robert Haas writes: >> On Fri, Jan 28, 2011 at 10:13 AM, Tom Lane wrote: >>> Say what? A heap deletion compacts the page --- it will certainly fail >>> badly on torn-page. > >> What do you mean by "compacts the page"? I would interpret that to >> mean "reclaims the space formerly used by the tuple being deleted", >> but it certainly can't do that. > > Oh, I'm sorry, I was confusing the case with vacuum cleanup. Obviously > hadn't consumed any caffeine yet. Nevermind ... OK. You had me worried for a minute there. :-) Any substantive comments, besides the obvious "this is not 9.1 material"? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
Robert Haas writes: > On Fri, Jan 28, 2011 at 10:13 AM, Tom Lane wrote: >> Say what? A heap deletion compacts the page --- it will certainly fail >> badly on torn-page. > What do you mean by "compacts the page"? I would interpret that to > mean "reclaims the space formerly used by the tuple being deleted", > but it certainly can't do that. Oh, I'm sorry, I was confusing the case with vacuum cleanup. Obviously hadn't consumed any caffeine yet. Nevermind ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
On Fri, Jan 28, 2011 at 10:13 AM, Tom Lane wrote: > Robert Haas writes: >> I was thinking about full-page writes again tonight. I'm still >> wondering about the feasibility of getting rid of full-page writes for >> certain operations. We can do this, I think, in any case where we can >> convince ourselves that if the original operation, or a redo of the >> original operation, leaves behind a torn page, a subsequent redo will >> still DTRT. > >> I think that both tuple freezing (XLOG_HEAP2_FREEZE) and heap >> deletions (XLOG_HEAP_DELETE) are close to having this property. > > Say what? A heap deletion compacts the page --- it will certainly fail > badly on torn-page. What do you mean by "compacts the page"? I would interpret that to mean "reclaims the space formerly used by the tuple being deleted", but it certainly can't do that. The transaction might not commit, and in any case the tuple will still be visible to concurrent snapshots. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] FPI
Robert Haas writes: > I was thinking about full-page writes again tonight. I'm still > wondering about the feasibility of getting rid of full-page writes for > certain operations. We can do this, I think, in any case where we can > convince ourselves that if the original operation, or a redo of the > original operation, leaves behind a torn page, a subsequent redo will > still DTRT. > I think that both tuple freezing (XLOG_HEAP2_FREEZE) and heap > deletions (XLOG_HEAP_DELETE) are close to having this property. Say what? A heap deletion compacts the page --- it will certainly fail badly on torn-page. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] FPI
I was thinking about full-page writes again tonight. I'm still wondering about the feasibility of getting rid of full-page writes for certain operations. We can do this, I think, in any case where we can convince ourselves that if the original operation, or a redo of the original operation, leaves behind a torn page, a subsequent redo will still DTRT. I think that both tuple freezing (XLOG_HEAP2_FREEZE) and heap deletions (XLOG_HEAP_DELETE) are close to having this property. The updates are pretty much unconditional - they certainly don't depend on existing xmin/xmax/ctid/infomask bits being in any particular state. But this (in the redo logic) is a problem: if (XLByteLE(lsn, PageGetLSN(page))) { UnlockReleaseBuffer(buffer); return; } A crash might leave the updated LSN on disk, while the rest of the page isn't. If this is workable at all, we'd need to at least suppress the above behavior (I believe it would still be OK after we reach consistency, but not before). Tonight I realized that I think we could also make this property hold for updates, only with respect to the block containing the *old* tuple (which is basically getting the equivalent of a delete). While reducing WAL just for deletes and freezing might be judged not worth the additional complexity, maybe being able to improve the update case would put it over the top. Thoughts? Any idea how much of our WAL volume comes from FPIs vs. everything else? Index changes vs. heap changes? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers