RE: [HACKERS] CRCs
Instead of a partial row CRC, we could just as well use some other bit of identifying information, say the row OID. Given a block CRC on the heap page, we'll be pretty confident already that the heap page is OK, we just need to guard against the possibility that it's older than the index item. Checking that there is a valid tuple at the slot indicated by the index item, and that it has the right OID, should be a good enough (and cheap enough) test. This would work in 7.1 but not in 7.2 anyway (assuming UNDO and true transaction rollback to be implemented). There will be no permanent pg_log, and after crash recovery any heap tuple with unknown t_xmin status will be assumed to be committed. Rollback will remove tuples inserted by uncommitted transactions, but this will be possible only for *logged* modifications. One should properly configure disk drives instead of hacking around this problem. "Log before modifying data pages" is *the* rule for any WAL system like Oracle, Informix and a dozen others. Vadim
AW: [HACKERS] CRCs
Instead of a partial row CRC, we could just as well use some other bit of identifying information, say the row OID. Given a block CRC on the heap page, we'll be pretty confident already that the heap page is OK, we just need to guard against the possibility that it's older than the index item. Checking that there is a valid tuple at the slot indicated by the index item, and that it has the right OID, should be a good enough (and cheap enough) test. I would hardly call an additional 4 bytes for OID per index entry cheap. Andreas
Re: [HACKERS] CRCs
Andreas SB Zeugswetter wrote: Tom Lane wrote: Instead of a partial row CRC, we could just as well use some other bit of identifying information, say the row OID. ... Checking that there is a valid tuple at the slot indicated by the index item, and that it has the right OID, should be a good enough (and cheap enough) test. I would hardly call an additional 4 bytes for OID per index entry cheap. "Cheap enough" is very different from "cheap". Undetected corruption may be arbitrarily expensive when it finally manifests itself. That said, maybe storing just the low byte or two of the OID in the index would be good enough. Also, maybe the OID would be there by default, but could be ifdef'd out if the size of the indices affects you noticeably, and you know that your equipment (unlike most) really does implement strict write ordering. Nathan Myers [EMAIL PROTECTED]
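A minimal sketch in C of the check under discussion, with invented structures (this is not PostgreSQL's real page or tuple header layout, and the names are made up); the point is only that, once the heap page has passed its own block CRC, matching a stored OID (or just its low bytes) against the tuple in the indicated slot is a cheap staleness test:

    /* Invented, simplified structures; not the real PostgreSQL layout. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t Oid;

    typedef struct {
        Oid t_oid;                  /* row OID kept in the tuple header */
        /* ... commit-status bits, user data, etc. ... */
    } FakeHeapTuple;

    typedef struct {
        uint16_t       nslots;      /* at most 256 line pointers in this toy page */
        FakeHeapTuple *slots[256];  /* NULL means the line pointer is unused */
    } FakeHeapPage;

    /* True if the slot named by the index entry still holds a tuple with
     * the OID the index remembered; false means the heap page is older
     * than the index entry (or the entry is otherwise stale). */
    static bool
    index_entry_looks_valid(const FakeHeapPage *page, uint16_t slot, Oid expected)
    {
        if (slot >= page->nslots)
            return false;
        if (page->slots[slot] == NULL)
            return false;
        return page->slots[slot]->t_oid == expected;
        /* For the "low byte or two" variant: compare (t_oid & 0xFFFF) instead. */
    }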
Re: [HACKERS] CRCs
On Fri, Jan 12, 2001 at 11:30:30PM -0500, Tom Lane wrote: AFAICS, disk-block CRCs do not guard against mishaps involving intended writes. They will help guard against data corruption that might creep in due to outside factors, however. Right. Given that we seem to have agreed on that, I withdraw my complaint about disk-block-CRC not being in there for 7.1. I think we are still a ways away from the point where externally-induced corruption is a major share of our failure rate ;-). 7.2 or so will be time enough to add this feature, and I'd really rather not force another initdb for 7.1. More to the point, 7.1 is a monumental accomplishment even without corruption detection, and the sooner the world has it, the better. Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CRCs
On Fri, Jan 12, 2001 at 04:38:37PM -0800, Mikheev, Vadim wrote: Example. 1. Tuple was inserted into index. 2. Looking for free buffer bufmgr decides to write index block. 3. Following WAL core rule bufmgr first calls XLogFlush() to write and fsync log record related to index tuple insertion. 4. *Believing* that log record is on disk now (after successful fsync) bufmgr writes index block. If log record was not really flushed on disk in 3. but on-disk image of index block was updated in 4. and system crashed after this then after restart recovery you'll have unlawful index tuple pointing to where? Who knows! No guarantee that corresponding heap tuple was flushed on disk. Isn't database corrupted now? Note, I haven't read the WAL code, so much of what I've said is based on what I know is and isn't possible with logging, rather than on Vadim's actual choices. I know it's *possible* to implement a logging database which can maintain consistency without need for strict write ordering; but without strict write ordering, it is not possible to guarantee durable transactions. That is, after a power outage, such a database may be guaranteed to recover uncorrupted, but some number (>= 0) of the last few acknowledged/committed transactions may be lost. Vadim's implementation assumes strict write ordering, so that (e.g.) with IDE disks a corrupt database is possible in the event of a power outage. (Database and OS crashes don't count; those don't keep the blocks from finding their way from the drive's buffers to disk.) This is no criticism; it is more efficient to assume strict write ordering, and a database that can lose (the last few) committed transactions has limited value. To achieve disk write-order independence is probably not a worthwhile goal, but for systems that cannot provide strict write ordering (e.g., most PCs) it would be helpful to be able to detect that the database has become corrupted. In Vadim's example above, if the index were to contain not only the heap blocks' numbers, but also their CRCs, then the corruption could be detected when the index is used. When the block is read in, its CRC is checked, and when it is referenced via the index, the two CRC values are simply compared and the corruption is revealed. On a machine that does provide strict write ordering, the CRCs in the index might be unnecessary overhead, but they also provide cross-checks to help detect corruption introduced by bugs and whatnot. Or maybe I don't know what I'm talking about. Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CRCs
On Sat, Jan 13, 2001 at 12:49:34PM -0500, Tom Lane wrote: [EMAIL PROTECTED] (Nathan Myers) writes: ... for systems that cannot provide strict write ordering (e.g., most PCs) it would be helpful to be able to detect that the database has become corrupted. In Vadim's example above, if the index were to contain not only the heap blocks' numbers, but also their CRCs, then the corruption could be detected when the index is used. ... A row-level CRC might be useful for this, but it would have to be on the data only (not the tuple commit-status bits). It'd be totally impractical with a block CRC, I think. ... I almost wrote about an indirect scheme to share the expected block CRC value among all the index entries that need it, but thought it would distract from the correct approach: Instead of a partial row CRC, we could just as well use some other bit of identifying information, say the row OID. ... Good. But, wouldn't the TID be more specific? True, it would be pretty unlikely for a block to have an old tuple with the right OID in the same place. Belt-and-braces says check both :-). Either way, the check seems independent of block CRCs. Would this check be simple enough to be safe for 7.1? Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CRCs
On Sunday 14 January 2001 04:49, Tom Lane wrote: A row-level CRC might be useful for this, but it would have to be on the data only (not the tuple commit-status bits). It'd be totally impractical with a block CRC, I think. To do it with a block CRC, every time you changed *anything* in a heap page, you'd have to find all the index items for each row on the page and update their copies of the heap block's CRC. That could easily turn one disk-write into hundreds, not to mention the index search costs. Similarly, a check value that is affected by tuple status updates would enormously increase the cost of marking tuples committed or dead. Ah, finally. Looks like we are moving in circles (or spirals ;-) ). Remember that some 3-4 months ago I requested help from this list several times regarding a trigger function that implements a CRC only on the user-defined attributes? I wrote one in pgtcl, which was slow, and had trouble with the C equivalent due to lack of documentation. I still believe this is useful enough that it should be an option in Postgres and not a user-defined function. Horst
Re: [HACKERS] CRCs
[EMAIL PROTECTED] (Nathan Myers) writes: Instead of a partial row CRC, we could just as well use some other bit of identifying information, say the row OID. ... Good. But, wouldn't the TID be more specific? Uh, the TID *is* the pointer from index to heap. There's no redundancy that way. Would this check be simple enough to be safe for 7.1? It'd probably be safe, but adding OIDs to index tuples would force an initdb, which I'd rather avoid at this stage of the cycle. regards, tom lane
AW: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
A disk-block CRC would detect partially written blocks (ie, power drops after disk has written M of the N sectors in a block). The disk's own checks will NOT consider this condition a failure. But physical log recovery will rewrite every page that was changed after last checkpoint, thus this is not an issue anymore. I'm not convinced that WAL will reliably detect it either (Vadim?). Certainly WAL will not help for corruption caused by external agents, away from any updates that are actually being performed/logged. The external agent (if malevolent) could write a correct CRC anyway. If, on the other hand, the agent writes complete garbage, vacuum will notice. Andreas
RE: [HACKERS] CRCs
But physical log recovery will rewrite every page that was changed after last checkpoint, thus this is not an issue anymore. No. That assumes that when the drive _says_ the block is written, it is really on the disk. That is not true for IDE drives. It is true for SCSI drives only when the SCSI spec is implemented correctly, but implementing the spec correctly interferes with favorable benchmark results. You know - this is *core* assumption. If drive lies about this then *nothing* will help you. Do you remember core rule of WAL? "Changes must be logged *before* changed data pages written". If this rule will be broken then data files will be inconsistent after crash recovery and you will not notice this, w/wo CRC in data blocks. I agreed that CRCs could help to detect other errors but probably it's too late for 7.1 Vadim
Re: [HACKERS] CRCs
On Fri, Jan 12, 2001 at 01:07:56PM -0800, Mikheev, Vadim wrote: But physical log recovery will rewrite every page that was changed after last checkpoint, thus this is not an issue anymore. No. That assumes that when the drive _says_ the block is written, it is really on the disk. That is not true for IDE drives. It is true for SCSI drives only when the SCSI spec is implemented correctly, but implementing the spec correctly interferes with favorable benchmark results. You know - this is *core* assumption. If drive lies about this then *nothing* will help you. Do you remember core rule of WAL? "Changes must be logged *before* changed data pages written". If this rule will be broken then data files will be inconsistent after crash recovery and you will not notice this, w/wo CRC in data blocks. You can include the data blocks' CRCs in the log entries. I agreed that CRCs could help to detect other errors but probably it's too late for 7.1. 7.2 is not too far off. I'm hoping to see it then. Nathan Myers [EMAIL PROTECTED]
RE: [HACKERS] CRCs
You know - this is *core* assumption. If drive lies about this then *nothing* will help you. Do you remember core rule of WAL? "Changes must be logged *before* changed data pages written". If this rule will be broken then data files will be inconsistent after crash recovery and you will not notice this, w/wo CRC in data blocks. You can include the data blocks' CRCs in the log entries. How could it help? Vadim
Re: [HACKERS] CRCs
On Fri, Jan 12, 2001 at 02:16:07PM -0800, Mikheev, Vadim wrote: You know - this is *core* assumption. If drive lies about this then *nothing* will help you. Do you remember core rule of WAL? "Changes must be logged *before* changed data pages written". If this rule will be broken then data files will be inconsistent after crash recovery and you will not notice this, w/wo CRC in data blocks. You can include the data blocks' CRCs in the log entries. How could it help? It wouldn't help you recover, but you would be able to report that you cannot recover. To be more specific, if the blocks referenced in the log are partially written, their CRCs will (probably) be wrong. If they are not physically written at all, their CRCs will be correct but will not match what is in the log. In either case the user will know immediately that the database has been corrupted, and must fall back on a failover image or backup. It would be no bad thing to include the CRC of the block referenced wherever in the file format that a block reference lives. Nathan Myers [EMAIL PROTECTED]
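To make that cross-check concrete, here is a sketch in C of the recovery-time comparison Nathan describes; the record layout and names are invented for illustration (this is not the actual WAL format), and the CRC routine is an ordinary bit-at-a-time CRC-32:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 8192

    /* Plain bit-at-a-time CRC-32 (IEEE polynomial, reflected form). */
    static uint32_t
    crc32_buf(const void *buf, size_t len)
    {
        const unsigned char *p = buf;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    /* Hypothetical: a log record remembers which block it touched and the
     * CRC of that block's intended contents. */
    typedef struct {
        uint32_t blkno;
        uint32_t expected_crc;
    } LoggedBlockRef;

    /* 0 = block on disk matches what the log expected; -1 = partial write
     * or a write that never happened, i.e. report "cannot recover". */
    static int
    check_logged_block(FILE *datafile, const LoggedBlockRef *ref)
    {
        unsigned char buf[BLOCK_SIZE];

        if (fseek(datafile, (long) ref->blkno * BLOCK_SIZE, SEEK_SET) != 0)
            return -1;
        if (fread(buf, 1, BLOCK_SIZE, datafile) != BLOCK_SIZE)
            return -1;
        return crc32_buf(buf, BLOCK_SIZE) == ref->expected_crc ? 0 : -1;
    }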
Re: [HACKERS] CRCs
[EMAIL PROTECTED] (Nathan Myers) writes: "Changes must be logged *before* changed data pages written". If this rule will be broken then data files will be inconsistent after crash recovery and you will not notice this, w/wo CRC in data blocks. You can include the data blocks' CRCs in the log entries. How could it help? It wouldn't help you recover, but you would be able to report that you cannot recover. How? The scenario Vadim is pointing out is where the disk drive writes a changed data block in advance of the WAL log entry describing the change. Then power drops and the WAL entry never gets made. At restart, how will you realize that that data block now contains data you don't want? There's not even a log entry telling you you need to look at it, much less one that tells you what should be in it. AFAICS, disk-block CRCs do not guard against mishaps involving intended writes. They will help guard against data corruption that might creep in due to outside factors, however. regards, tom lane
RE: [HACKERS] CRCs
It wouldn't help you recover, but you would be able to report that you cannot recover. How? The scenario Vadim is pointing out is where the disk drive writes a changed data block in advance of the WAL log entry describing the change. Then power drops and the WAL entry never gets made. At restart, how will you realize that that data block now contains data you don't want? There's not even a log entry telling you you need to look at it, much less one that tells you what should be in it. AFAICS, disk-block CRCs do not guard against mishaps involving intended writes. They will help guard against data corruption that might creep in due to outside factors, however. I couldn't describe better -:) Vadim
Re: [HACKERS] CRCs
On Fri, Jan 12, 2001 at 06:06:21PM -0500, Tom Lane wrote: [EMAIL PROTECTED] (Nathan Myers) writes: "Changes must be logged *before* changed data pages written". If this rule will be broken then data files will be inconsistent after crash recovery and you will not notice this, w/wo CRC in data blocks. You can include the data blocks' CRCs in the log entries. How could it help? It wouldn't help you recover, but you would be able to report that you cannot recover. How? The scenario Vadim is pointing out is where the disk drive writes a changed data block in advance of the WAL log entry describing the change. Then power drops and the WAL entry never gets made. At restart, how will you realize that that data block now contains data you don't want? There's not even a log entry telling you you need to look at it, much less one that tells you what should be in it. OK. In that case, recent transactions that were acknowledged to user programs just disappear. The database isn't corrupt, but it doesn't contain what the user believes is in it. The only way I can think of to guard against that is to have a sequence number in each acknowledgement sent to users, and also reported when the database recovers. If users log their ACK numbers, they can be compared when the database comes back up. Obviously it's better to configure the disk so that it doesn't lie about what's been written. AFAICS, disk-block CRCs do not guard against mishaps involving intended writes. They will help guard against data corruption that might creep in due to outside factors, however. Right. Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CRCs
"Mikheev, Vadim" [EMAIL PROTECTED] writes: If log record was not really flushed on disk in 3. but on-disk image of index block was updated in 4. and system crashed after this then after restart recovery you'll have unlawful index tuple pointing to where? Who knows! No guarantee that corresponding heap tuple was flushed on disk. This example doesn't seem very convincing. Wouldn't the XLOG entry describing creation of the heap tuple appear in the log before the one for the index tuple? Or are you assuming that both these XLOG entries are lost due to disk drive malfeasance? regards, tom lane
Re: [HACKERS] CRCs
On Fri, Jan 12, 2001 at 04:10:36PM -0800, Alfred Perlstein wrote: Nathan Myers [EMAIL PROTECTED] [010112 15:49] wrote: Obviously it's better to configure the disk so that it doesn't lie about what's been written. I thought WAL+fsync wasn't supposed to allow this to happen? It's an OS and hardware configuration matter; you only get correct WAL+fsync semantics if the underlying system is configured right. IDE disks are almost always configured wrong, to spoof benchmarks; SCSI disks sometimes are. If they're configured wrong, then (now that we have a CRC in the log entry) in the event of a power outage the database might come back with recently-acknowledged transaction results discarded. That's a lot better than a corrupt database, but it's not industrial-grade semantics. (Use a UPS.) Nathan Myers [EMAIL PROTECTED]
RE: [HACKERS] CRCs
If log record was not really flushed on disk in 3. but on-disk image of index block was updated in 4. and system crashed after this then after restart recovery you'll have unlawful index tuple pointing to where? Who knows! No guarantee that corresponding heap tuple was flushed on disk. This example doesn't seem very convincing. Wouldn't the XLOG entry describing creation of the heap tuple appear in the log before the one for the index tuple? Or are you assuming that both these XLOG entries are lost due to disk drive malfeasance? Yes, that was assumed. When UNDO is implemented and uncommitted tuples are removed by the rollback part of after-crash recovery, we'll get a corrupted database without that assumption. Vadim
Re: [HACKERS] CRCs
Nathan Myers wrote: It wouldn't help you recover, but you would be able to report that you cannot recover. While this could help detect hardware problems, you still won't be able to detect some (many) memory errors, because the CRC will be calculated on the already corrupted data. Of course there are other situations where the CRC will not match, and, appropriately logged, that is a reliable heads-up warning. Bye! -- Daniele
Re: [HACKERS] CRCs
AFAICS, disk-block CRCs do not guard against mishaps involving intended writes. They will help guard against data corruption that might creep in due to outside factors, however. Right. Given that we seem to have agreed on that, I withdraw my complaint about disk-block-CRC not being in there for 7.1. I think we are still a ways away from the point where externally-induced corruption is a major share of our failure rate ;-). 7.2 or so will be time enough to add this feature, and I'd really rather not force another initdb for 7.1. regards, tom lane
[HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
"Mikheev, Vadim" [EMAIL PROTECTED] writes: Actually, I'd expect the CRC check to catch an all-zeroes page (if it fails to complain, then you misimplemented the CRC), so that would be the place to deal with it now. I've used standard CRC32 implementation you pointed me to -:) But CRC is used in WAL records only. Oh. I thought we'd agreed that a CRC on each stored disk block would be a good idea as well. I take it you didn't do that. Do we want to consider doing this (and forcing another initdb)? Or shall we say "too late for 7.1"? regards, tom lane
Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
"Mikheev, Vadim" [EMAIL PROTECTED] writes: Actually, I'd expect the CRC check to catch an all-zeroes page (if it fails to complain, then you misimplemented the CRC), so that would be the place to deal with it now. I've used standard CRC32 implementation you pointed me to -:) But CRC is used in WAL records only. Oh. I thought we'd agreed that a CRC on each stored disk block would be a good idea as well. I take it you didn't do that. No, I thought we agreed disk block CRC was way overkill. If the CRC on the WAL log checks for errors that are not checked anywhere else, then fine, but I thought disk CRC would just duplicate the I/O subsystem/disk checks. -- Bruce Momjian
Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
Bruce Momjian [EMAIL PROTECTED] writes: Oh. I thought we'd agreed that a CRC on each stored disk block would be a good idea as well. I take it you didn't do that. No, I thought we agreed disk block CRC was way overkill. If the CRC on the WAL log checks for errors that are not checked anywhere else, then fine, but I thought disk CRC would just duplicate the I/O subsystem/disk checks. A disk-block CRC would detect partially written blocks (ie, power drops after disk has written M of the N sectors in a block). The disk's own checks will NOT consider this condition a failure. I'm not convinced that WAL will reliably detect it either (Vadim?). Certainly WAL will not help for corruption caused by external agents, away from any updates that are actually being performed/logged. regards, tom lane
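A sketch of what such a per-block check could look like, assuming (hypothetically) that the first four bytes of each page were reserved for a CRC over the rest of it; a torn page, where only some sectors of the 8K block reached the platter, would almost certainly fail the comparison even though every individual sector passes the drive's own check:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLOCK_SIZE 8192

    uint32_t crc32_buf(const void *buf, size_t len);   /* as in the earlier sketch */

    /* Stamp a page before it is written: the CRC is computed over the page
     * with its own CRC field zeroed, then stored in the first 4 bytes. */
    static void
    page_set_crc(unsigned char page[BLOCK_SIZE])
    {
        uint32_t crc;
        memset(page, 0, 4);
        crc = crc32_buf(page, BLOCK_SIZE);
        memcpy(page, &crc, 4);
    }

    /* Check a page as it is read back; false means a torn or otherwise
     * corrupted block, which the drive's sector-level checks will not report. */
    static bool
    page_crc_ok(const unsigned char page[BLOCK_SIZE])
    {
        unsigned char copy[BLOCK_SIZE];
        uint32_t stored, computed;

        memcpy(&stored, page, 4);
        memcpy(copy, page, BLOCK_SIZE);
        memset(copy, 0, 4);
        computed = crc32_buf(copy, BLOCK_SIZE);
        return stored == computed;
    }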
Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
At 21:55 11/01/01 -0500, Tom Lane wrote: Oh. I thought we'd agreed that a CRC on each stored disk block would be a good idea as well. I take it you didn't do that. Do we want to consider doing this (and forcing another initdb)? Or shall we say "too late for 7.1"? I thought it was coming too. I'd like to see it - if it's not too hard in this release. Philip Warner, Albatross Consulting Pty. Ltd.
Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)
No, I thought we agreed disk block CRC was way overkill. If the CRC on the WAL log checks for errors that are not checked anywhere else, then fine, but I thought disk CRC would just duplicate the I/O subsystem/disk checks. A disk-block CRC would detect partially written blocks (ie, power drops after disk has written M of the N sectors in a block). The disk's own checks will NOT consider this condition a failure. I'm not convinced that WAL will reliably detect it either (Vadim?). Certainly WAL will not help for corruption caused by external agents, away from any updates that are actually being performed/logged. The "physical log" idea proposed by Andreas is implemented! Now WAL saves whole data blocks on the first modification after a checkpoint. This way, on recovery, modified data blocks will first be restored *as a whole*. Isn't that much better than just detecting partial writes? Only one type of modification isn't covered at the moment - updated t_infomask of heap tuples. What do you mean by "external agents"? Vadim
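A rough sketch, in C, of the recovery-side half of what Vadim describes; the structures and names here are invented (the real WAL record format is different), but the idea is that the logged full-page image is written back wholesale before finer-grained changes are replayed, so a torn or incomplete write of that block since the checkpoint simply does not matter:

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE 8192

    /* Hypothetical log record: a whole copy of the block, taken the first
     * time it was modified after the last checkpoint. */
    typedef struct {
        uint32_t      blkno;
        unsigned char image[BLOCK_SIZE];
    } FullPageRecord;

    /* Put the known-good image back over whatever is on disk.
     * (A real implementation would also fsync the file afterwards.) */
    static int
    restore_full_page(FILE *datafile, const FullPageRecord *rec)
    {
        if (fseek(datafile, (long) rec->blkno * BLOCK_SIZE, SEEK_SET) != 0)
            return -1;
        if (fwrite(rec->image, 1, BLOCK_SIZE, datafile) != BLOCK_SIZE)
            return -1;
        return fflush(datafile) == 0 ? 0 : -1;
    }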
Re: [HACKERS] CRCs (was: beta testing version)
On Wed, Dec 06, 2000 at 06:53:37PM -0600, Bruce Guenter wrote: On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote: On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote: I don't know how pgsql does it, but the only safe way I know of is to include an "end" marker after each record. An "end" marker is not sufficient, unless all writes are done in one-sector units with an fsync between, and the drive buffering is turned off. That's why an end marker must follow all valid records. When you write records, you don't touch the marker, and add an end marker to the end of the records you've written. After writing and syncing the records, you rewrite the end marker to indicate that the data following it is valid, and sync again. There is no state in that sequence in which partially- written data could be confused as real data, assuming either your drives aren't doing write-back caching or you have a UPS, and fsync doesn't return until the drives return success. That requires an extra out-of-sequence write. Any other way I've seen discussed (here and elsewhere) either - Assume that a CRC is a guarantee. We are already assuming a CRC is a guarantee. The drive computes a CRC for each sector, and if the CRC is OK the drive is happy. CRC errors within the drive are quite frequent, and the drive re-reads when a bad CRC comes up. The kind of data failures that a CRC is guaranteed to catch (N-bit errors) are almost precisely those that a mis-read on a hardware sector would cause. They catch a single mis-read, but not necessarily the quite likely double mis-read. ... A CRC would be a good addition to help ensure the data wasn't broken by flakey drive firmware, but doesn't guarantee consistency. No, a CRC would be a good addition to compensate for sector write reordering, which is done both by the OS and by the drive, even for "atomic" writes. But it doesn't guarantee consistency, even in that case. There is a possibility (however small) that the random data that was located in the sectors before the write will match the CRC. Generally, there are no guarantees, only reasonable expectations. A 64-bit CRC would give sufficient confidence without the out-of-sequence write, and also detect corruption from any source including power outage. (I'd also like to see CRCs on all the table blocks as well; is there a place to put them?) Nathan Myers [EMAIL PROTECTED]
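For reference, a sketch of the end-marker protocol Bruce Guenter describes; the marker values, offsets, and function names are invented, and the interesting parts are the ordering of the two syncs and the single-byte, sector-atomic marker flip, which is exactly the extra out-of-sequence write Nathan objects to:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MARK_END   0xFF    /* "the log ends here" */
    #define MARK_MORE  0x00    /* "more records follow" */

    /* old_end_off: offset of the current end-marker byte.
     * buf/len: new records, already terminated by their own MARK_END byte.
     * Returns the offset of the new end marker, or -1 on error. */
    static off_t
    log_append(int fd, off_t old_end_off, const unsigned char *buf, size_t len)
    {
        unsigned char more = MARK_MORE;

        /* 1. Write the new records after the existing end marker and sync;
         *    until the old marker is flipped, readers ignore them. */
        if (pwrite(fd, buf, len, old_end_off + 1) != (ssize_t) len)
            return -1;
        if (fdatasync(fd) != 0)
            return -1;

        /* 2. Flip the old end marker with a single-byte write (it cannot
         *    straddle a sector), then sync again to make the records valid. */
        if (pwrite(fd, &more, 1, old_end_off) != 1)
            return -1;
        if (fdatasync(fd) != 0)
            return -1;

        return old_end_off + (off_t) len;   /* offset of the new end marker */
    }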
RE: [HACKERS] CRCs (was: beta testing version)
That's why an end marker must follow all valid records. ... That requires an extra out-of-sequence write. Yes, and it also increases the probability of corrupting data already committed to the log. (I'd also like to see CRCs on all the table blocks as well; is there a place to put them?) Do we need it? The "physical log" feature suggested by Andreas will protect us from non-atomic data block writes. Vadim
Re: [HACKERS] CRCs (was: beta testing version)
On Thu, Dec 07, 2000 at 12:22:12PM -0800, Mikheev, Vadim wrote: That's why an end marker must follow all valid records. ... That requires an extra out-of-sequence write. Yes, and it also increases the probability of corrupting data already committed to the log. (I'd also like to see CRCs on all the table blocks as well; is there a place to put them?) Do we need it? The "physical log" feature suggested by Andreas will protect us from non-atomic data block writes. There are myriad sources of corruption, including RAM bit rot and software bugs. The earlier and more reliably it's caught, the better. The goal is to be able to say that a power outage won't invisibly corrupt your database. Here are sources for a 64-bit CRC computation, under a BSD license: http://gcc.gnu.org/ml/gcc/1999-11n/msg00592.html Nathan Myers [EMAIL PROTECTED]
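For flavor, a minimal bit-at-a-time 64-bit CRC in C (ECMA-182 polynomial, no reflection, zero initial value; one of several common CRC-64 variants, and not necessarily the same as the code behind that link); a production version would be table-driven for speed:

    #include <stddef.h>
    #include <stdint.h>

    #define CRC64_POLY UINT64_C(0x42F0E1EBA9EA3693)   /* ECMA-182 */

    uint64_t
    crc64(const void *data, size_t len)
    {
        const unsigned char *p = data;
        uint64_t crc = 0;

        while (len--) {
            crc ^= (uint64_t) *p++ << 56;
            for (int i = 0; i < 8; i++)
                crc = (crc & (UINT64_C(1) << 63))
                    ? (crc << 1) ^ CRC64_POLY
                    : (crc << 1);
        }
        return crc;
    }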
[HACKERS] CRCs (was: beta testing version)
On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote: On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote: Zeugswetter Andreas SB [EMAIL PROTECTED] writes: Yes, but there would need to be a way to verify the last page or record from txlog when running on crap hardware. How exactly *do* we determine where the end of the valid log data is, anyway? I don't know how pgsql does it, but the only safe way I know of is to include an "end" marker after each record. When writing to the log, append the records after the last end marker, ending with another end marker, and fdatasync the log. Then overwrite the previous end marker to indicate it's not the end of the log any more and fdatasync again. To ensure that it is written atomically, the end marker must not cross a hardware sector boundary (typically 512 bytes). This can be trivially guaranteed by making the marker a single byte. An "end" marker is not sufficient, unless all writes are done in one-sector units with an fsync between, and the drive buffering is turned off. For larger writes the OS will re-order the writes. Most drives will re-order them too, even if the OS doesn't. Any other way I've seen discussed (here and elsewhere) either: - requires atomic multi-sector writes, which are possible only if all the sectors are sequential on disk, the kernel issues one large write for all of them, and you don't power-fail in the middle of the write, or - assumes that a CRC is a guarantee. We are already assuming a CRC is a guarantee. The drive computes a CRC for each sector, and if the CRC is OK the drive is happy. CRC errors within the drive are quite frequent, and the drive re-reads when a bad CRC comes up. (If it sees errors too frequently on a sector, it rewrites it; if it sees persistent errors on a sector, it marks that one bad and relocates it.) You can expect to experience, in production, about the error rate that the drive manufacturer specifies as "maximum". ... A CRC would be a good addition to help ensure the data wasn't broken by flakey drive firmware, but doesn't guarantee consistency. No, a CRC would be a good addition to compensate for sector write reordering, which is done both by the OS and by the drive, even for "atomic" writes. It is not only "flaky" or "cheap" drives that re-order writes, or acknowledge writes as complete that are not yet on disk. You can generally assume that *any* drive does it unless you have specifically turned that off. The assumption is that if you care, you have a UPS, or at least have configured the hardware yourself to meet your needs. It is purely wishful thinking to believe otherwise. Nathan Myers [EMAIL PROTECTED]
Re: [HACKERS] CRCs (was: beta testing version)
On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote: On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote: On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote: How exactly *do* we determine where the end of the valid log data is, anyway? I don't know how pgsql does it, but the only safe way I know of is to include an "end" marker after each record. When writing to the log, append the records after the last end marker, ending with another end marker, and fdatasync the log. Then overwrite the previous end marker to indicate it's not the end of the log any more and fdatasync again. To ensure that it is written atomically, the end marker must not cross a hardware sector boundary (typically 512 bytes). This can be trivially guaranteed by making the marker a single byte. An "end" marker is not sufficient, unless all writes are done in one-sector units with an fsync between, and the drive buffering is turned off. That's why an end marker must follow all valid records. When you write records, you don't touch the marker, and add an end marker to the end of the records you've written. After writing and syncing the records, you rewrite the end marker to indicate that the data following it is valid, and sync again. There is no state in that sequence in which partially- written data could be confused as real data, assuming either your drives aren't doing write-back caching or you have a UPS, and fsync doesn't return until the drives return success. For larger writes the OS will re-order the writes. Most drives will re-order them too, even if the OS doesn't. I'm well aware of that. Any other way I've seen discussed (here and elsewhere) either - Assume that a CRC is a guarantee. We are already assuming a CRC is a guarantee. The drive computes a CRC for each sector, and if the CRC is OK the drive is happy. CRC errors within the drive are quite frequent, and the drive re-reads when a bad CRC comes up. The kind of data failures that a CRC is guaranteed to catch (N-bit errors) are almost precisely those that a mis-read on a hardware sector would cause. ... A CRC would be a good addition to help ensure the data wasn't broken by flakey drive firmware, but doesn't guarantee consistency. No, a CRC would be a good addition to compensate for sector write reordering, which is done both by the OS and by the drive, even for "atomic" writes. But it doesn't guarantee consistency, even in that case. There is a possibility (however small) that the random data that was located in the sectors before the write will match the CRC. -- Bruce Guenter [EMAIL PROTECTED] http://em.ca/~bruceg/ PGP signature