RE: [HACKERS] CRCs

2001-01-16 Thread Mikheev, Vadim

 Instead of a partial row CRC, we could just as well use some other bit
 of identifying information, say the row OID.  Given a block CRC on the
 heap page, we'll be pretty confident already that the heap page is OK,
 we just need to guard against the possibility that it's older than the
 index item.  Checking that there is a valid tuple at the slot 
 indicated by the index item, and that it has the right OID, should be
 a good enough (and cheap enough) test.

This would work in 7.1 but not in 7.2 anyway (assuming UNDO and true
transaction rollback are implemented). There will be no permanent
pg_log, and after crash recovery any heap tuple with unknown t_xmin status
will be assumed committed. Rollback will remove tuples inserted by
uncommitted transactions, but that will be possible only for *logged*
modifications.

One should properly configure disk drives instead of hacking around
this problem. "Log before modifying data pages" is *the* rule for any
WAL system: Oracle, Informix, and a dozen others.

Vadim
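The slot-plus-OID check Tom proposes above can be sketched as follows. This is a Python sketch with hypothetical in-memory structures standing in for heap pages and index items; PostgreSQL's real on-disk formats differ.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HeapTuple:
    oid: int
    data: bytes

@dataclass
class IndexEntry:
    block: int   # heap block number (part of the TID)
    slot: int    # line-pointer slot within the block
    oid: int     # redundant copy of the row OID, kept for validation

def fetch_via_index(heap: dict, entry: IndexEntry) -> Optional[HeapTuple]:
    """Follow an index entry to the heap, treating a missing slot or an
    OID mismatch as evidence that the heap page is older than the index."""
    page = heap.get(entry.block)
    if page is None or entry.slot >= len(page) or page[entry.slot] is None:
        return None          # dangling index entry: no tuple at that slot
    tup = page[entry.slot]
    if tup.oid != entry.oid:
        return None          # slot reused by a different row: stale index
    return tup

heap = {0: [HeapTuple(oid=5001, data=b"row one"), None]}
good = IndexEntry(block=0, slot=0, oid=5001)
stale = IndexEntry(block=0, slot=0, oid=4999)    # OID no longer matches
dangling = IndexEntry(block=0, slot=1, oid=5002) # empty slot
```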



AW: [HACKERS] CRCs

2001-01-15 Thread Zeugswetter Andreas SB


 Instead of a partial row CRC, we could just as well use some other bit
 of identifying information, say the row OID.  Given a block CRC on the
 heap page, we'll be pretty confident already that the heap page is OK,
 we just need to guard against the possibility that it's older than the
 index item.  Checking that there is a valid tuple at the slot indicated
 by the index item, and that it has the right OID, should be a good
 enough (and cheap enough) test.

I would hardly call an additional 4 bytes for OID per index entry cheap.

Andreas



Re: [HACKERS] CRCs

2001-01-15 Thread Nathan Myers

Andreas SB Zeugswetter wrote:
 Tom Lane wrote:
  Instead of a partial row CRC, we could just as well use some other
  bit of identifying information, say the row OID. ... Checking that
  there is a valid tuple at the slot indicated by the index item,
  and that it has the right OID, should be a good enough (and cheap
  enough) test.
 
 I would hardly call an additional 4 bytes for OID per index entry
 cheap.

"Cheap enough" is very different from "cheap".  Undetected corruption 
may be arbitrarily expensive when it finally manifests itself.  

That said, maybe storing just the low byte or two of the OID in the 
index would be good enough.  Also, maybe the OID would be there by 
default, but could be ifdef'd out if the size of the indices affects
you noticeably, and you know that your equipment (unlike most) really
does implement strict write ordering.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] CRCs

2001-01-13 Thread Nathan Myers

On Fri, Jan 12, 2001 at 11:30:30PM -0500, Tom Lane wrote:
  AFAICS, disk-block CRCs do not guard against mishaps involving intended
  writes.  They will help guard against data corruption that might creep
  in due to outside factors, however.
 
  Right.  
 
 Given that we seem to have agreed on that, I withdraw my complaint about
 disk-block-CRC not being in there for 7.1.  I think we are still a ways
 away from the point where externally-induced corruption is a major share
 of our failure rate ;-).  7.2 or so will be time enough to add this
 feature, and I'd really rather not force another initdb for 7.1.

More to the point, putting CRCs on data blocks might have unintended
consequences for dump or vacuum processes.  7.1 is a monumental 
accomplishment even without corruption detection, and the sooner
the world has it, the better.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] CRCs

2001-01-13 Thread Nathan Myers

On Fri, Jan 12, 2001 at 04:38:37PM -0800, Mikheev, Vadim wrote:
 Example.
 1. Tuple was inserted into index.
 2. Looking for free buffer bufmgr decides to write index block.
 3. Following WAL core rule bufmgr first calls XLogFlush() to write
and fsync log record related to index tuple insertion.
 4. *Believing* that log record is on disk now (after successful fsync)
bufmgr writes index block.
 
 If log record was not really flushed on disk in 3. but on-disk image of
 index block was updated in 4. and system crashed after this then after
 restart recovery you'll have unlawful index tuple pointing to where?
 Who knows! No guarantee that corresponding heap tuple was flushed on
 disk.
 
 Isn't database corrupted now?

Note, I haven't read the WAL code, so much of what I've said is based 
on what I know is and isn't possible with logging, rather than on 
Vadim's actual choices.  I know it's *possible* to implement a logging 
database which can maintain consistency without need for strict write 
ordering; but without strict write ordering, it is not possible to 
guarantee durable transactions.  That is, after a power outage, such 
a database may be guaranteed to recover uncorrupted, but some number 
(>= 0) of the last few acknowledged/committed transactions may be lost.

Vadim's implementation assumes strict write ordering, so that (e.g.) 
with IDE disks a corrupt database is possible in the event of a power 
outage.  (Database and OS crashes don't count; those don't keep the 
blocks from finding their way from on-disk buffers to disk.)  This is 
no criticism; it is more efficient to assume strict write ordering, 
and a database that can lose (the last few) committed transactions 
has limited value.

To achieve disk write-order independence is probably not a worthwhile 
goal, but for systems that cannot provide strict write ordering (e.g., 
most PCs) it would be helpful to be able to detect that the database 
has become corrupted.  In Vadim's example above, if the index were to
contain not only the heap blocks' numbers, but also their CRCs, then 
the corruption could be detected when the index is used.  When the 
block is read in, its CRC is checked, and when it is referenced via 
the index, the two CRC values are simply compared and the corruption
is revealed. 

On a machine that does provide strict write ordering, the CRCs in the 
index might be unnecessary overhead, but they also provide cross-checks
to help detect corruption introduced by bugs and whatnot.

Or maybe I don't know what I'm talking about.  

Nathan Myers
[EMAIL PROTECTED]
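The index-carried block CRC described above can be sketched as follows. This is a hypothetical Python model (and, as Tom points out elsewhere in the thread, keeping such CRCs current on every heap change would be very expensive in practice):

```python
import zlib

def block_crc(block: bytes) -> int:
    return zlib.crc32(block)

class Index:
    """Hypothetical index whose entries carry the expected CRC of the
    heap block they reference."""
    def __init__(self):
        self.entries = {}            # key -> (block_no, expected_crc)
    def insert(self, key, block_no: int, block_bytes: bytes):
        self.entries[key] = (block_no, block_crc(block_bytes))
    def check(self, key, heap: dict) -> bool:
        block_no, expected = self.entries[key]
        return block_crc(heap[block_no]) == expected

heap = {7: b"heap page image, current"}
idx = Index()
idx.insert("k", 7, heap[7])
assert idx.check("k", heap)              # index and heap agree
heap[7] = b"heap page image, stale"      # older image survived a crash
assert not idx.check("k", heap)          # mismatch reveals the corruption
```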



Re: [HACKERS] CRCs

2001-01-13 Thread Nathan Myers

On Sat, Jan 13, 2001 at 12:49:34PM -0500, Tom Lane wrote:
 [EMAIL PROTECTED] (Nathan Myers) writes:
  ... for systems that cannot provide strict write ordering (e.g., 
  most PCs) it would be helpful to be able to detect that the database 
  has become corrupted.  In Vadim's example above, if the index were to
  contain not only the heap blocks' numbers, but also their CRCs, then 
  the corruption could be detected when the index is used.  ...
 
 A row-level CRC might be useful for this, but it would have to be on
 the data only (not the tuple commit-status bits).  It'd be totally
 impractical with a block CRC, I think.   ...

I almost wrote about an indirect scheme to share the expected block CRC
value among all the index entries that need it, but thought it would 
distract from the correct approach:

 Instead of a partial row CRC, we could just as well use some other bit
 of identifying information, say the row OID.   ...

Good.  But, wouldn't the TID be more specific?  True, it would be pretty
unlikely for a block to have an old tuple with the right OID in the same
place.  Belt-and-braces says check both :-).  Either way, the check seems 
independent of block CRCs.   Would this check be simple enough to be safe
for 7.1? 

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] CRCs

2001-01-13 Thread Horst Herb

On Sunday 14 January 2001 04:49, Tom Lane wrote:

 A row-level CRC might be useful for this, but it would have to be on
 the data only (not the tuple commit-status bits).  It'd be totally
 impractical with a block CRC, I think.  To do it with a block CRC, every
 time you changed *anything* in a heap page, you'd have to find all the
 index items for each row on the page and update their copies of the
 heap block's CRC.  That could easily turn one disk-write into hundreds,
 not to mention the index search costs.  Similarly, a check value that is
 affected by tuple status updates would enormously increase the cost of
 marking tuples committed or dead.

Ah, finally. Looks like we are moving in circles (or spirals ;-) ). Remember
that some 3-4 months ago I requested help from this list several times
regarding a trigger function that computes a CRC over only the user-defined
attributes? I wrote one in pgtcl, which was slow, and had trouble with the C
equivalent due to lack of documentation. I still believe this is useful
enough that it should be an option in Postgres and not a user-defined
function.

Horst
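A row CRC restricted to the user-defined attributes, as Horst describes, ignores the tuple status bits entirely, so commit-status updates never invalidate it. A hypothetical sketch (the column names and the canonical serialization are illustrative only):

```python
import zlib

SYSTEM_COLUMNS = {"oid", "xmin", "xmax", "ctid"}   # names assumed for illustration

def row_crc(row: dict) -> int:
    """CRC over the user-defined attributes only, in a canonical order,
    so changes to system columns never perturb the value."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row)
                       if k not in SYSTEM_COLUMNS)
    return zlib.crc32(payload.encode())

r = {"oid": 5001, "xmin": 42, "name": "alice", "balance": 100}
assert row_crc(r) == row_crc({**r, "xmin": 99})       # status change: same CRC
assert row_crc(r) != row_crc({**r, "balance": 101})   # data change: new CRC
```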



Re: [HACKERS] CRCs

2001-01-13 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 Instead of a partial row CRC, we could just as well use some other bit
 of identifying information, say the row OID.   ...

 Good.  But, wouldn't the TID be more specific?

Uh, the TID *is* the pointer from index to heap.  There's no redundancy
that way.

 Would this check be simple enough to be safe for 7.1? 

It'd probably be safe, but adding OIDs to index tuples would force an
initdb, which I'd rather avoid at this stage of the cycle.

regards, tom lane



AW: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-12 Thread Zeugswetter Andreas SB


 A disk-block CRC would detect partially written blocks (ie, power drops
 after disk has written M of the N sectors in a block).  The disk's own
 checks will NOT consider this condition a failure.

But physical log recovery will rewrite every page that was changed
after last checkpoint, thus this is not an issue anymore.

  I'm not convinced
 that WAL will reliably detect it either (Vadim?).  Certainly WAL will
 not help for corruption caused by external agents, away from any updates
 that are actually being performed/logged.

The external agent (if malevolent) could write a correct CRC anyway.
If, on the other hand, the agent writes complete garbage, vacuum will notice.

Andreas



RE: [HACKERS] CRCs

2001-01-12 Thread Mikheev, Vadim

  But physical log recovery will rewrite every page that was changed
  after last checkpoint, thus this is not an issue anymore.
 
 No.  That assumes that when the drive _says_ the block is written, 
 it is really on the disk.  That is not true for IDE drives.  It is 
 true for SCSI drives only when the SCSI spec is implemented correctly,
 but implementing the spec correctly interferes with favorable 
 benchmark results.

You know - this is the *core* assumption. If the drive lies about this
then *nothing* will help you. Do you remember the core rule of WAL?
"Changes must be logged *before* changed data pages are written."
If this rule is broken then the data files will be inconsistent
after crash recovery and you will not notice it, with or without CRCs
in the data blocks.

I agree that CRCs could help to detect other errors, but it's probably
too late for 7.1.

Vadim



Re: [HACKERS] CRCs

2001-01-12 Thread Nathan Myers

On Fri, Jan 12, 2001 at 01:07:56PM -0800, Mikheev, Vadim wrote:
   But physical log recovery will rewrite every page that was changed
   after last checkpoint, thus this is not an issue anymore.
  
  No.  That assumes that when the drive _says_ the block is written, 
  it is really on the disk.  That is not true for IDE drives.  It is 
  true for SCSI drives only when the SCSI spec is implemented correctly,
  but implementing the spec correctly interferes with favorable 
  benchmark results.
 
 You know - this is *core* assumption. If drive lies about this then
 *nothing* will help you. Do you remember core rule of WAL?
 "Changes must be logged *before* changed data pages written".
 If this rule will be broken then data files will be inconsistent
 after crash recovery and you will not notice this, w/wo CRC in
 data blocks.

You can include the data blocks' CRCs in the log entries.

 I agreed that CRCs could help to detect other errors but probably
 it's too late for 7.1.

7.2 is not too far off.  I'm hoping to see it then.

Nathan Myers
[EMAIL PROTECTED]



RE: [HACKERS] CRCs

2001-01-12 Thread Mikheev, Vadim

  You know - this is *core* assumption. If drive lies about this then
  *nothing* will help you. Do you remember core rule of WAL?
  "Changes must be logged *before* changed data pages written".
  If this rule will be broken then data files will be inconsistent
  after crash recovery and you will not notice this, w/wo CRC in
  data blocks.
 
 You can include the data blocks' CRCs in the log entries.

How could it help?

Vadim



Re: [HACKERS] CRCs

2001-01-12 Thread Nathan Myers

On Fri, Jan 12, 2001 at 02:16:07PM -0800, Mikheev, Vadim wrote:
   You know - this is *core* assumption. If drive lies about this then
   *nothing* will help you. Do you remember core rule of WAL?
   "Changes must be logged *before* changed data pages written".
   If this rule will be broken then data files will be inconsistent
   after crash recovery and you will not notice this, w/wo CRC in
   data blocks.
  
  You can include the data blocks' CRCs in the log entries.
 
 How could it help?

It wouldn't help you recover, but you would be able to report that 
you cannot recover.

To be more specific, if the blocks referenced in the log are partially 
written, their CRCs will (probably) be wrong.  If they are not 
physically written at all, their CRCs will be correct but will 
not match what is in the log.  In either case the user will know 
immediately that the database has been corrupted, and must fall 
back on a failover image or backup.

It would be no bad thing to include the CRC of the block referenced
wherever in the file format that a block reference lives.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] CRCs

2001-01-12 Thread Tom Lane

[EMAIL PROTECTED] (Nathan Myers) writes:
 "Changes must be logged *before* changed data pages written".
 If this rule will be broken then data files will be inconsistent
 after crash recovery and you will not notice this, w/wo CRC in
 data blocks.
 
 You can include the data blocks' CRCs in the log entries.
 
 How could it help?

 It wouldn't help you recover, but you would be able to report that 
 you cannot recover.

How?  The scenario Vadim is pointing out is where the disk drive writes
a changed data block in advance of the WAL log entry describing the
change.  Then power drops and the WAL entry never gets made.  At
restart, how will you realize that that data block now contains data you
don't want?  There's not even a log entry telling you you need to look
at it, much less one that tells you what should be in it.

AFAICS, disk-block CRCs do not guard against mishaps involving intended
writes.  They will help guard against data corruption that might creep
in due to outside factors, however.

regards, tom lane



RE: [HACKERS] CRCs

2001-01-12 Thread Mikheev, Vadim

  It wouldn't help you recover, but you would be able to report that 
  you cannot recover.
 
 How? The scenario Vadim is pointing out is where the disk 
 drive writes a changed data block in advance of the WAL log entry
 describing the change. Then power drops and the WAL entry never gets
 made. At restart, how will you realize that that data block now
 contains data you don't want? There's not even a log entry telling
 you you need to look at it, much less one that tells you what should
 be in it.
 
 AFAICS, disk-block CRCs do not guard against mishaps involving intended
 writes. They will help guard against data corruption that might creep
 in due to outside factors, however.

I couldn't have described it better -:)

Vadim



Re: [HACKERS] CRCs

2001-01-12 Thread Nathan Myers

On Fri, Jan 12, 2001 at 06:06:21PM -0500, Tom Lane wrote:
 [EMAIL PROTECTED] (Nathan Myers) writes:
  "Changes must be logged *before* changed data pages written".
  If this rule will be broken then data files will be inconsistent
  after crash recovery and you will not notice this, w/wo CRC in
  data blocks.
  
  You can include the data blocks' CRCs in the log entries.
  
  How could it help?
 
  It wouldn't help you recover, but you would be able to report that 
  you cannot recover.
 
 How?  The scenario Vadim is pointing out is where the disk drive writes
 a changed data block in advance of the WAL log entry describing the
 change.  Then power drops and the WAL entry never gets made.  At
 restart, how will you realize that that data block now contains data you
 don't want?  There's not even a log entry telling you you need to look
 at it, much less one that tells you what should be in it.

OK.  In that case, recent transactions that were acknowledged to user 
programs just disappear.  The database isn't corrupt, but it doesn't
contain what the user believes is in it.

The only way I can think of to guard against that is to have a sequence
number in each acknowledgement sent to users, and also reported when the 
database recovers.  If users log their ACK numbers, they can be compared
when the database comes back up.

Obviously it's better to configure the disk so that it doesn't lie about
what's been written.

 AFAICS, disk-block CRCs do not guard against mishaps involving intended
 writes.  They will help guard against data corruption that might creep
 in due to outside factors, however.

Right.  

Nathan Myers
[EMAIL PROTECTED]
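The acknowledgement-sequence-number scheme sketched above reduces to a simple comparison after recovery. A hypothetical Python sketch:

```python
def lost_transactions(client_acks, recovered_seq):
    """Sequence numbers the client saw acknowledged but the recovered
    database no longer knows about. Hypothetical scheme: the server
    stamps every commit acknowledgement with a monotonic sequence
    number and reports the highest surviving one after recovery."""
    return [s for s in client_acks if s > recovered_seq]

acks = [101, 102, 103, 104]                  # client-side log of ACK numbers
assert lost_transactions(acks, 104) == []                 # nothing lost
assert lost_transactions(acks, 102) == [103, 104]         # last two vanished
```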



Re: [HACKERS] CRCs

2001-01-12 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 If log record was not really flushed on disk in 3. but on-disk image of
 index block was updated in 4. and system crashed after this then after
 restart recovery you'll have unlawful index tuple pointing to where?
 Who knows! No guarantee that corresponding heap tuple was flushed on
 disk.

This example doesn't seem very convincing.  Wouldn't the XLOG entry
describing creation of the heap tuple appear in the log before the one
for the index tuple?  Or are you assuming that both these XLOG entries
are lost due to disk drive malfeasance?

regards, tom lane



Re: [HACKERS] CRCs

2001-01-12 Thread Nathan Myers

On Fri, Jan 12, 2001 at 04:10:36PM -0800, Alfred Perlstein wrote:
 Nathan Myers [EMAIL PROTECTED] [010112 15:49] wrote:
 
  Obviously it's better to configure the disk so that it doesn't
  lie about what's been written.
 
 I thought WAL+fsync wasn't supposed to allow this to happen?

It's an OS and hardware configuration matter; you only get correct
WAL+fsync semantics if the underlying system is configured right.  
IDE disks are almost always configured wrong, to spoof benchmarks; 
SCSI disks sometimes are.

If they're configured wrong, then (now that we have a CRC in the 
log entry) in the event of a power outage the database might come 
back with recently-acknowledged transaction results discarded.
That's a lot better than a corrupt database, but it's not 
industrial-grade semantics.  (Use a UPS.)

Nathan Myers
[EMAIL PROTECTED]



RE: [HACKERS] CRCs

2001-01-12 Thread Mikheev, Vadim

  If log record was not really flushed on disk in 3. but 
  on-disk image of index block was updated in 4. and system
  crashed after this then after restart recovery you'll have
  unlawful index tuple pointing to where? Who knows!
  No guarantee that corresponding heap tuple was flushed on
  disk.
 
 This example doesn't seem very convincing.  Wouldn't the XLOG entry
 describing creation of the heap tuple appear in the log before the one
 for the index tuple?  Or are you assuming that both these XLOG entries
 are lost due to disk drive malfeasance?

Yes, that was assumed.
Once UNDO is implemented and uncommitted tuples are removed by the
rollback part of after-crash recovery, we would get a corrupted
database without that assumption.

Vadim



Re: [HACKERS] CRCs

2001-01-12 Thread Daniele Orlandi

Nathan Myers wrote:
 
 It wouldn't help you recover, but you would be able to report that
 you cannot recover.

While this could help detect hardware problems, you still won't be able
to detect some (many) memory errors, because the CRC will be calculated
on the already-corrupted data.

Of course, there are other situations where a CRC mismatch,
appropriately logged, is a reliable heads-up warning.

Bye!

-- 
 Daniele



Re: [HACKERS] CRCs

2001-01-12 Thread Tom Lane

 AFAICS, disk-block CRCs do not guard against mishaps involving intended
 writes.  They will help guard against data corruption that might creep
 in due to outside factors, however.

 Right.  

Given that we seem to have agreed on that, I withdraw my complaint about
disk-block-CRC not being in there for 7.1.  I think we are still a ways
away from the point where externally-induced corruption is a major share
of our failure rate ;-).  7.2 or so will be time enough to add this
feature, and I'd really rather not force another initdb for 7.1.

regards, tom lane



[HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Tom Lane

"Mikheev, Vadim" [EMAIL PROTECTED] writes:
 Actually, I'd expect the CRC check to catch an all-zeroes page (if
 it fails to complain, then you misimplemented the CRC), so that would
 be the place to deal with it now.

 I've used standard CRC32 implementation you pointed me to -:)
 But CRC is used in WAL records only.

Oh.  I thought we'd agreed that a CRC on each stored disk block would
be a good idea as well.  I take it you didn't do that.

Do we want to consider doing this (and forcing another initdb)?
Or shall we say "too late for 7.1"?

regards, tom lane



Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Bruce Momjian

 "Mikheev, Vadim" [EMAIL PROTECTED] writes:
  Actually, I'd expect the CRC check to catch an all-zeroes page (if
  it fails to complain, then you misimplemented the CRC), so that would
  be the place to deal with it now.
 
  I've used standard CRC32 implementation you pointed me to -:)
  But CRC is used in WAL records only.
 
 Oh.  I thought we'd agreed that a CRC on each stored disk block would
 be a good idea as well.  I take it you didn't do that.


No, I thought we agreed disk block CRC was way overkill.  If the CRC on
the WAL log checks for errors that are not checked anywhere else, then
fine, but I thought disk CRC would just duplicate the I/O subsystem/disk
checks.

-- 
  Bruce Momjian|  http://candle.pha.pa.us
  [EMAIL PROTECTED]   |  (610) 853-3000
  +  If your life is a hard drive, |  830 Blythe Avenue
  +  Christ can be your backup.|  Drexel Hill, Pennsylvania 19026



Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Tom Lane

Bruce Momjian [EMAIL PROTECTED] writes:
 Oh.  I thought we'd agreed that a CRC on each stored disk block would
 be a good idea as well.  I take it you didn't do that.

 No, I thought we agreed disk block CRC was way overkill.  If the CRC on
 the WAL log checks for errors that are not checked anywhere else, then
 fine, but I thought disk CRC would just duplicate the I/O subsystem/disk
 checks.

A disk-block CRC would detect partially written blocks (ie, power drops
after disk has written M of the N sectors in a block).  The disk's own
checks will NOT consider this condition a failure.  I'm not convinced
that WAL will reliably detect it either (Vadim?).  Certainly WAL will
not help for corruption caused by external agents, away from any updates
that are actually being performed/logged.

regards, tom lane
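The torn-page case Tom describes (power drops after the disk has written M of the N sectors in a block) is exactly what a per-block CRC catches. A sketch, assuming for illustration that the CRC is stored in the first four bytes of the page and computed over the rest:

```python
import zlib

SECTOR = 512

def page_crc(page: bytes) -> int:
    # The stored CRC (first 4 bytes) is excluded from the computation.
    return zlib.crc32(page[4:])

def write_page(payload: bytes) -> bytes:
    page = bytearray(4 + len(payload))
    page[4:] = payload
    page[0:4] = zlib.crc32(payload).to_bytes(4, "little")
    return bytes(page)

def verify_page(page: bytes) -> bool:
    return int.from_bytes(page[0:4], "little") == page_crc(page)

old = write_page(b"A" * (4 * SECTOR - 4))
new = write_page(b"B" * (4 * SECTOR - 4))
# Power drops after the drive has written 2 of the 4 sectors:
torn = new[:2 * SECTOR] + old[2 * SECTOR:]
assert verify_page(new) and verify_page(old)   # intact pages pass
assert not verify_page(torn)                   # partial write is detected
```

The drive's own per-sector CRCs pass on every sector of the torn page, which is why only a whole-block check can see this failure.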



Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Philip Warner

At 21:55 11/01/01 -0500, Tom Lane wrote:

Oh.  I thought we'd agreed that a CRC on each stored disk block would
be a good idea as well.  I take it you didn't do that.

Do we want to consider doing this (and forcing another initdb)?
Or shall we say "too late for 7.1"?


I thought it was coming too. I'd like to see it - if it's not too hard in
this release.



Philip Warner
Albatross Consulting Pty. Ltd. (A.B.N. 75 008 659 498)
Tel: (+61) 0500 83 82 81
Fax: (+61) 0500 83 82 82
Http://www.rhyme.com.au
PGP key available upon request, and from pgp5.ai.mit.edu:11371



Re: [HACKERS] CRCs (was Re: [GENERAL] Re: Loading optimization)

2001-01-11 Thread Vadim Mikheev

  No, I thought we agreed disk block CRC was way overkill.  If the CRC on
  the WAL log checks for errors that are not checked anywhere else, then
  fine, but I thought disk CRC would just duplicate the I/O subsystem/disk
  checks.
 
 A disk-block CRC would detect partially written blocks (ie, power drops
 after disk has written M of the N sectors in a block).  The disk's own
 checks will NOT consider this condition a failure.  I'm not convinced
 that WAL will reliably detect it either (Vadim?).  Certainly WAL will

The "physical log" idea proposed by Andreas is implemented!
WAL now saves whole data blocks on their first modification after a
checkpoint. This way, on recovery, modified data blocks will be
restored *as a whole* first. Isn't that much better than mere
detection of partial writes?

Only one type of modification isn't covered at the moment -
updates of t_infomask in heap tuples.

 not help for corruption caused by external agents, away from any updates
 that are actually being performed/logged.

What do you mean by "external agents"?

Vadim
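The physical-log behavior Vadim describes, restoring the whole page image before replaying incremental changes, can be sketched as follows (Python, with a hypothetical record format):

```python
# Hypothetical sketch of "physical log" recovery: the first change to a
# page after a checkpoint logs the whole page image; recovery replays
# that image before any incremental records, so a torn on-disk page is
# simply overwritten wholesale.

def recover(disk, wal):
    pages = dict(disk)
    for rec in wal:
        if rec["type"] == "full_page":
            pages[rec["blk"]] = rec["image"]      # restore as a whole
        elif rec["type"] == "update":
            page = bytearray(pages[rec["blk"]])
            page[rec["off"]:rec["off"] + len(rec["data"])] = rec["data"]
            pages[rec["blk"]] = bytes(page)
    return pages

disk = {0: b"????????"}     # arbitrary (possibly torn) on-disk state
wal = [
    {"type": "full_page", "blk": 0, "image": b"AAAAAAAA"},
    {"type": "update", "blk": 0, "off": 0, "data": b"BB"},
]
assert recover(disk, wal)[0] == b"BBAAAAAA"
```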





Re: [HACKERS] CRCs (was: beta testing version)

2000-12-07 Thread Nathan Myers

On Wed, Dec 06, 2000 at 06:53:37PM -0600, Bruce Guenter wrote:
 On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote:
  On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
   
   I don't know how pgsql does it, but the only safe way I know of
   is to include an "end" marker after each record.
  
  An "end" marker is not sufficient, unless all writes are done in
  one-sector units with an fsync between, and the drive buffering 
  is turned off.
 
 That's why an end marker must follow all valid records.  When you write
 records, you don't touch the marker, and add an end marker to the end of
 the records you've written.  After writing and syncing the records, you
 rewrite the end marker to indicate that the data following it is valid,
 and sync again.  There is no state in that sequence in which partially-
 written data could be confused as real data, assuming either your drives
 aren't doing write-back caching or you have a UPS, and fsync doesn't
 return until the drives return success.

That requires an extra out-of-sequence write. 

   Any other way I've seen discussed (here and elsewhere) either
   - Assume that a CRC is a guarantee.  
  
  We are already assuming a CRC is a guarantee.  
 
  The drive computes a CRC for each sector, and if the CRC is OK the 
  drive is happy.  CRC errors within the drive are quite frequent, and 
  the drive re-reads when a bad CRC comes up.
 
 The kind of data failures that a CRC is guaranteed to catch (N-bit
 errors) are almost precisely those that a mis-read on a hardware sector
 would cause.

They catch a single mis-read, but not necessarily the quite likely
double mis-read.

 ... A CRC would be a good addition to
 help ensure the data wasn't broken by flakey drive firmware, but
 doesn't guarantee consistency.
  No, a CRC would be a good addition to compensate for sector write
  reordering, which is done both by the OS and by the drive, even for 
  "atomic" writes.
 
 But it doesn't guarantee consistency, even in that case.  There is a
 possibility (however small) that the random data that was located in 
 the sectors before the write will match the CRC.

Generally, there are no guarantees, only reasonable expectations.  A 
64-bit CRC would give sufficient confidence without the out-of-sequence
write, and also detect corruption from any source including power outage.

(I'd also like to see CRCs on all the table blocks as well; is there
a place to put them?)

Nathan Myers
[EMAIL PROTECTED]




RE: [HACKERS] CRCs (was: beta testing version)

2000-12-07 Thread Mikheev, Vadim

  That's why an end marker must follow all valid records.  
...
 
 That requires an extra out-of-sequence write. 

Yes, and it also increases the probability of corrupting data already
committed to the log.

 (I'd also like to see CRCs on all the table blocks as well; is there
 a place to put them?)

Do we need it? "physical log" feature suggested by Andreas will protect
us from non atomic data block writes.

Vadim



Re: [HACKERS] CRCs (was: beta testing version)

2000-12-07 Thread Nathan Myers

On Thu, Dec 07, 2000 at 12:22:12PM -0800, Mikheev, Vadim wrote:
   That's why an end marker must follow all valid records.  
 ...
  
  That requires an extra out-of-sequence write. 
 
 Yes, and also increase probability to corrupt already committed
 to log data.
 
  (I'd also like to see CRCs on all the table blocks as well; is there
  a place to put them?)
 
 Do we need it? "physical log" feature suggested by Andreas will protect
 us from non atomic data block writes.

There are myriad sources of corruption, including RAM bit rot and
software bugs.  The earlier and more reliably it's caught, the better.
The goal is to be able to say that a power outage won't invisibly
corrupt your database.

Here are sources for a 64-bit CRC computation, under a BSD license:

  http://gcc.gnu.org/ml/gcc/1999-11n/msg00592.html

Nathan Myers
[EMAIL PROTECTED]



[HACKERS] CRCs (was: beta testing version)

2000-12-06 Thread Nathan Myers

On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
 On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
  Zeugswetter Andreas SB [EMAIL PROTECTED] writes:
   Yes, but there would need to be a way to verify the last page or
   record from txlog when running on crap hardware.
 
  How exactly *do* we determine where the end of the valid log data is,
  anyway?
 
 I don't know how pgsql does it, but the only safe way I know of is to
 include an "end" marker after each record.  When writing to the log,
 append the records after the last end marker, ending with another end
 marker, and fdatasync the log.  Then overwrite the previous end marker
 to indicate it's not the end of the log any more and fdatasync again.

 To ensure that it is written atomically, the end marker must not cross a
 hardware sector boundary (typically 512 bytes).  This can be trivially
 guaranteed by making the marker a single byte.

An "end" marker is not sufficient, unless all writes are done in
one-sector units with an fsync between, and the drive buffering 
is turned off.  For larger writes the OS will re-order the writes.  
Most drives will re-order them too, even if the OS doesn't.

 Any other way I've seen discussed (here and elsewhere) either
 - Requires atomic multi-sector writes, which are possible only if all
   the sectors are sequential on disk, the kernel issues one large write
   for all of them, and you don't powerfail in the middle of the write.
 - Assume that a CRC is a guarantee.  

We are already assuming a CRC is a guarantee.  

The drive computes a CRC for each sector, and if the CRC is OK the 
drive is happy.  CRC errors within the drive are quite frequent, and 
the drive re-reads when a bad CRC comes up.  (If it sees errors too 
frequently on a sector, it rewrites it; if it sees persistent errors 
on a sector, it marks that one bad and relocates it.)  You can expect 
to experience, in production, about the error rate that the drive 
manufacturer specifies as "maximum".

   ... A CRC would be a good addition to
   help ensure the data wasn't broken by flakey drive firmware, but
   doesn't guarantee consistency.

No, a CRC would be a good addition to compensate for sector write
reordering, which is done both by the OS and by the drive, even for 
"atomic" writes.

It is not only "flaky" or "cheap" drives that re-order writes, or
acknowledge as complete writes that are not yet on disk.  You
can generally assume that *any* drive does it unless you have 
specifically turned that off.  The assumption is that if you care,
you have a UPS, or at least have configured the hardware yourself
to meet your needs.

It is purely wishful thinking to believe otherwise.

Nathan Myers
[EMAIL PROTECTED]



Re: [HACKERS] CRCs (was: beta testing version)

2000-12-06 Thread Bruce Guenter

On Wed, Dec 06, 2000 at 11:08:00AM -0800, Nathan Myers wrote:
 On Wed, Dec 06, 2000 at 11:49:10AM -0600, Bruce Guenter wrote:
  On Wed, Dec 06, 2000 at 11:15:26AM -0500, Tom Lane wrote:
   How exactly *do* we determine where the end of the valid log data is,
   anyway?
  
  I don't know how pgsql does it, but the only safe way I know of is to
  include an "end" marker after each record.  When writing to the log,
  append the records after the last end marker, ending with another end
  marker, and fdatasync the log.  Then overwrite the previous end marker
  to indicate it's not the end of the log any more and fdatasync again.
 
  To ensure that it is written atomically, the end marker must not cross a
  hardware sector boundary (typically 512 bytes).  This can be trivially
  guaranteed by making the marker a single byte.
 
 An "end" marker is not sufficient, unless all writes are done in
 one-sector units with an fsync between, and the drive buffering 
 is turned off.

That's why an end marker must follow all valid records.  When you write
records, you don't touch the marker, and add an end marker to the end of
the records you've written.  After writing and syncing the records, you
rewrite the end marker to indicate that the data following it is valid,
and sync again.  There is no state in that sequence in which partially-
written data could be confused as real data, assuming either your drives
aren't doing write-back caching or you have a UPS, and fsync doesn't
return until the drives return success.

 For larger writes the OS will re-order the writes.  
 Most drives will re-order them too, even if the OS doesn't.

I'm well aware of that.

  Any other way I've seen discussed (here and elsewhere) either
  - Assume that a CRC is a guarantee.  
 
 We are already assuming a CRC is a guarantee.  

 The drive computes a CRC for each sector, and if the CRC is OK the 
 drive is happy.  CRC errors within the drive are quite frequent, and 
 the drive re-reads when a bad CRC comes up.

The kind of data failures that a CRC is guaranteed to catch (N-bit
errors) are almost precisely those that a mis-read on a hardware sector
would cause.

... A CRC would be a good addition to
help ensure the data wasn't broken by flakey drive firmware, but
doesn't guarantee consistency.
 No, a CRC would be a good addition to compensate for sector write
 reordering, which is done both by the OS and by the drive, even for 
 "atomic" writes.

But it doesn't guarantee consistency, even in that case.  There is a
possibility (however small) that the random data that was located in the
sectors before the write will match the CRC.
-- 
Bruce Guenter [EMAIL PROTECTED]   http://em.ca/~bruceg/
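Bruce's two-sync end-marker protocol can be sketched as follows. The log is modeled in memory, `sync()` stands in for fdatasync(), and records are assumed not to contain the marker byte values:

```python
END, MORE = b"\xFF", b"\x00"

class Log:
    """In-memory model of the two-sync append protocol described above."""
    def __init__(self):
        self.buf = bytearray(END)   # an empty log is just one end marker
        self.end_pos = 0            # offset of the current end marker
    def sync(self):
        pass                        # stand-in for fdatasync()
    def append(self, records: bytes, crash_before_flip: bool = False):
        old_end = self.end_pos
        # 1. Append the records after the current end marker, followed by
        #    a fresh end marker, and sync. The old marker is untouched, so
        #    a crash here leaves the new records invisible but harmless.
        self.buf[old_end + 1:] = records + END
        self.end_pos = old_end + 1 + len(records)
        self.sync()
        if crash_before_flip:
            return                  # simulated power loss before step 2
        # 2. Only now flip the old marker, atomically making the new
        #    records part of the valid log, and sync again.
        self.buf[old_end:old_end + 1] = MORE
        self.sync()
    def valid_records(self) -> bytes:
        # Everything before the first end marker is valid; strip the
        # flipped markers that separate the appends.
        end = self.buf.index(END[0])
        return bytes(b for b in self.buf[:end] if b != MORE[0])

log = Log()
log.append(b"abc")
log.append(b"de")
assert log.valid_records() == b"abcde"

crashed = Log()
crashed.append(b"abc")
crashed.append(b"de", crash_before_flip=True)
assert crashed.valid_records() == b"abc"   # torn append stays invisible
```

The single-byte marker never crosses a sector boundary, which is what makes the flip in step 2 atomic on real hardware.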
