On 18.12.2011 20:44, David Fetter wrote:
> On Sun, Dec 18, 2011 at 12:19:32PM +0200, Heikki Linnakangas wrote:
>> On 18.12.2011 10:54, David Fetter wrote:
>>> On Sun, Dec 18, 2011 at 10:14:38AM +0200, Heikki Linnakangas wrote:
>>>> On 17.12.2011 23:33, David Fetter wrote:
>>>>> If this introduces new failure modes, please detail, and preferably
>>>>> demonstrate, just what those new modes are.
>>>>
>>>> Hint bits, torn pages -> failed CRC. See earlier discussion:
>>>>
>>>> http://archives.postgresql.org/pgsql-hackers/2009-11/msg01975.php
>>>
>>> The patch requires that full page writes be on in order to obviate
>>> this problem by never reading a torn page.
>>
>> Doesn't help. Hint bit updates are not WAL-logged.
>
> What new failure modes are you envisioning for this case?

Umm, the one explained in the email I linked to... Let me try once more. For the sake of keeping the example short, imagine that the PostgreSQL block size is 8 bytes, and the OS block size is 4 bytes. The CRC is 1 byte, and is stored in the first byte of each page.

In the beginning, a page is in the buffer cache, and it looks like this:

AA 12 34 56  78 9A BC DE

AA is the checksum. Now a hint bit on the last byte is set, so that the page in the shared buffer cache looks like this:

AA 12 34 56  78 9A BC DF

Now PostgreSQL wants to evict the page from the buffer cache, so it recalculates the CRC. The page in the buffer cache now looks like this:

BB 12 34 56  78 9A BC DF

Now, PostgreSQL writes the page to the OS cache, with the write() system call. It sits in the OS cache for a few seconds, and then the OS decides to flush the first 4 bytes, i.e. the first OS block, to disk. On disk, you now have this:

BB 12 34 56  78 9A BC DE

If the server now crashes before the OS has flushed the second half of the PostgreSQL page to disk, you have a classic torn page. The updated CRC made it to disk, but the hint bit did not. The CRC on disk does not match the rest of the contents of that page on disk.

Without CRCs, that's not a problem because the data is valid whether or not the hint bit makes it to the disk. It's just a hint, after all. But when you have a CRC on the page, the CRC is only valid if both the CRC update *and* the hint bit update make it to disk, or neither.

So you've just turned an innocent torn page, which PostgreSQL tolerates just fine, into a block with a bad CRC.
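
For anyone who wants to replay the sequence above, here is a minimal sketch in Python. It assumes the same toy layout (8-byte page, 4-byte OS block, 1-byte checksum in the first byte). The checksum function is a plain XOR chosen only for illustration, so the checksum bytes will not be the AA and BB used in the example, and none of the names below exist in PostgreSQL.

```python
PAGE_SIZE = 8   # toy PostgreSQL block size from the example
OS_BLOCK = 4    # toy OS block size from the example


def checksum(payload: bytes) -> int:
    """Toy 1-byte checksum (XOR of all payload bytes); a stand-in for a real CRC."""
    c = 0
    for b in payload:
        c ^= b
    return c


def seal(page: bytearray) -> None:
    """Store the checksum of bytes 1..7 in byte 0, as in the example layout."""
    page[0] = checksum(page[1:])


def is_valid(page: bytes) -> bool:
    """True if the stored checksum matches the rest of the page."""
    return page[0] == checksum(page[1:])


def torn_page_demo():
    # Page as it sits on disk, with a valid checksum.
    disk = bytearray.fromhex("00123456789ABCDE")
    seal(disk)

    # Copy in the buffer cache: a hint bit is set in the last byte
    # (DE -> DF), and the checksum is recalculated before eviction.
    buf = bytearray(disk)
    buf[-1] |= 0x01
    seal(buf)

    # Crash after the OS flushed only the first OS block: the new
    # checksum reaches disk, the hint bit in the second block does not.
    torn = bytearray(disk)
    torn[:OS_BLOCK] = buf[:OS_BLOCK]
    return disk, buf, torn


if __name__ == "__main__":
    disk, buf, torn = torn_page_demo()
    print("old page valid: ", is_valid(disk))
    print("new page valid: ", is_valid(buf))
    print("torn page valid:", is_valid(torn))
```

The torn page fails verification even though every byte after the checksum is identical to the old, perfectly usable page.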

> Any way to
> simulate them, even if it's by injecting faults into the source code?

Hmm, it's hard to persuade the OS to suffer a torn page on purpose. What you could do is split the write() call in mdwrite() into two. First write the 1st half of the page, then the second. Then you can put a breakpoint in between the writes, and kill the system before the 2nd half is written.
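
A rough sketch of that fault-injection idea, written in Python rather than as a patch to mdwrite(). The function names, the SimulatedCrash exception and the crash_between flag are all made up for illustration; an exception stands in for the breakpoint-and-kill step. Note that in a real test you would have to take down the whole system, not just the process, so that the half already handed to the OS is all that survives.

```python
import os
import tempfile

PAGE_SIZE = 8192  # PostgreSQL's default block size


class SimulatedCrash(Exception):
    """Stands in for killing the system between the two writes."""


def write_page_in_halves(fd, offset, page, crash_between=False):
    """Write one page as two separate write() calls instead of one."""
    half = len(page) // 2
    os.lseek(fd, offset, os.SEEK_SET)
    os.write(fd, page[:half])
    if crash_between:
        # This is where the breakpoint would go.
        raise SimulatedCrash("stopped before the second half was written")
    os.write(fd, page[half:])


def torn_write_demo():
    old = bytes([0xAA]) * PAGE_SIZE
    new = bytes([0xBB]) * PAGE_SIZE
    fd, path = tempfile.mkstemp()
    try:
        write_page_in_halves(fd, 0, old)          # intact old page
        try:
            write_page_in_halves(fd, 0, new, crash_between=True)
        except SimulatedCrash:
            pass                                   # "crash" mid-page
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, PAGE_SIZE)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    page = torn_write_demo()
    half = PAGE_SIZE // 2
    print("first half new: ", page[:half] == bytes([0xBB]) * half)
    print("second half old:", page[half:] == bytes([0xAA]) * half)
```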

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
