A customer of ours is quite concerned after finding zero pages in an index following a system crash. The task now is to improve the diagnosability of such issues and to be able to definitively point to the source of the zero pages.
The proposed solution below has been vetted in-house at EnterpriseDB, and I am posting it here to see if there are any problems we missed, and also to ask whether the community would be interested in incorporating this capability.

Background:
-----------

SUSE Linux, ATCA board, 4 dual-core CPUs => 8 cores, 24 GB RAM, 140 GB disk, PG 8.3.11. RAID-1 SAS, with SCSIinfo reporting that write-caching is disabled.

The corrupted index's file contents, based on hexdump: it has a total of 525 pages (cluster block size is 8 KB, per pg_controldata).

  Blocks 0 to 278 look sane.
  Blocks 279 to 518 are full of zeroes.
  Blocks 519 to 522 look sane.
  Block 523 is filled with zeroes.
  Block 524 looks sane.

The tail ends of blocks 278 and 522 contain some non-zero data, meaning that those index pages have valid 'Special space' contents. Also, the heads of blocks 519 and 524 look sane. These two findings imply that the zeroing happened at 8 KB page boundaries.

This is a standard ext3 filesystem with a 4 KB block size, which raises the question of how we can ascertain that this was indeed a hardware/filesystem malfunction. And if it was a hardware/filesystem problem, why didn't we see zeroes at a 512-byte boundary (generally the disk's sector size) or a 4 KB boundary (the default ext3 block size) that does not align with an 8 KB boundary? The backup from before the crash does not have these zero pages.

Disk Page Validity Check Using Magic Number
===========================================

Requirement:
------------

We have encountered quite a few zero pages in an index after a machine crash, rendering the index unusable. REINDEX is an option, but we have no way of telling whether these zero pages were caused by the hardware, the filesystem, or Postgres itself. Analysis of the Postgres code shows that Postgres being the culprit is a very low probability; similarly, since our hardware is of good quality, with hardware-level RAID-1 over 2 disks, it is difficult to consider the hardware to be the problem.
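(Aside: the hexdump survey above can be automated. Below is a rough sketch of such a scanner — the hard-coded BLCKSZ and the idea of reading the raw segment file directly are assumptions for illustration, not tooling we actually ship:)

```c
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192             /* cluster block size, per pg_controldata */

/*
 * Scan a relation segment file in BLCKSZ units and report every block
 * that is filled entirely with zeroes.  Returns the count of such blocks,
 * or -1 if the file cannot be opened.
 */
int
scan_zero_pages(const char *path)
{
    FILE           *f = fopen(path, "rb");
    unsigned char   buf[BLCKSZ];
    unsigned char   zero[BLCKSZ] = {0};
    long            blkno = 0;
    int             nzero = 0;

    if (f == NULL)
        return -1;

    while (fread(buf, 1, BLCKSZ, f) == BLCKSZ)
    {
        if (memcmp(buf, zero, BLCKSZ) == 0)
        {
            printf("block %ld is all zeroes\n", blkno);
            nzero++;
        }
        blkno++;
    }
    fclose(f);
    return nzero;
}
```

Running this over the index file in question would print the 279-518 and 523 ranges described above, confirming that the zeroing is aligned to 8 KB page boundaries rather than 512-byte or 4 KB ones.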
The ext3 filesystem being used is also a time-tested piece of software, so it becomes very difficult to point a finger at any of these three components for the corruption.

Postgres is being deployed as a component of a carrier-grade platform, and it is required to run unattended as much as possible. There is a High Availability monitoring component that is tasked with performing a switchover to a standby node in the event of any problem with the primary node. This HA component needs to perform regular health checks on all the other components, including Postgres, and take corrective actions.

With zero pages comes the difficulty of ascertaining whether they are legitimate zero pages (Postgres considers zero pages valid, e.g. leftovers from a file extension followed by a crash), or the result of a filesystem/hardware failure. We are required to definitively differentiate between zero pages written by Postgres and zero pages caused by hardware failure. That is not possible by the very nature of the problem, so we explored a few ideas, including per-block checksums (in-block or in a checksum fork), S.M.A.R.T. monitoring of the disk drives, PageInit() before smgrextend() in ReadBuffer_common(), and an additional PageHeader member for a magic number.

The following is the approach we think is least invasive and does not threaten code breakage, yet provides definitive detection of corruption/data loss outside Postgres with the least performance penalty.

Implementation:
---------------

.) The basic idea is to place a magic number in every PageHeader before it is written to disk, and to check for this magic number when performing page validity checks.

.) To avoid adding a new field to PageHeader, and any code breakage, we reuse an existing member of the structure.

.) We exploit the following facts and assumptions:

   -) Relations/files are extended 8 KB (BLCKSZ) at a time.
   -) Every I/O unit contains a PageHeader structure (table/index/fork files), which in turn contains pd_lsn as its first member.
   -) Every newly written block is considered to be zero-filled.
   -) PageIsNew() assumes that if pd_upper is 0 then the page is zero.
   -) PageHeaderIsValid() allows zero-filled pages to be considered valid.
   -) Anyone wishing to use a new page has to do PageInit() on the page.
   -) PageInit() does a MemSet(0) on the whole page.
   -) XLogRecPtr = {x, 0} is considered invalid.
   -) XLogRecPtr = {x, ~((uint32)0)} is not valid either (it points at the last byte of an xlog file (not segment)); we'll use this as the magic number.
      ... The above is my assumption, since it is not mentioned anywhere in the code; the XLogFileSize calculation seems to support it.
      ... If this assumption doesn't hold, then the previous one, {x, 0} (with x > 0), can be used to implement the magic number instead.
   -) There's only one implementation of the Storage Manager, i.e. md.c.
   -) smgr_extend() -> mdextend() is the only place where a relation is extended.
   -) Writing beyond EOF in a file turns the intermediate space into a hole, and any read from such a hole returns zero-filled pages.
   -) Anybody trying to extend a file makes sure that there's no concurrent extension going on from somewhere else.
      ... This is ensured either by the implicit nature of the calling code, or by calling LockRelationForExtension().

.) In mdextend(), if the buffer being written is zero-filled, we write the magic number into that page's pd_lsn.
   ... This check can be optimized to inspect just sizeof(pd_lsn) bytes of the buffer.

.) In mdextend(), if the buffer is being written beyond the current EOF, we forcibly write the intermediate blocks too, and write the magic number into each of them.
   ... This needs an _mdnblocks() call and FileSeek(SEEK_END)+FileWrite() calls for every block in the hole.
   ... Creation of holes is assumed to be a very limited corner case, so this performance hit is acceptable in these rare cases. Tests are being planned with a real application to check how often this occurs.

.) PageHeaderIsValid() needs to be modified to accept MagicNumber-followed-by-zeroes as a valid page (rather than a completely zero page).
   ... If a page is completely filled with zeroes, this confirms that either the filesystem or the disk storage zeroed it, since Postgres never writes zero pages to disk.

.) PageInit() and PageIsNew() require no change.

.) XLByteLT(), XLByteLE() and XLByteEQ() may be changed to contain AssertMacro( !MagicNumber(a) && !MagicNumber(b) ).

.) I haven't analyzed the effects of this change on the recovery code, but I have a feeling that we might not need to change anything there.

.) We can create a contrib module (a standalone binary or a loadable module) that goes through each disk page, checks it for being zero-filled, and raises an alarm if it finds any.

Thoughts welcome.
--
gurjeet.singh @ EnterpriseDB - The Enterprise Postgres Company
http://www.EnterpriseDB.com
singh.gurj...@{ gmail | yahoo }.com
Twitter/Skype: singh_gurjeet

Mail sent from my BlackLaptop device
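P.S. To make the mechanics concrete, here is a minimal, self-contained sketch of the stamp-and-validate idea. The struct layout, function names, and hard-coded BLCKSZ are simplifications for illustration, not the real PageHeaderData/md.c code:

```c
#include <stdint.h>
#include <string.h>

#define BLCKSZ 8192

/* Hypothetical, simplified stand-in for XLogRecPtr (the real PageHeaderData
 * carries one of these as pd_lsn, its first member). */
typedef struct
{
    uint32_t    xlogid;     /* the 'x' in {x, y} */
    uint32_t    xrecoff;    /* byte offset within the xlog file */
} XLogRecPtr;

/* {x, ~((uint32)0)} points at the last byte of an xlog file and is assumed
 * never to be a valid LSN, so it serves as the magic number. */
static const XLogRecPtr MagicLSN = {0, ~((uint32_t) 0)};

static int
buf_is_zero(const char *p, size_t len)
{
    size_t      i;

    for (i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}

/* What mdextend() would do: before writing a zero-filled block (including
 * blocks written out to plug a hole), stamp the magic number into pd_lsn. */
void
stamp_magic_if_zero(char *page)
{
    if (buf_is_zero(page, BLCKSZ))
        memcpy(page, &MagicLSN, sizeof(MagicLSN));
}

/* What the relaxed PageHeaderIsValid() would accept as a "new" page:
 * magic pd_lsn followed by zeroes.  A completely zero page then becomes
 * proof that something below Postgres zeroed it. */
int
zero_page_is_legit(const char *page)
{
    return memcmp(page, &MagicLSN, sizeof(MagicLSN)) == 0 &&
           buf_is_zero(page + sizeof(MagicLSN), BLCKSZ - sizeof(MagicLSN));
}
```

In the real patch the stamping would live in mdextend() and the check in PageHeaderIsValid(); the sketch only demonstrates that a raw zero page becomes distinguishable from one that Postgres wrote.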