Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> As detailed in
> https://postgr.es/m/20190319200050.ncuxejradurjakdc%40alap3.anarazel.de
> the way the backend's basebackup checksum verification works makes its
> error detection capabilities very dubious.
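For anyone following along, the concern boils down to the fact that the
verification skips checksum checking for any page whose LSN looks newer
than the backup start LSN, on the assumption that such a page is being
concurrently written.  A rough, purely illustrative Python sketch of
that skip logic follows- the page layout, checksum placement, and all
names here are simplified toys, not the actual on-disk format or any
PostgreSQL API:

```python
import struct
import zlib

PAGE_SIZE = 8192
LSN_SIZE = 8   # toy stand-in for pd_lsn at the start of the page header

def make_page(lsn, payload):
    """Build a toy 'page': 8-byte big-endian LSN, payload, zero padding."""
    body = struct.pack('>Q', lsn) + payload
    return body + b'\x00' * (PAGE_SIZE - len(body))

def checksum(page):
    # Toy checksum over the whole page (stored out-of-band here for
    # simplicity; the real checksum lives in the page header).
    return zlib.crc32(page)

def verify_page(page, stored_checksum, backup_start_lsn):
    """Skip verification for pages whose LSN looks newer than the
    backup start, on the theory that they are concurrent writes."""
    page_lsn = struct.unpack('>Q', page[:LSN_SIZE])[0]
    if page_lsn > backup_start_lsn:
        return 'skipped'   # assumed to be a torn/concurrent write
    if checksum(page) != stored_checksum:
        return 'corrupt'
    return 'ok'

backup_start_lsn = 1000
page = make_page(lsn=900, payload=b'important tuple data')
stored = checksum(page)

# An intact page verifies fine.
print(verify_page(page, stored, backup_start_lsn))     # ok

# Corruption outside the LSN bytes is caught...
garbled = page[:100] + b'\xff' + page[101:]
print(verify_page(garbled, stored, backup_start_lsn))  # corrupt

# ...but corruption landing in the LSN bytes that happens to decode as
# a value above the backup start LSN silently evades detection.
evasive = b'\xff' * LSN_SIZE + page[LSN_SIZE:]
print(verify_page(evasive, stored, backup_start_lsn))  # skipped
```

The point of contention below is how likely that last case is in
practice, and what it would take to close the gap.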
I disagree that it's 'very dubious', even with your analysis.  I thought
Robert's response was generally good, pointing out that this is only an
issue if the corruption happens to land in a particular set of bytes.
That said, I'm happy to see improvements in this area, but I'm flat out
upset about the notion that we must be perfect here- our checksums
themselves aren't perfect for catching corruption either.

> I think we need to fix this before the next set of backbranch releases,
> or at the very least add a big fat warning that the feature isn't doing
> much.

I disagree about this level of urgency, but if you have a decent idea
about how to improve the situation, I'm fully in support of it.

> The more I think about it, the less convinced I am of the method to
> avoid the torn page problem using LSNs. To make e.g. the PageIsNew()
> check correct, we need to verify that the whole page is zeroes - but we
> can't use the LSN for that, as it's not on the page. But there very well
> could be a torn page issue with only the second half of the page being
> written back (with the default 8kb pages that can trivially happen as
> the kernel's pagecache commonly uses 4kb granularity).
>
> I basically think it needs to work like this:
>
> 1) Perform the entire set of PageIsVerified() checks *without*
>    previously checking the page's LSN, but don't error out.

Performing the PageIsVerified() checks seems reasonable, and I don't see
any downside to doing that, so if you'd like to add that, sure, go for
it.

> 2) If 1) errored out, ensure that that's because the backend is
>    currently writing out the page. That basically requires doing what
>    BufferAlloc() does. So I think we'd need to end up with a function
>    like:
>
> LockoutBackendWrites():
>     buf = ReadBufferWithoutRelcache(relfilenode);

This is going to cause the page to be pulled into shared buffers, if it
isn't already there, isn't it?
That seems less than ideal, and isn't it going to end up just doing
exactly the same PageIsVerified() call?  And what happens when that
fails?  You're going to end up getting an ereport(ERROR) and a long-jump
out of here...  Depending on the other code, maybe that's something you
can manage, but it seems rather tricky to me.  I do think, as was
discussed extensively previously, that the backup *should* continue even
in the face of corruption, with warnings issued to notify the user of
the issues.

>     LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
>     /*
>      * Reread page from OS, and recheck. This needs to happen while
>      * the IO lock prevents rereading from the OS. Note that we do
>      * not want to rely on the buffer contents here, as that could
>      * be very old cache contents.
>      */
>     perform_checksum_check(relfilenode, ERROR);
>
>     LWLockRelease(BufferDescriptorGetContentLock(buf));
>     ReleaseBuffer(buf);

I don't particularly like having to lock pages in this way while
performing this check, especially with potentially having to read the
page into shared buffers.

This also isn't the only approach to dealing with the possibility that
the LSN itself is corrupted.  There are at least two other ways we can
improve the situation here.  First, we can keep track of the highest LSN
seen, perhaps on a per-file basis, and then compare those to the
end-of-backup LSN, issuing a warning (or performing a re-check, or doing
something else) if we discover that an LSN found was later than the
end-of-backup LSN.  That's not perfect, but it's certainly a good
improvement over what we have today.  The other approach would be to
track all of the pages which were skipped and then compare them to the
pages in the WAL which was archived during the backup, making sure that
all pages which failed checksum exist in the WAL.
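To make those two ideas concrete, here's a rough Python sketch of each-
the function names, file names, and (file, block, lsn) representation
are entirely hypothetical, chosen for illustration rather than taken
from any actual PostgreSQL code:

```python
def check_max_lsns(max_lsn_per_file, end_of_backup_lsn):
    """First idea: track the highest page LSN seen in each file while
    the backup runs, then flag any file whose max LSN is later than the
    end-of-backup LSN -- no legitimately written page can carry an LSN
    from the future, so such a value is almost certainly corruption."""
    return [fname for fname, max_lsn in max_lsn_per_file.items()
            if max_lsn > end_of_backup_lsn]

def cross_check_skipped(skipped_pages, pages_in_archived_wal):
    """Second idea: every page skipped due to a new-looking LSN should
    appear in the WAL archived during the backup (it was supposedly
    being modified).  Any skipped page *not* found there was not
    actually being written, so its 'new' LSN is suspect."""
    return skipped_pages - pages_in_archived_wal

# Per-file max LSNs observed during the backup:
max_lsns = {'base/16384/1259': 980, 'base/16384/2619': 5000}
print(check_max_lsns(max_lsns, end_of_backup_lsn=1200))
# -> ['base/16384/2619']  (an LSN from the future: suspect)

# Pages skipped as "concurrently written", vs. (file, block) pairs the
# archived WAL actually touched during the backup:
skipped = {('base/16384/1259', 7), ('base/16384/1259', 42)}
wal_touched = {('base/16384/1259', 7)}
print(cross_check_skipped(skipped, wal_touched))
# -> {('base/16384/1259', 42)}  (skipped but never written: suspect)
```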
That should allow us to confirm that the page was actually being
modified and won't ever be used in the state that we saw it in, since
it'll be replayed over by WAL, and therefore we don't have to worry
about the LSN or the rest of the page being corrupt.  Of course, that
requires tracking all the pages which are modified by the WAL for the
duration of the backup, and tracking all the pages which failed checksum
and/or other validation, and then performing the cross-check.  That
seems like a fair bit of work, but I'm not sure that it's avoidable,
ultimately.

I'm happy with incremental improvements in this area though, and just
checking that the LSN of a skipped page isn't completely insane would
definitely be a good improvement to begin with, and might be simple
enough to back-patch.  I don't think back-patching changes like those
proposed here is a good idea.  I don't have any problem adding
additional documentation to explain what's being done though, with
appropriate caveats about how this might not catch all types of
corruption (we should do the same for the checksum feature itself, if we
don't already have such caveats...).

Thanks!

Stephen