Michael, * Michael Banck (michael.ba...@credativ.de) wrote: > On Sun, Mar 04, 2018 at 06:19:00PM +0100, Magnus Hagander wrote: > > So sure, if we go with WARNING + exit with an errorcode, that is perhaps > > the best combination of the two. > > I had a look at how to go about this, but it appears to be a bit > complicated; the first problem is that sendFile() and sendDir() don't > have status return codes that could be set on checksum verifcation > failure. So I added a global variable and threw an ereport(ERROR) at the > end of perform_base_backup(), but then I realized that `pg_basebackup' > the client program purges the datadir it created if it gets an error: > > |pg_basebackup: final receive failed: ERROR: Checksum mismatch during > |basebackup > | > |pg_basebackup: removing data directory "data2"
Oh, ugh. > So I guess this would have to be sent back via the replication protocol, > but I don't see an off-hand way to do this easily? The final ordinary result set could be extended to include the information about checksum failures..? I'm a bit concerned about what to do when there are a lot of checksum failures though.. Ideally, you'd identify all of the pages in all of the files where a checksum failed (just throwing an error such as the one proposed above is really rather terrible since you have no idea what block, or even what table, failed the checksum...). I realize this is moving the goalposts a long way, but I had actually always envisioned having file-by-file pg_basebackup being put in at some point, instead of tablespace-by-tablespace, which would allow for both an ordinary result set being returned for each file that could contain information such as the checksum failure and what pages failed, and be a stepping stone for parallel pg_basebackup.. > Another option would be to see whether it is possible to verify the > checksum on the client side, but then only pg_basebackup (and no other > possible external tools using BASE_BACKUP) would profit. Reviewing the original patch and considering this issue, I believe there may be a larger problem- while very unlikely, there's been concern that it's possible to read a half-written page (and possibly only the second half) and end up with a checksum failure due to that. In pgBackRest, we address that by doing another read of the page and by checking the LSN vs. where we started the backup (if the LSN is more recent than when the backup started then we don't have to care about the page- it'll be in the WAL). If we're going to solve that issue the same way pgBackRest does, then you'd really have to do it server-side, I'm afraid.. Either that, or add a way for the client to request individual blocks be re-sent, but that would be awful difficult for pg_basebackup to update in the resulting tar file if it was compressed.. Thanks! Stephen
signature.asc
Description: PGP signature