On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger <mark.dil...@enterprisedb.com> wrote: > It will reconnect and retry a command one time on error. That should cover > the case that the connection to the database was merely lost. If the second > attempt also fails, no further retry of the same command is attempted, though > commands for remaining relation targets will still be attempted, both for the > database that had the error and for other remaining databases in the list. > > Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` > could result in two failures per relation in db2 before finally moving on to > db3. That seems pretty awful considering how many relations that could be, > but failing to soldier on in the face of errors seems a strange design for a > corruption checking tool.
That doesn't seem right at all. I think a PQsendQuery() failure is so remote that it's probably justification for giving up on the entire operation. If it's caused by a problem with some object, it probably means that accessing that object caused the whole database to go down, and retrying the object will take the database down again. Retrying the object is betting that the user interrupted connectivity between pg_amcheck and the database but the interruption is only momentary and the user actually wants to complete the operation. That seems unlikely to me. I think it's far more probably that the database crashed or got shut down and continuing is futile. My proposal is: if we get an ERROR trying to *run* a query, give up on that object but still try the other ones after reconnecting. If we get a FATAL or PANIC trying to *run* a query, give up on the entire operation. If even sending a query fails, also give up. > In v39, exit(1) is used for all errors which are intended to stop the > program. It is important to recognize that finding corruption is not an > error in this sense. A query to verify_heapam() can fail if the relation's > checksums are bad, and that happens beyond verify_heapam()'s control when the > page is not allowed into the buffers. There can be errors if the file > backing a relation is missing. There may be other corruption error cases > that I have not yet thought about. The connections' errors get reported to > the user, but pg_amcheck does not exit as a consequence of them. As > discussed above, failing to send the query to the server is not viewed as a > reason to exit, either. It would be hard to quantify all the failure modes, > but presumably the catalogs for a database could be messed up enough to cause > such failures, and I'm not sure that pg_amcheck should just abort. I agree that exit(1) should happen after any error intended to stop the program. But I think it should also happen at the end of the run if we hit any problems for which we did not stop, so that exit(0) means your database is healthy. -- Robert Haas EDB: http://www.enterprisedb.com