On Fri, Jun 16, 2023 at 9:18 AM Andrey M. Borodin
wrote:
> Hi hackers,
>
> Relcache errors from time to time detect catalog corruption. For example,
> I recently observed the following:
> 1. The filesystem or NVMe disk zeroed out the leading 160KB of a catalog
> index. This type of corruption passes through data_checksums undetected.
> 2. RelationBuildTupleDesc() was failing with "catalog is missing 1
> attribute(s) for relid 2662".
> 3. We monitor corruption error codes and alert on-call DBAs when we see one,
> but this message is not marked as XX001 or XX002. It is XX000, which occurs
> from time to time for reasons less critical than data corruption.
> 4. High-availability automation switched the primary to another host, and no
> other monitoring checks fired either.
>
> This particular case is not very illustrative. In fact, we had index
> corruption that looked like catalog corruption.
> Still, it seems to me that catalog inconsistencies (like relnatts !=
> number of pg_attribute rows) could be marked with ERRCODE_DATA_CORRUPTED.
> In my experience, this error code has proven to be a good indicator
> for early corruption detection.
>
> What do you think?
> What other subsystems can be improved in the same manner?
>
> Best regards, Andrey Borodin.
>
Andrey, I think this is a good idea; something like the rough sketch below is
what I imagine you have in mind.
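
For the error you quoted, I picture the change looking roughly like this (an
untested sketch against the check in RelationBuildTupleDesc() in relcache.c,
assuming the existing variable names there; today that site is a plain
elog(ERROR, ...) with no explicit errcode):

    /* relnatts said there should be more pg_attribute rows than we found */
    if (need != 0)
        ereport(ERROR,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("catalog is missing %d attribute(s) for relid %u",
                        need, RelationGetRelid(relation))));

That would make the message show up as XX001 instead of XX000, which is what
your monitoring keys on.
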
But your #1 item sounds familiar. There was a thread about someone creating
and dropping lots of databases who found some kind of race condition that
would zero out pg_ catalog entries, just as you describe. I think he
discovered the problem when relations could not be found and/or the database
refused to start. I just spent 30 minutes looking for it, but my "search-fu"
is apparently failing.
Which leads me to ask whether there is a way to detect the corrupting write
itself: writing all zeroes to the file when we know better, or a zeroed-out
header where one cannot exist? Something like the rough sketch after this
paragraph, perhaps. Hoping this triggers a bright idea on your end...
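
Just to make that concrete, I am imagining something along these lines at
buffer-read time (a purely hypothetical sketch; the names page, blkno,
nblocks_expected and relname are made up, and an all-zero page is of course
legal for a freshly extended block, so a real check would need to be smarter):

    /*
     * An all-zero page is fine for a block that was just extended, but a
     * zeroed page well inside an existing catalog index is suspicious.
     */
    if (PageIsNew(page) && blkno < nblocks_expected)
        ereport(WARNING,
                (errcode(ERRCODE_DATA_CORRUPTED),
                 errmsg("unexpected all-zero page in block %u of relation \"%s\"",
                        blkno, relname)));
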
Kirk...