On Fri, Jun 16, 2023 at 9:18 AM Andrey M. Borodin <x4...@yandex-team.ru> wrote:
> Hi hackers,
>
> Relcache errors from time to time detect catalog corruption. For example,
> I recently observed the following:
>
> 1. The filesystem or NVMe disk zeroed out the leading 160 kB of a catalog
>    index. This type of corruption passes through data_checksums.
> 2. RelationBuildTupleDesc() was failing with "catalog is missing 1
>    attribute(s) for relid 2662".
> 3. We monitor corruption error codes and alert on-call DBAs when we see
>    one, but the message is not marked as XX001 or XX002. It is XX000,
>    which occurs from time to time for reasons less critical than data
>    corruption.
> 4. High-availability automation switched the primary to another host, and
>    no other monitoring checks fired either.
>
> This particular case is not very illustrative: in fact we had index
> corruption that looked like catalog corruption. But it still seems to me
> that catalog inconsistencies (such as relnatts != number of pg_attribute
> rows) could be marked with ERRCODE_DATA_CORRUPTED. In my experience this
> error code has proved to be a good indicator for early corruption
> detection.
>
> What do you think?
> What other subsystems could be improved in the same manner?
>
> Best regards, Andrey Borodin.

Andrey,

I think this is a good idea.

Your #1 item sounds familiar, though. There was a thread about someone
creating and dropping lots of databases who found some kind of race
condition that would zero out pg_ catalog entries, just as you describe.
I believe he discovered the problem when relations could not be found
and/or the database would not start. I just spent 30 minutes looking for
that thread, but my "search-fu" is apparently failing me.

Which leads me to ask: is there a way to detect the corrupting write
(writing all zeroes to a file when we know better, or a zeroed-out header
where one cannot exist)?

Hoping this triggers a bright idea on your end...

Kirk...