Greg Sabino Mullane <g...@endpoint.com> writes: >>> ERROR: index "pg_class_oid_index" is not a btree >> That means you got bogus data while reading the metapage. >> I'm beginning to wonder about the hardware on this server ...
> This happened again, and this time I went back through > the logs and found that it is always the exact same query causing > the issue. I also found it occuring on different servers, > which rules out RAM anyway (still shared disk, so those are suspect). > This query also sometimes gives errors like this: > ERROR: could not read block 3 of relation 1663/1554846571/3925298284: > read only 0 of 8192 bytes > However, the final number changes: these are invariably temporary relations. Oh really ... okay, maybe it is a software problem then. > Any ideas on how to carefully debug this? There are a couple of quicksorts > when I explain analyze on a non-prod system, which I am guessing where > the temp tables come from (work_mem is 24MB). No, that error message is complaining about an attempt to read an actual, named, relation (could be temp, no way to be sure from this info). A sort might generate temp files but those don't have that kind of name. > I'm not sure I understand > what could be causing both the 'read 0' and btree errors for the > same query - bad blocks on disk for one of the underlying tables? The "could not read" thing looks more like an attempt to fetch an invalid TID; you could get such an error for instance if you had an index that referenced a tuple in block 3, but the table on disk isn't that long. So one possible theory for what's happening here is that once in a while we get confused about which shared buffer holds which disk block, and either find the wrong block entirely when looking for pg_class_oid_index (the first case) or grab a page of the wrong index in the second case. This theory would be more plausible if you're wrong about the second-case tables being temp, though, because if they are temp then their indexes would be kept in local buffers not shared buffers, making it much harder to believe in a single bug causing both symptoms. One possible mechanism for confusion of that sort would be if the spinlock code wasn't quite right, or the compiler was incorrectly moving loads/stores into or out of locked sections. So it might be time to ask exactly what kind of hardware this is, which compiler PG was built with, etc. On the other hand, if the issue were of that sort then it ought to affect all buffers more or less at random; so if you're consistently seeing exactly these symptoms (in particular, if it's always pg_class_oid_index that's complained of), then I'm not sure I believe this theory either. Which PG version again? Are you in the habit of doing VACUUM FULLs on system catalogs, and if so do these glitches correlate at all with such activities? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers