Hi all,

I just had the following error on one of our databases:
ERROR:  could not access status of transaction 1038286848
DETAIL:  could not open file "pg_clog/03DE": No such file or directory

I researched on the mailing list and it looks like the usual suspect is disk page corruption. There are a few previous discussions about how to dump the suspect disk page, e.g.:

http://archives.postgresql.org/pgsql-general/2006-10/msg01372.php
http://groups.google.com/group/comp.databases.postgresql.hackers/browse_frm/thread/4988752a6939f45a/fd9f12468e86dd3?hl=en&lr=&ie=UTF-8&rnum=8&prev=/groups%3Fq%3Dpg_filedump%26hl%3Den%26lr%3D%26ie%3DUTF-8%26selm%3D20030922162322.E12708%2540quartz.newn.cam.ac.uk%26rnum%3D8#fd9f12468e86dd3

You can probably find more searching for "ERROR: could not access status of transaction" or "pg_filedump".

What I could not find was a simple and fast way to find the bad block. The error message itself is not useful in this regard, and the "binary search" method is anything but fast on a big table. So I wrote the following plpgsql function:

create or replace function find_bad_block(p_TableName text)
returns tid
as $find_bad_block$
declare
  result tid;
  crtRow record;
  count bigint := 0;
begin
  for crtRow in execute 'select ctid from ' || p_TableName loop
    result = crtRow.ctid;
    count := count + 1;
    if count % 500000 = 0 then
      raise notice 'rows processed: %', count;
    end if;
  end loop;
  return result;
exception
  when OTHERS then
    raise notice '%: %', SQLSTATE, SQLERRM;
    return result;
end;
$find_bad_block$ language plpgsql;

This will spit out the error plus the last valid block it visited. If there's no bad block, you won't get the notice with the error, just the last block of the table... It worked fine for me, resulting in:

NOTICE:  58P01: could not access status of transaction 1038286848
 find_bad_block
----------------
 (657780,157)
(1 row)

Now to finding the file I should dump:

select oid from pg_database where datname = 'dbname';
  oid
-------
 16389
(1 row)

select relfilenode from pg_class where relname = 'table_name';
 relfilenode
-------------
       20418
(1 row)

The file is then 'db_path/base/16389/20418', or a collection of '20418.x' files if the table's data is more than 1 GB, each file being a 1 GB chunk of the table... so which one to dump?

First calculate the block count in one chunk: (1024 * 1024) / block_size, where block_size is the block size in KB with which postgres was compiled. That's normally 8, but we have systems where it is set to 32; if you didn't change it yourself, it is most likely 8. So in my case the block count per chunk was:

1024 * 1024 / 8 = 131072

The chunk number is then:

block_id / blocks_per_chunk = 657780 / 131072 = 5

So I should look in the file 'db_path/base/16389/20418.5'... but wait, the block id has to be made relative to the chunk file's start:

chunk_block_id = block_id % blocks_per_chunk

So the block id to use with pg_filedump is:

657780 % 131072 = 2420

(A short shell sketch wrapping up this arithmetic follows after the dump below.) So my final command line was:

pg_filedump -if -R 2420 db_path/base/16389/20418.5

resulting in something like:

[snip]
<Data> ------
 Item   1 -- Length:   44  Offset: 8148 (0x1fd4)  Flags: USED
  XMIN: 1048587  CMIN: 90130188  XMAX: 0  CMAX|XVAC: 1036845056
  Block Id: 655376  linp Index: 18451   Attributes: 1375   Size: 0
  infomask: 0x0200 (XMIN_INVALID)
  Error: Computed header length not equal to header size.
         Computed <28>  Header: <0>

  1fd4: 0b001000 0c475f05 00000000 0000cd3d  .....G_........=
  1fe4: 0a001000 13485f05 00020000 2b030000  .....H_.....+...
  1ff4: 2d030000 00000000 01000000           -...........
[snip]

So I found my bad block, and the previous and following blocks looked OK.
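For convenience, here is the same chunk arithmetic as a small shell sketch. This is just a summary of the steps above, not something I actually ran at the time; the block number 657780 and the 8 KB block size are the values from my case, so substitute your own (on reasonably recent servers 'show block_size;' should report the compiled-in block size, in bytes):

# blocks per 1 GB chunk file = 1 GB / block size = (1024 * 1024) KB / 8 KB = 131072
echo $(( 657780 / (1024 * 1024 / 8) ))   # chunk file suffix -> 5, i.e. 20418.5
echo $(( 657780 % (1024 * 1024 / 8) ))   # block within that chunk -> 2420, the -R argument for pg_filedump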
Now I want to fix just that one block, even if the rows on it are lost, as the table data is not detail-critical (massive amounts of lost data would be critical, a small loss is tolerable).

Now to the actual fixing: from my searches it looks like zeroing out the complete block plus reindexing the table is the recommended solution, if it is not possible to simply drop the table and restore from backup. In my case that is not possible: this error has been there since October of last year, all our backups have failed from then until now, and nobody checked the logs until I tried to upgrade postgres via slony and failed because of this error... rather telling for the importance of this DB.

So, how to zero out the page? The following message shows a way to do it:

http://www.webservertalk.com/archive142-2004-7-319751.html

Basically, set the 'zero_damaged_pages' setting to 'on':

http://www.postgresql.org/docs/8.1/interactive/runtime-config-developer.html

... and select that page again. Unfortunately this did not work for me... it looks like if the page header is not corrupt but only individual tuples are a problem, the 'zero_damaged_pages' trick doesn't do anything.

Next try:

http://usr-share-man.org/man1/dd.html

The start of my bad block is at 8 * 2420 = 19360 KB, so I first dumped the block to cross check that I got the arguments right:

dd ibs=1024 if=db_path/base/16389/20418.5 count=8 skip=19360 of=cross_check_file

Then I used:

pg_filedump -if -R 0 cross_check_file

and diffed against the original dump, to make absolutely sure it's the right block... call me paranoid, but I'm not a dd expert and I had to be sure. It matched.

So now to the destructive part of the action:

dd ibs=1024 obs=1024 if=/dev/zero of=db_path/base/16389/20418.5 count=8 seek=19360 conv=notrunc

OK, I admit that on the first try I didn't put the 'conv=notrunc' there... so with all the former paranoia I still ended up with a truncated chunk. Luckily I remembered that I have a WAL-shipping based replica, so I recovered the rest of the truncated file from the replica's copy of the same file (a rough sketch of that recovery is in the P.S. below)... this being an insert-only table, I guess I was lucky that this was an option.

To my surprise, the same block on the replica was not mangled... I say to my surprise because on other occasions bad blocks have readily replicated over. In any case, if you have a WAL-shipped replica you might be lucky enough to recover the corrupt block(s) from there (or just switch over, but that is risky too: you can't know for sure what state the replica is in, and that is actually harder to investigate than the master, as you can execute no SQL on the replica).

Anyway, that's it... it looks like the problem is gone, and the DB will be moved to another box running postgres 8.2.4 via slony. I'm not sure how much of that table I mangled with my dd experiments; I hope others can learn from my mistakes...

Cheers,
Csaba.
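P.S. In case somebody else ends up with a truncated chunk the way I did: the recovery from the replica amounted to copying everything after the zeroed block from the replica's copy of the file back onto the master's. Roughly like the sketch below, reconstructed from memory rather than copied from my shell history; the 19368 KB offset is just my block's start (19360 KB) plus the 8 KB block itself, and the paths are of course specific to my setup, so double check everything before running anything like this:

# on the replica: grab everything after the zeroed 8 KB block (which starts at 19360 KB)
dd ibs=1024 if=db_path/base/16389/20418.5 skip=19368 of=chunk_tail
# on the master: write the tail back, starting right after the zeroed block
dd ibs=1024 obs=1024 if=chunk_tail of=db_path/base/16389/20418.5 seek=19368 conv=notrunc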