Hi all,

I just had the following error on one of our databases:

ERROR:  could not access status of transaction 1038286848
DETAIL:  could not open file "pg_clog/03DE": No such file or directory

I did some research on the mailing lists, and it looks like the usual
suspect is disk page corruption. There are a few previous discussions
about how to dump the suspect disk page, e.g.:

http://archives.postgresql.org/pgsql-general/2006-10/msg01372.php
http://groups.google.com/group/comp.databases.postgresql.hackers/browse_frm/thread/4988752a6939f45a/fd9f12468e86dd3?hl=en&lr=&ie=UTF-8&rnum=8&prev=/groups%3Fq%3Dpg_filedump%26hl%3Den%26lr%3D%26ie%3DUTF-8%26selm%3D20030922162322.E12708%2540quartz.newn.cam.ac.uk%26rnum%3D8#fd9f12468e86dd3

You can probably find more by searching for "ERROR:  could not access
status of transaction" or "pg_filedump".

What I could not find was a simple and fast way to find the bad block.
The error message itself is not useful in this regard, and the "binary
search" method is anything but fast on a big table.

So I wrote the following plpgsql function:

create or replace function 
  find_bad_block(p_TableName text)
  returns tid
as $find_bad_block$
  declare
    result tid;
    crtRow record;
    count bigint := 0;
  begin
    -- Scan the whole table; fetching each row forces a visibility
    -- check, which is exactly what blows up on a corrupt tuple.
    for crtRow 
    in execute 'select ctid from ' || p_TableName
    loop 
      result := crtRow.ctid;
      count := count + 1;
      -- Progress report, so you can see it is still alive on big tables.
      if count % 500000 = 0 then
        raise notice 'rows processed: %', count;
      end if;
    end loop;
    return result;
  exception
    when OTHERS then
        -- Report the error and return the ctid of the last row that
        -- could still be read; the bad block is at or right after it.
        raise notice '%: %', SQLSTATE, SQLERRM;
        return result;
  end;
$find_bad_block$
language plpgsql;


This will spit out the error plus the last block it visited. If there's
no bad block, you won't get the notice with the error, just the last
block of the table. It worked fine for me, resulting in:

NOTICE:  58P01: could not access status of transaction 1038286848
 find_bad_block
----------------
   (657780,157)
(1 row)
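
For reference, the call that produced the above was simply (with the
real table name, of course):

select find_bad_block('table_name');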

Now to finding the file I should dump:

select oid from pg_database where datname = 'dbname';
  oid
-------
 16389
(1 row)

select relfilenode from pg_class where relname='table_name';
 relfilenode
-------------
       20418
(1 row)
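
The two lookups can also be combined into a single query - assuming you
run it while connected to the database in question:

select d.oid as db_oid, c.relfilenode
from pg_database d, pg_class c
where d.datname = current_database()
  and c.relname = 'table_name';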

The file is then 'db_path/base/16389/20418'. Or rather a collection of
'20418.x' files if the table's data exceeds 1 GB, each file being a
1 GB chunk of the table... so which one to dump?
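
A quick look at the data directory shows which chunks there are to
choose from (paths as above):

ls -l db_path/base/16389/20418*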

First calculate the block count in one chunk: 1024 * 1024 / block_size,
where block_size is the block size in KB with which postgres was
compiled. That's normally 8, but we have systems where it is set to 32.
If you didn't change that yourself, it is likely 8.
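
If your server is recent enough you don't even have to guess, you can
ask it directly - note that it reports the size in bytes, not KB, so a
standard build shows 8192:

show block_size;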

So in my case the block count per chunk was: 1024 * 1024 / 8 = 131072.

So the chunk number will be (integer division): 

block_id / blocks_per_chunk = 657780 / 131072 = 5

So I should look in the file 'db_path/base/16389/20418.5'... but wait,
the block id has to be made relative to the chunk file's start:
chunk_block_id = block_id % blocks_per_chunk

So the block id to use with pg_filedump is: 657780 % 131072 = 2420
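
Both steps are plain integer arithmetic, so you can also let the server
do the math as a sanity check:

select 657780 / 131072 as chunk_no,
       657780 % 131072 as block_in_chunk;

 chunk_no | block_in_chunk
----------+----------------
        5 |           2420
(1 row)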

So my final command line was:

pg_filedump -if -R 2420 db_path/base/16389/20418.5

resulting in something like:

[snip]
<Data> ------
 Item   1 -- Length:   44  Offset: 8148 (0x1fd4)  Flags: USED
  XMIN: 1048587  CMIN: 90130188  XMAX: 0  CMAX|XVAC: 1036845056
  Block Id: 655376  linp Index: 18451   Attributes: 1375   Size: 0
  infomask: 0x0200 (XMIN_INVALID)
  Error: Computed header length not equal to header size.
         Computed <28>  Header: <0>
 
  1fd4: 0b001000 0c475f05 00000000 0000cd3d  .....G_........=
  1fe4: 0a001000 13485f05 00020000 2b030000  .....H_.....+...
  1ff4: 2d030000 00000000 01000000           -...........
 
[snip]

So I found my bad block, and the previous and following blocks looked OK. 

Now I want to fix just that one block even if the rows on it are lost,
as the table data is not detail-critical (massive amounts of lost data
would be critical, small loss is tolerable).

Now to the actual fixing: from my searches it looks like zeroing out the
complete block and then reindexing the table is the recommended solution
if it is not possible to simply drop the table and restore from backup.
In my case that was not possible: this error has been there since last
October, and all our backups have failed from then until now - and
nobody checked the logs until now, when I tried to upgrade postgres via
slony and failed because of this error. Rather telling for the
importance of this DB.

So, how to zero out the page?

The following message shows a way to do it:

http://www.webservertalk.com/archive142-2004-7-319751.html

Basically set the 'zero_damaged_pages' setting to 'on':

http://www.postgresql.org/docs/8.1/interactive/runtime-config-developer.html

... and select from that page again. Unfortunately this did not work
for me... it looks like 'zero_damaged_pages' only kicks in when the page
header itself is corrupt; if the header is fine and only individual
tuples are broken, it does nothing.
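
For the record, my attempt looked roughly like this (same placeholder
table name as before):

set zero_damaged_pages = on;
select count(*) from table_name;  -- a seq scan forces every page to be read

The select just failed with the same transaction status error as before:
the page header checks all passed, so nothing got zeroed.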

Next try:

http://usr-share-man.org/man1/dd.html

The start of my bad block is at: 8 * 2420 = 19360K, so I first dumped
the block to cross-check that I got the arguments right:

dd ibs=1024 if=db_path/base/16389/20418.5 count=8 skip=19360 \
   of=cross_check_file

Then I used:

pg_filedump -if -R 0 cross_check_file

and diffed it against the original dump, to make absolutely sure it's
the right block... call me paranoid, but I'm not a dd expert and I had
to be sure. It matched.
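
For anyone repeating this, the cross check amounted to something like
the following (the output file names are just what I picked):

pg_filedump -if -R 2420 db_path/base/16389/20418.5 > block_orig.txt
pg_filedump -if -R 0 cross_check_file > block_copy.txt
diff block_orig.txt block_copy.txt

Apart from the header line carrying the block number, the two dumps
should be identical.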

So now to the destructive part of the action:

dd ibs=1024 obs=1024 if=/dev/zero of=db_path/base/16389/20418.5 count=8 \
   seek=19360 conv=notrunc
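
If you go down this route, it's worth double-checking that the zeroing
took. For example, re-reading the block should now show nothing but
zeros - od prints a single data line plus a '*' for an all-zero input:

dd ibs=1024 if=db_path/base/16389/20418.5 count=8 skip=19360 2>/dev/null \
  | od -A d -x

After that, the reindex step mentioned earlier ('reindex table
table_name;') takes care of any index entries still pointing into the
zeroed block.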

Ok, I admit that on the first try I didn't put the 'conv=notrunc' in
the dd command above... so despite all the former paranoia I still
ended up with a truncated chunk. Luckily I remembered that I have a WAL
logging based replica, so I recovered the rest of the truncated file
from the same file on the replica... this being an insert-only table, I
guess I was lucky that this was an option at all. To my surprise, the
same block on the replica was not mangled... I say to my surprise
because on other occasions bad blocks have readily replicated over. In
any case, if you have a WAL logged replica you might be lucky and able
to recover the corrupt block(s) from there (or just switch over, but
that is risky too: you can't know for sure what state the replica is
in, and that is actually harder to investigate than on the master, as
you can't execute any SQL on the replica).
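
For completeness, the tail recovery itself was conceptually simple:
append everything past the truncation point from the replica's copy of
the file. Assuming dd left the file truncated right after the zeroed
block (at 19360K + 8K = 19368K here), it comes down to something like
this - the replica path is hypothetical, and the relfilenode may well
differ over there:

dd ibs=1024 skip=19368 if=replica_path/base/16389/20418.5 \
   >> db_path/base/16389/20418.5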

Anyway, that's it... it looks like the problem is gone, and the DB will
be moved to another box and to postgres 8.2.4 via slony. I'm not sure
how much of that table I mangled with my dd experiments; I hope others
can learn from my mistakes...

Cheers,
Csaba.



