Hello,

Although this might belong on pgsql-bugs or pgsql-general, let me ask here because the problem probably requires a fix to PostgreSQL's code.


[Problem]
I'm using PostgreSQL 9.1.6 on Linux. I encountered a serious problem: media recovery failed with the following message:

FATAL: archive file "000000010000008000000028" has wrong size: 7340032 instead of 16777216

I'm using the normal cp command to archive WAL files. That is:

   archive_command = '/path/to/my_script.sh "%p" "/backup/archive_log/%f"'

<<my_script.sh>>
--------------------------------------------------
#!/bin/sh
some processing...
cp "$1" "$2"
other processing...
--------------------------------------------------


The media recovery became necessary because of a power failure: the disk drive that stored $PGDATA failed after the power went out. I replaced the failed disk and performed media recovery by creating recovery.conf and running pg_ctl start. However, pg_ctl start failed with the above error message.



[Cause]
The cause is clear from the message. PostgreSQL refuses to continue media recovery when it finds an archived WAL file whose size is not 16 MB (16777216 bytes). The relevant code is in src/backend/access/transam/xlog.c:

--------------------------------------------------
  if (expectedSize > 0 && stat_buf.st_size != expectedSize)
  {
   int   elevel;

   /*
    * If we find a partial file in standby mode, we assume it's
    * because it's just being copied to the archive, and keep
    * trying.
    *
    * Otherwise treat a wrong-sized file as FATAL to ensure the
    * DBA would notice it, but is that too strong? We could try
    * to plow ahead with a local copy of the file ... but the
    * problem is that there probably isn't one, and we'd
    * incorrectly conclude we've reached the end of WAL and we're
    * done recovering ...
    */
   if (StandbyMode && stat_buf.st_size < expectedSize)
    elevel = DEBUG1;
   else
    elevel = FATAL;
   ereport(elevel,
     (errmsg("archive file \"%s\" has wrong size: %lu instead of %lu",
       xlogfname,
       (unsigned long) stat_buf.st_size,
       (unsigned long) expectedSize)));
   return false;
  }
--------------------------------------------------
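
As an aside, the same comparison can be made from outside the server before starting recovery. The following is only a sketch: list_wrong_sized_segments is a hypothetical helper name, the 16777216-byte default assumes the default WAL segment size, and the file-name test assumes segment files are named with 24 uppercase hex digits, as in the message above.

```shell
#!/bin/sh
# Hypothetical helper (not part of PostgreSQL): print the name and size of
# every archived WAL segment whose size differs from the expected size.
# $1 = archive directory, $2 = expected size in bytes (default 16777216).
list_wrong_sized_segments() {
    archive_dir=$1
    expected=${2:-16777216}
    for f in "$archive_dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        # Assumption: segment files are named with 24 uppercase hex digits;
        # skip .history, .backup and any other files.
        printf '%s\n' "$name" | grep -Eq '^[0-9A-F]{24}$' || continue
        # wc -c reports the size in bytes; tr removes any padding spaces.
        size=$(wc -c < "$f" | tr -d ' ')
        if [ "$size" -ne "$expected" ]; then
            printf '%s %s\n' "$name" "$size"
        fi
    done
}
```

Calling list_wrong_sized_segments /backup/archive_log prints one line per mismatching segment. For the file in the error message, that line would be "000000010000008000000028 7340032" (7340032 bytes is exactly 7 MB).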


[How to fix]
An archived file can end up smaller than its expected size for at least two reasons:

1. The power fails while archive_command is copying files (as in my case).
2. An immediate shutdown (pg_ctl stop -mi) is performed while archive_command is copying files. In this case, cp or an equivalent copy command is cancelled by the SIGQUIT sent by the postmaster.
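
On the archiving side, the window for the second case can be narrowed by copying to a temporary name and renaming only after the copy has finished. The following is a minimal sketch, not a tested replacement for my_script.sh: archive_segment and the .tmp.$$ suffix are hypothetical, and it assumes the temporary file and the final file are in the same directory so that mv is a plain rename. It does not flush data to disk, so it gives no guarantee against the power-failure case on its own.

```shell
#!/bin/sh
# Hypothetical helper: copy a WAL segment into the archive so that the
# final file name only appears once the copy has completed.
# $1 = source path (%p), $2 = destination path in the archive.
archive_segment() {
    src=$1
    dst=$2
    tmp="$dst.tmp.$$"
    # Never overwrite a file that is already in the archive.
    [ ! -e "$dst" ] || return 1
    # If cp is interrupted or fails, only the temporary name is affected.
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    # Within one directory, mv is a rename, so recovery never sees a
    # half-written file under the final name.
    mv "$tmp" "$dst"
}
```

In my_script.sh this would replace the plain cp line with archive_segment "$1" "$2". This only reduces the chance of creating a partial file; it does not change how recovery reacts to a partial file that already exists in the archive.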

Therefore, I think postgres must continue recovery by fetching the file from pg_xlog/ when it encounters a partially filled archive file. In addition, it may be necessary to remove the partially filled archive file, because it might prevent media recovery in the future (is this true?). I mean we need the following fix. What do you think?

--------------------------------------------------
  if (expectedSize > 0 && stat_buf.st_size != expectedSize)
  {
   int   elevel;
...
   if (StandbyMode && stat_buf.st_size < expectedSize)
    elevel = DEBUG1;
   else
   {
    elevel = LOG;
    unlink(xlogpath);
   }
   ereport(elevel,
     (errmsg("archive file \"%s\" has wrong size: %lu instead of %lu",
       xlogfname,
       (unsigned long) stat_buf.st_size,
       (unsigned long) expectedSize)));
   return false;
  }
--------------------------------------------------


I've heard that the next minor release is scheduled for this weekend. I really hope this problem can be fixed in that release. If you wish, I'll post the patch tomorrow or the day after. Could you include the fix in the weekend release?


Regards
MauMau



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)