Re: [GENERAL] pg_dump generating unrestorable data (8.4)

2011-04-06 Thread Eric McKeeth
On Sun, Apr 3, 2011 at 12:49 PM, Glenn Maynard  wrote:

> After dumping a database (pg_dump -F c database > dump), trying to restore
> it (pg_restore dump) gives:
>
> > pg_restore: [archiver (db)] Error from TOC entry 2463; 0 58451 TABLE DATA
> table user
> > pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence
> for encoding "UTF8": 0xe3273a
> > HINT:  This error can also happen if the byte sequence does not match the
> encoding expected by the server, which is controlled by "client_encoding".
> > CONTEXT:  COPY table, line 1
>
> The surface reason for this is clear enough: invalid UTF-8 data crept into
> some tsvector columns.  This is a much more serious problem, however: the
> backup tools for the database are, without warning, generating data that
> can't be restored.
>
> When in a data recovery situation, a backup that won't restore is
> catastrophic.  I can't restore the database to a state it was in at the time
> of the backup; I have to spend hours of downtime figuring out what to do to
> make something usable out of my backup; and then I have to hope I've
> corrected the backup correctly before bringing the server back online.  (If
> I was in an actual backup recovery situation--fortunately I'm not--I'd be
> more inclined to edit the Postgresql source to disable this check while
> restoring the backup than to risk trying to manually fix the backup data
> directly, which is very easy to get wrong.)
>

Two questions come to my mind on reading this. 1st, are you certain that the
database you're restoring to has the same encoding as the source database?
I'd have a hard time considering it an error if a dump from a database with
SQL_ASCII or some non-unicode encoding failed to restore to a UTF8 encoded
database in this manner, for example. And 2nd, does specifying the client
encoding when making the backup, with 'pg_dump -E UTF8', produce behavior
closer to what you would have expected?

-Eric


[GENERAL] pg_dump generating unrestorable data (8.4)

2011-04-03 Thread Glenn Maynard
After dumping a database (pg_dump -F c database > dump), trying to restore
it (pg_restore dump) gives:

> pg_restore: [archiver (db)] Error from TOC entry 2463; 0 58451 TABLE DATA
table user
> pg_restore: [archiver (db)] COPY failed: ERROR:  invalid byte sequence for
encoding "UTF8": 0xe3273a
> HINT:  This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
> CONTEXT:  COPY table, line 1

The surface reason for this is clear enough: invalid UTF-8 data crept into
some tsvector columns.  This is a much more serious problem, however: the
backup tools for the database are, without warning, generating data that
can't be restored.

When in a data recovery situation, a backup that won't restore is
catastrophic.  I can't restore the database to a state it was in at the time
of the backup; I have to spend hours of downtime figuring out what to do to
make something usable out of my backup; and then I have to hope I've
corrected the backup correctly before bringing the server back online.  (If
I was in an actual backup recovery situation--fortunately I'm not--I'd be
more inclined to edit the Postgresql source to disable this check while
restoring the backup than to risk trying to manually fix the backup data
directly, which is very easy to get wrong.)

Recommendations:

- So long as there's any possibility of this happening, it should be
possible to force Postgresql to ignore this error, to guarantee that backups
can be restored.  If the database allowed this situation to happen in the
first place, then it should allow it to be restored from a backup too.
- There was no warning of any problem when the backup was made.  pg_dump
should warn about potential recovery problems in the data it's outputting.
This should be a fatal error unless explicitly overridden, so backups that
can't be restored won't be generated accidentally.

I don't know how this data entered the database in the first place.  The
problematic data is (so far) entirely in tsvectors, generated from
to_tsvector and/or tsvector_update_trigger, but I havn't reproduced it.
Updating the rows in question on the server (to cause its
tsvector_update_trigger to be fired) fixes the problematic tsvector
columns.  For what it's worth, 0xe3273a in the dump is within the string
"'x':17" where x is \xe3, the first byte of the UTF-8 representation of
U+30FC "ー".  (If this sounds like a known or fixed problem I'd be interested
to know, but this sort of problem in a minor subsystem like FTS shouldn't be
able to silently break backups in the first place.)

-- 
Glenn Maynard