Hello fine PostgreSQL bug-busters, I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte, causing data corruption in backups and transferred data.
First, the environment -
PostgreSQL 8.4.4 on i386-apple-darwin10.3.0, compiled by GCC
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5646) (dot 1),
64-bit
Mac OS X 10.6.4
[ste...@xxx:~]% psql --version
psql (PostgreSQL) 8.4.4
contains support for command-line editing
Now, the setup:
Name | Owner | Encoding | Collation |
Ctype | Access privileges | Size | Tablespace | Description
baddb | xxxxxxx_production | UTF8 | en_US.utf-8 |
en_US.utf-8 | | 207 MB | pg_default |
baddb=> create table badtable (a int, b int, c character varying, d character
varying, e character varying, f character varying[], g text, h character
varying[],i character varying[], j character varying[], k character varying[],
l character varying[], m character varying[], n character varying[],o character
varying, p character varying);
baddb=> \copy badtable from '/tmp/data.copy'
baddb=> \copy badtable to '/tmp/badness.copy'
baddb=> \copy badtable from '/tmp/badness.copy'
ERROR: invalid byte sequence for encoding "UTF8": 0xcf2c
HINT: This error can also happen if the byte sequence does not match the
encoding expected by the server, which is controlled by "client_encoding".
CONTEXT: COPY badtable, line 1
Obviously, this wouldn't be too helpful without the datafile in question:
1 2377510 FOURSQUARE 1403504 Pizza Hut {} \N {}
{} {} {Pizza} {πίτσα,hut,food,ζωγράφου,pizza,eat,zografou} {}
\N \N \N
Since this is likely to be eaten by various mail clients or lost in
translation, please find attached a TGZ of the data file in question.
data.tgz
Description: Binary data
The hexdump shows exactly what goes wrong:
[ste...@xxx:/tmp]% hexdump -C data.copy
00000000 31 09 32 33 37 37 35 31 30 09 46 4f 55 52 53 51 |1.2377510.FOURSQ|
00000010 55 41 52 45 09 31 34 30 33 35 30 34 09 50 69 7a |UARE.1403504.Piz|
00000020 7a 61 20 48 75 74 09 7b 7d 09 5c 4e 09 7b 7d 09 |za Hut.{}.\N.{}.|
00000030 7b 7d 09 7b 7d 09 7b 50 69 7a 7a 61 7d 09 7b cf |{}.{}.{Pizza}.{.|
00000040 80 ce af cf 84 cf 83 ce b1 2c 68 75 74 2c 66 6f |.........,hut,fo|
00000050 6f 64 2c ce b6 cf 89 ce b3 cf 81 ce ac cf 86 ce |od,.............|
00000060 bf cf 85 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f |...,pizza,eat,zo|
00000070 67 72 61 66 6f 75 7d 09 7b 7d 09 5c 4e 09 5c 4e |grafou}.{}.\N.\N|
00000080 09 5c 4e 0a |.\N.|
00000084
[ste...@xxx:/tmp]% hexdump -C badness.out
00000000 31 09 32 33 37 37 35 31 30 09 46 4f 55 52 53 51 |1.2377510.FOURSQ|
00000010 55 41 52 45 09 31 34 30 33 35 30 34 09 50 69 7a |UARE.1403504.Piz|
00000020 7a 61 20 48 75 74 09 7b 7d 09 5c 4e 09 7b 7d 09 |za Hut.{}.\N.{}.|
00000030 7b 7d 09 7b 7d 09 7b 50 69 7a 7a 61 7d 09 7b cf |{}.{}.{Pizza}.{.|
00000040 80 ce af cf 84 cf 83 ce b1 2c 68 75 74 2c 66 6f |.........,hut,fo|
00000050 6f 64 2c ce b6 cf 89 ce b3 cf 81 ce ac cf 86 ce |od,.............|
00000060 bf cf 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f 67 |..,pizza,eat,zog|
00000070 72 61 66 6f 75 7d 09 7b 7d 09 5c 4e 09 5c 4e 09 |rafou}.{}.\N.\N.|
00000080 5c 4e 0a |\N.|
Note offset 0x62:
00000060 bf cf 85 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f |...,pizza,eat,zo|
00000060 bf cf 2c 70 69 7a 7a 61 2c 65 61 74 2c 7a 6f 67 |..,pizza,eat,zog|
The 0xCF85 multibyte UTF-8 character was truncated to 0xCF! When I try to
reimport it, it goes in as CF2C (the 2C is the following comma) which is not
valid and matches the error printed by the client.
I can reproduce this problem using the JDBC driver as well, so I do not think
it is a PSQL bug but instead a PostgreSQL backend problem. I could be wrong,
of course.
Interestingly enough, this problem *does not* happen on a different machine -
PostgreSQL 8.4.4 on x86_64-pc-linux-gnu, compiled by GCC gcc-4.4.real (Ubuntu
4.4.3-4ubuntu5) 4.4.3, 64-bit
(stock Ubuntu package)
I can reliably reproduce it on two different Macs, though.
As with all random email posters, I'm not actually subscribed to the list -
please keep me on the CC for replies.
I would very much appreciate help tracking this down! Thanks for your time :)
Steven Schlansker
-- Sent via pgsql-bugs mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-bugs
