Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-11-01 Thread Paul Lindner
On Sun, Oct 30, 2005 at 11:49:41AM -0500, Gregory Maxwell wrote:
> On 10/26/05, Christopher Kings-Lynne <[EMAIL PROTECTED]> wrote:
> > > iconv -c -f UTF8 -t UTF8
> > recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
> I've got a file with characters that pg won't accept that recode does
> not f

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-30 Thread Gregory Maxwell
On 10/26/05, Christopher Kings-Lynne <[EMAIL PROTECTED]> wrote:
> > iconv -c -f UTF8 -t UTF8
> recode UTF-8..UTF-8 < dump_in.sql > dump_out.sql
I've got a file with characters that pg won't accept that recode does not fix but iconv does. Iconv is fine for my application, so I'm just posting to t

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-27 Thread Andrew - Supernews
On 2005-10-27, Paul Lindner <[EMAIL PROTECTED]> wrote:
> On Mon, Oct 24, 2005 at 05:07:40AM -, Andrew - Supernews wrote:
>> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
>> was never actually a valid utf-8 string, and that the d2 b9 is only valid
>> by coincidence (it'
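For anyone who wants to check the byte sequence quoted above for themselves, here is a small sketch using only Python's built-in UTF-8 codec. The names `SUSPECT` and `is_valid_utf8` are made up for illustration, and Python's decoder is not the validation code in 8.0 or 8.1, so treat it as a sanity check on the bytes rather than a reproduction of what the server does.

```python
# The eight bytes discussed in the thread, written as hex.
SUSPECT = bytes.fromhex("c1 f9 d4 c2 d0 c7 d2 b9")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# The sequence as a whole is rejected by a strict decoder ...
whole_ok = is_valid_utf8(SUSPECT)

# ... while the last two bytes (d2 b9) happen to form a valid
# two-byte sequence on their own.
tail_ok = is_valid_utf8(SUSPECT[-2:])
tail_char = SUSPECT[-2:].decode("utf-8")

# Dropping undecodable bytes leaves only that one character behind.
survivors = SUSPECT.decode("utf-8", errors="ignore")

print(whole_ok, tail_ok, hex(ord(tail_char)), len(survivors))
```

The bytes c1 and f9 can never start a UTF-8 sequence, and d4, c2, d0 and c7 are each lead bytes followed by something that is not a continuation byte, which fits the suspicion that the string was never UTF-8 to begin with.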

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread jtv
Andrej Ricnik-Bay wrote:
> How about an ugly kludge ...
>
> split -a 3 -d -b 1048576 ../path/to/dumpfile dumpfile
> for i in `ls -1 dumpfile*`; do iconv -c -f UTF8 -t UTF8 $i;done
> cat dumpfile* > new_dump
Not with UTF-8... You might break in the middle of a multibyte character.
Jeroen
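One way around the chunk-boundary problem is a filter that carries an incomplete multibyte sequence over to the next chunk. Below is a rough sketch in Python, not something proposed in the thread: `strip_invalid_utf8` and `chunk_size` are names invented here, and the `errors="ignore"` policy of Python's codec is only assumed to be roughly comparable to `iconv -c`, so the two may differ on edge cases.

```python
import codecs
import io


def strip_invalid_utf8(src, dst, chunk_size=1 << 20):
    """Copy src to dst in fixed-size chunks, dropping invalid UTF-8 bytes.

    src and dst are binary file-like objects. The incremental decoder
    buffers a multibyte sequence that is cut off at the end of a chunk,
    so a character split across two reads is not mistaken for garbage.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="ignore")
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(decoder.decode(chunk).encode("utf-8"))
    # Flush the decoder; a dangling partial sequence at EOF is dropped.
    dst.write(decoder.decode(b"", final=True).encode("utf-8"))


# Small demonstration: the chunk size is chosen so that the read
# boundary falls between d2 and b9, the two bytes of one character.
demo_in = io.BytesIO(b"ab\xc1\xf9\xd2\xb9cd")
demo_out = io.BytesIO()
strip_invalid_utf8(demo_in, demo_out, chunk_size=5)
cleaned = demo_out.getvalue()
```

Since only one chunk is held in memory at a time, calling the function with `sys.stdin.buffer` and `sys.stdout.buffer` should let it sit in a pipe between pg_dump and psql.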

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread Christopher Kings-Lynne
However I'm running into another problem now. The command: iconv -c -f UTF8 -t UTF8 does strip out the invalid characters. However, iconv reads the entire file into memory before it writes out any data. This is not so good for multi-gigabyte dump files and doesn't allow for it to be used

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread Andrej Ricnik-Bay
> does strip out the invalid characters. However, iconv reads the
> entire file into memory before it writes out any data. This is not so
> good for multi-gigabyte dump files and doesn't allow for it to be used
> in a pipe between pg_dump and psql.
>
> Anyone have any other recommendations?
GNU

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-26 Thread Paul Lindner
On Mon, Oct 24, 2005 at 05:07:40AM -, Andrew - Supernews wrote:
>
> I'm inclined to suspect that the whole sequence c1 f9 d4 c2 d0 c7 d2 b9
> was never actually a valid utf-8 string, and that the d2 b9 is only valid
> by coincidence (it's a Cyrillic letter from Azerbaijani). I know the 8.0
>

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-23 Thread Andrew - Supernews
On 2005-10-24, Paul Lindner <[EMAIL PROTECTED]> wrote:
> Here's a cut and paste from emacs hexl-mode:
>
> : 3530 3833 6335 3038 330a 3c20 5641 4c55 5083c5083.< VALU
> 0010: 4553 2028 3230 3235 3533 2c20 27c1 f9d4 ES (202553, '...
> 0020: c2d0 c7d2 b927 2c20 0a2d 2d2d 0a3e 2056 ..

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-23 Thread Christopher Kings-Lynne
Thanks go out to John Hansen, he recommended to run the dump through iconv:
iconv -c -f UTF8 -t UTF8 -o fixed.sql dump.sql
This seems to strip out invalid UTF8 and will allow for a clean import. Someone should add this to the Release Notes/FAQ..
Yes I think that's extremely important to put

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-23 Thread Paul Lindner
On Sun, Oct 23, 2005 at 05:56:50AM -, Andrew - Supernews wrote:
> On 2005-10-22, Paul Lindner <[EMAIL PROTECTED]> wrote:
> > I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> > restore these results in
> >
> > Invalid UNICODE byte sequence detected near byte ...
>
> What w

Re: [HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-22 Thread Andrew - Supernews
On 2005-10-22, Paul Lindner <[EMAIL PROTECTED]> wrote:
> I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to
> restore these results in
>
> Invalid UNICODE byte sequence detected near byte ...
What were the exact offending bytes?
> Question:
>
> Does the 8.1 Unicode sanity code a

[HACKERS] Differences in UTF8 between 8.0 and 8.1

2005-10-22 Thread Paul Lindner
I've been doing some test imports of UNICODE databases into Postgres 8.1beta3. The only problem I've seen is that some data from 8.0 databases will not import. I've generated dumps using pg_dump from 8.0 and 8.1. Attempting to restore these results in Invalid UNICODE byte sequence detected nea