Re: [GENERAL] Trouble with UTF-8 data

2008-01-21 Thread Albe Laurenz
Janine Sisk wrote:
 0xEDA7A1 (UTF-8) corresponds to UNICODE code point 0xD9E1, which,
 when interpreted as a high surrogate and followed by a low surrogate,
 would correspond to the UTF-16 encoding of a code point
 between 0x88400 and 0x887FF (depending on the value of the low surrogate).

 These code points do not correspond to any valid character.
 So - unless there is a flaw in my reasoning - there's something
 fishy with these data anyway.

 Janine, could you give us a hex dump of that line from the copy statement?
 
 Certainly.  Do you want to see it as it came from the old database,  
 or after I ran it through iconv?  Although iconv wasn't able to solve  
 this problem it did fix others in other tables;  unfortunately I have  
 no way of knowing if it also mangled some data at the same time.

Both; but the before dump is of course more likely to give a clue.

Yours,
Laurenz Albe

---(end of broadcast)---
TIP 4: Have you searched our list archives?

   http://archives.postgresql.org/


Re: [GENERAL] Trouble with UTF-8 data

2008-01-18 Thread Albe Laurenz
Tom Lane wrote:
 But I'm still getting this error when loading the data into the new  
 database:
 
 ERROR:  invalid byte sequence for encoding UTF8: 0xeda7a1
 
 The reason PG doesn't like this sequence is that it corresponds to
 a Unicode surrogate pair code point, which is not supposed to
 ever appear in UTF-8 representation --- surrogate pairs are a kluge for
 UTF-16 to deal with Unicode code points of more than 16 bits.

0xEDA7A1 (UTF-8) corresponds to UNICODE code point 0xD9E1, which,
when interpreted as a high surrogate and followed by a low surrogate,
would correspond to the UTF-16 encoding of a code point
between 0x88400 and 0x887FF (depending on the value of the low surrogate).

These code points do not correspond to any valid character.
So - unless there is a flaw in my reasoning - there's something
fishy with these data anyway.
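
The arithmetic above can be checked with a short Python sketch. The helper
names decode_utf8_3byte and combine_surrogates are made up for this
illustration; the bit manipulation follows the standard UTF-8 and UTF-16
layouts, and the decoder deliberately does not reject surrogates the way a
strict UTF-8 decoder would.

```python
def decode_utf8_3byte(seq: bytes) -> int:
    """Decode a 3-byte UTF-8-style sequence without rejecting surrogates."""
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = seq
    if b0 & 0xF0 != 0xE0 or b1 & 0xC0 != 0x80 or b2 & 0xC0 != 0x80:
        raise ValueError("not a 3-byte UTF-8 pattern")
    # 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


cp = decode_utf8_3byte(bytes.fromhex("EDA7A1"))
print(hex(cp))                              # 0xd9e1
print(hex(combine_surrogates(cp, 0xDC00)))  # 0x88400
print(hex(combine_surrogates(cp, 0xDFFF)))  # 0x887ff
```

The two extremes of the low surrogate (0xDC00 and 0xDFFF) give the bounds of
the range quoted above.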

Janine, could you give us a hex dump of that line from the copy statement?

Yours,
Laurenz Albe



Re: [GENERAL] Trouble with UTF-8 data

2008-01-18 Thread Janine Sisk

On Jan 18, 2008, at 12:00 AM, Albe Laurenz wrote:


0xEDA7A1 (UTF-8) corresponds to UNICODE code point 0xD9E1, which,
when interpreted as a high surrogate and followed by a low surrogate,
would correspond to the UTF-16 encoding of a code point
between 0x88400 and 0x887FF (depending on the value of the low  
surrogate).


These code points do not correspond to any valid character.
So - unless there is a flaw in my reasoning - there's something
fishy with these data anyway.

Janine, could you give us a hex dump of that line from the copy  
statement?


Certainly.  Do you want to see it as it came from the old database,  
or after I ran it through iconv?  Although iconv wasn't able to solve  
this problem it did fix others in other tables;  unfortunately I have  
no way of knowing if it also mangled some data at the same time.


The version of iconv I have does know about UTF16 so I tried using  
that as the from encoding instead of UTF8, but the result had new  
errors in places where the original data was good, so that was  
obviously a step backwards.


BTW, in case it matters I found out I misidentified the version of PG  
this data came from - it's actually 7.3.6.


thanks,

janine




Re: [GENERAL] Trouble with UTF-8 data

2008-01-17 Thread Tom Lane
Janine Sisk [EMAIL PROTECTED] writes:
 But I'm still getting this error when loading the data into the new  
 database:

 ERROR:  invalid byte sequence for encoding UTF8: 0xeda7a1

The reason PG doesn't like this sequence is that it corresponds to
a Unicode surrogate pair code point, which is not supposed to
ever appear in UTF-8 representation --- surrogate pairs are a kluge for
UTF-16 to deal with Unicode code points of more than 16 bits.  See

http://en.wikipedia.org/wiki/UTF-16

I think you need a version of iconv that knows how to fold surrogate
pairs into proper UTF-8 form.  It might also be that the data is
outright broken --- if this sequence isn't followed by another
surrogate-pair sequence then it isn't valid Unicode by anybody's
interpretation.
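
For illustration, here is a minimal Python sketch of what "folding surrogate
pairs into proper UTF-8 form" could look like. It is not what iconv does and
has not been run against the dump in question; the function names
fold_surrogate_pairs and lone_surrogate_offsets are invented for this
example. It assumes the rest of the input is otherwise valid UTF-8, so that
a 0xED byte always starts a character.

```python
import re

# High surrogate (ED A0..AF xx) immediately followed by a low surrogate
# (ED B0..BF xx), each written as its own 3-byte sequence.
PAIR = re.compile(rb"\xed([\xa0-\xaf])([\x80-\xbf])\xed([\xb0-\xbf])([\x80-\xbf])")
# Any 3-byte sequence that encodes a surrogate code point (U+D800..U+DFFF).
LONE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def fold_surrogate_pairs(data: bytes) -> bytes:
    """Rewrite each encoded surrogate pair as one 4-byte UTF-8 sequence."""
    def repl(m):
        b1, b2, b4, b5 = (m.group(i)[0] for i in range(1, 5))
        high = 0xD000 | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
        low = 0xD000 | ((b4 & 0x3F) << 6) | (b5 & 0x3F)
        cp = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return chr(cp).encode("utf-8")
    return PAIR.sub(repl, data)


def lone_surrogate_offsets(data: bytes) -> list[int]:
    """Offsets (in the folded output) of surrogates that had no partner."""
    return [m.start() for m in LONE.finditer(fold_surrogate_pairs(data))]
```

Anything reported by lone_surrogate_offsets is the "outright broken" case:
a surrogate with no partner, which folding cannot repair.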

7.2.x unfortunately didn't check Unicode data carefully, and would
have let this data pass without comment ...

regards, tom lane



[GENERAL] Trouble with UTF-8 data

2008-01-17 Thread Janine Sisk

Hi all,

I'm moving a database from PG 7.2.4 to 8.2.6.  I have already run  
iconv on the dump file like so:


iconv -c -f UTF-8 -t UTF-8 -o out.dmp in.dmp

But I'm still getting this error when loading the data into the new  
database:


ERROR:  invalid byte sequence for encoding UTF8: 0xeda7a1
HINT:  This error can also happen if the byte sequence does not match  
the encoding expected by the server, which is controlled by  
client_encoding.

CONTEXT:  COPY article, line 2

FWIW this is the second database I've moved this way and for the  
first one, iconv fixed all the byte sequence errors.  No such luck  
this time.


The 7.2.4 database has encoding UNICODE, and the 8.2.6 one is in UTF-8.

To make matters even more fun, the data is in Traditional Chinese  
characters, which I don't read, so there seems to be no way for me to  
identify the problem bits.  I've loaded the dump file into a hex  
editor and searched for the value that's reported as the problem but  
it's not in the file.
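
One way to locate the problem bytes without reading the text is to let a
strict UTF-8 decoder point at them. The following Python sketch is only an
illustration: find_invalid_utf8 is a made-up helper name, and it relies on
Python 3's built-in decoder, which rejects encoded surrogates such as
0xeda7a1. It reports the byte offset of each rejected sequence together with
a few surrounding bytes in hex.

```python
def find_invalid_utf8(data: bytes, context: int = 8):
    """Return (offset, hex of nearby bytes) for each sequence the decoder rejects."""
    hits = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the remainder is valid UTF-8
        except UnicodeDecodeError as exc:
            bad = pos + exc.start
            window = data[max(0, bad - context): bad + context]
            hits.append((bad, window.hex(" ")))
            # Skip the rejected lead byte and any continuation bytes after it,
            # so one broken sequence is reported once.
            pos = bad + 1
            while pos < len(data) and data[pos] & 0xC0 == 0x80:
                pos += 1
    return hits


# Hypothetical usage on a dump file read in binary mode:
#   with open("in.dmp", "rb") as f:
#       for offset, nearby in find_invalid_utf8(f.read()):
#           print(offset, nearby)
```

Re-slicing the data after every hit is slow for a large dump with many bad
sequences, but it keeps the sketch short.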


Is there anything I can do to fix this?

Thanks in advance,

janine

