Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-20 Thread Steven Schlansker

On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
> Steven Schlansker ste...@trumpet.io writes:
>
>> I'm not at all experienced with character encodings so I could
>> be totally off base, but isn't it wrong to ever call isspace(0x85),
>> whatever the result may be, given that the actual character is 0xCF85?
>> (U+03C5, GREEK SMALL LETTER UPSILON)
>
> We generally assume that in server-safe encodings, the ctype.h functions
> will behave sanely on any single-byte value.  You can argue the wisdom
> of that, but deciding to change that policy would be a rather massive
> code change; I'm not excited about going that direction.

Fair enough.  I presume there are no server-safe encodings for which
a multibyte sequence 0xXX20 would be valid, since that would break anyway
(the second byte looks like a real space).

> You need a setlocale() call, else the program acts as though it's in C
> locale regardless of environment.

Sigh.  I hate C sometimes. :-p
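
A minimal sketch of the setlocale() point above. The helper name
isspace_in_locale is hypothetical and not from this thread; it only
shows that isspace() consults whatever LC_CTYPE was last installed by
setlocale(), and that a program which never calls setlocale() stays in
the "C" locale.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption for illustration): switch LC_CTYPE
 * to the named locale, then classify one byte with isspace().
 * Returns 1 or 0 for the isspace() result, or -1 if the locale is not
 * available on this system. Passing "" selects the locale named by the
 * environment (LANG, LC_CTYPE, LC_ALL).
 * Note: this changes the process-wide locale as a side effect.
 */
int isspace_in_locale(const char *locale_name, unsigned char byte)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return isspace(byte) != 0;
}
```

In the "C" locale, isspace_in_locale("C", 0x85) returns 0. Calling it
with "" instead uses the environment's locale, and the answer then
depends on the platform's locale data, which is the behavior under
discussion here.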

Anyway, it looks like this is actually a BSD bug which got copied and
pasted into Apple's Darwin source:

http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

I have a couple of contacts at Apple so I'll see if there's any interest in
backporting a fix, but I wouldn't hope for it to happen quickly if at all...

Thanks for taking a look into fixing this, I hope you guys can reach
consensus on how to get it fixed :)

Best,
Steven Schlansker

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-20 Thread Steven Schlansker
On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:

> Steven Schlansker ste...@trumpet.io writes:
>> I'm having a rather annoying problem - a particular string is causing the
>> Postgres COPY functionality to lose a byte, causing data corruption in
>> backups and transferred data.
>
> I was able to reproduce this on my own Mac.  Some tracing shows that the
> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
> This causes array_in to drop the final byte of the array element string,
> thinking that it's insignificant whitespace.
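
The failure mode described above can be sketched as follows. This is
not PostgreSQL's array_in code; trimmed_length, ascii_isspace and
isspace_with_0x85 are hypothetical names chosen for illustration. The
sketch assumes only what the message says: trailing bytes are dropped
one at a time while a whitespace test accepts them.

```c
#include <stddef.h>

/* Predicate type: nonzero if the byte should be treated as whitespace. */
typedef int (*space_pred)(unsigned char);

/* ASCII-only whitespace test, independent of the current locale. */
int ascii_isspace(unsigned char c)
{
    /* '\t', '\n', '\v', '\f', '\r' are the consecutive codes 9..13 */
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Stand-in for the reported behavior: 0x85 also counts as whitespace. */
int isspace_with_0x85(unsigned char c)
{
    return c == 0x85 || ascii_isspace(c);
}

/* Length of s after dropping trailing bytes accepted by is_space. */
size_t trimmed_length(const unsigned char *s, size_t len, space_pred is_space)
{
    while (len > 0 && is_space(s[len - 1]))
        len--;
    return len;
}
```

For the two bytes 0xCF 0x85, trimming with isspace_with_0x85 leaves one
byte (a dangling lead byte), while trimming with ascii_isspace leaves
both bytes intact.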

The 0x85 seems to be the second byte of a multibyte UTF-8
sequence.  I'm not at all experienced with character encodings so I could
be totally off base, but isn't it wrong to ever call isspace(0x85), 
whatever the result may be, given that the actual character is 0xCF85?
(U+03C5, GREEK SMALL LETTER UPSILON)
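
As a worked check of the byte arithmetic, here is a small sketch of
two-byte UTF-8 decoding. The function names are hypothetical; the bit
layout (110xxxxx 10xxxxxx) is the standard UTF-8 two-byte form.

```c
/* Nonzero if b is a UTF-8 continuation byte (bit pattern 10xxxxxx). */
int is_utf8_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/*
 * Decode a two-byte UTF-8 sequence (110xxxxx 10xxxxxx).
 * Returns the code point, or -1 if the bytes do not form a valid
 * two-byte sequence (wrong lead byte, wrong continuation byte, or an
 * overlong encoding of a value below U+0080).
 */
long decode_utf8_2byte(unsigned char lead, unsigned char cont)
{
    long cp;

    if ((lead & 0xE0) != 0xC0 || !is_utf8_continuation(cont))
        return -1;
    cp = ((long)(lead & 0x1F) << 6) | (long)(cont & 0x3F);
    return cp >= 0x80 ? cp : -1;
}
```

decode_utf8_2byte(0xCF, 0x85) yields 0x3C5, matching U+03C5. The byte
0x85 is a continuation byte and has no meaning on its own in UTF-8,
whereas 0x20 is never a continuation byte, so an ASCII space cannot
appear inside a multibyte UTF-8 sequence.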


> I believe that you must
> not have produced the data file data.copy on a Mac, or at least not in
> that locale setting, because array_out should have double-quoted the
> array element given that behavior of isspace().

Correct, it was produced on a Linux machine.  That said, the charset
there was also UTF-8.

 
> Now, it's probably less than sane for isspace() to be behaving that way,
> since in a UTF8-based locale 0x85 can't be regarded as a valid character
> code at all.  But I'm not hopeful about the results of filing a bug with
> Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious
> behaviors too, which they appear not to care much about.

I actually can't reproduce that behavior here:

#include <ctype.h>
#include <stdio.h>
int main(void) {
    printf("%d\n", isspace(0x85));
    return 0;
}

[ste...@xxx:~]% gcc -o test test.c
[ste...@xxx:~]% ./test
0
[ste...@xxx:~]% locale
LANG=en_US.utf-8
LC_COLLATE=en_US.utf-8
LC_CTYPE=en_US.utf-8
LC_MESSAGES=en_US.utf-8
LC_MONETARY=en_US.utf-8
LC_NUMERIC=en_US.utf-8
LC_TIME=en_US.utf-8
LC_ALL=
[ste...@xxx:~]% uname -a
Darwin xxx.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386 i386


Thanks much for your help,
Steven Schlansker

