Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Steven Schlansker ste...@trumpet.io writes:
> Anyway, it looks like this is actually a BSD bug which got copy + pasted into Apple's Darwin source - http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

I've applied a patch for this to HEAD and 9.0:
http://archives.postgresql.org/pgsql-committers/2010-08/msg00273.php

I'm reluctant to back-patch it into already-released branches, though. Given the lack of prior reports, the odds of breaking something for somebody in a minor release seem to outweigh the odds of doing good. But you could easily drop it into a local copy of 8.4 if you wish.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
> Steven Schlansker ste...@trumpet.io writes:
>> I'm not at all experienced with character encodings so I could be totally off base, but isn't it wrong to ever call isspace(0x85), whatever the result may be, given that the actual character is 0xCF85? (U+03C5, GREEK SMALL LETTER UPSILON)
>
> We generally assume that in server-safe encodings, the ctype.h functions will behave sanely on any single-byte value. You can argue the wisdom of that, but deciding to change that policy would be a rather massive code change; I'm not excited about going that direction.

Fair enough. I presume there are no server-safe encodings for which a multibyte sequence 0x XX20 would be valid - which would break anyway (as the second byte looks like a real space)

> You need a setlocale() call, else the program acts as though it's in C locale regardless of environment.

Sigh. I hate C sometimes. :-p

Anyway, it looks like this is actually a BSD bug which got copy + pasted into Apple's Darwin source - http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

I have a couple of contacts at Apple so I'll see if there's any interest in backporting a fix, but I wouldn't hope for it to happen quickly if at all...

Thanks for taking a look into fixing this, I hope you guys can reach consensus on how to get it fixed :)

Best,
Steven Schlansker
Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
> Steven Schlansker ste...@trumpet.io writes:
>> I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte, causing data corruption in backups and transferred data.
>
> I was able to reproduce this on my own Mac. Some tracing shows that the problem is that isspace(0x85) returns true when in locale en_US.utf-8. This causes array_in to drop the final byte of the array element string, thinking that it's insignificant whitespace.

The 0x85 seems to be the second byte of a multibyte UTF-8 sequence. I'm not at all experienced with character encodings so I could be totally off base, but isn't it wrong to ever call isspace(0x85), whatever the result may be, given that the actual character is 0xCF85? (U+03C5, GREEK SMALL LETTER UPSILON)

> I believe that you must not have produced the data file data.copy on a Mac, or at least not in that locale setting, because array_out should have double-quoted the array element given that behavior of isspace().

Correct, it was produced on a Linux machine. That said, the charset there was also UTF-8.

> Now, it's probably less than sane for isspace() to be behaving that way, since in a UTF8-based locale 0x85 can't be regarded as a valid character code at all. But I'm not hopeful about the results of filing a bug with Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious behaviors too, which they appear not to care much about.
I actually can't reproduce that behavior here:

    #include <ctype.h>
    #include <stdio.h>

    int main()
    {
        printf("%d\n", isspace(0x85));
        return 0;
    }

    [ste...@xxx:~]% gcc -o test test.c
    [ste...@xxx:~]% ./test
    0
    [ste...@xxx:~]% locale
    LANG=en_US.utf-8
    LC_COLLATE=en_US.utf-8
    LC_CTYPE=en_US.utf-8
    LC_MESSAGES=en_US.utf-8
    LC_MONETARY=en_US.utf-8
    LC_NUMERIC=en_US.utf-8
    LC_TIME=en_US.utf-8
    LC_ALL=
    [ste...@xxx:~]% uname -a
    Darwin xxx.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386 i386

Thanks much for your help,
Steven Schlansker
Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Tatsuo Ishii is...@sraoss.co.jp writes:
>> We generally assume that in server-safe encodings, the ctype.h functions will behave sanely on any single-byte value.
>
> I think this wisdom is only true for C locale. I'm not at all surprised that it does not work with non-C locales.
>
> From array_funcs.c:
>
>     while (isspace((unsigned char) *p))
>         p++;
>
> IMO this should be something like:
>
>     while (isspace((unsigned char) *p))
>         p += pg_mblen(p);

I don't think that's likely to help at all. The risk is that isspace will do something not-sane with a fragment of a character. If it's not coded to guard against that, it's just as likely to give wrong results for the leading byte as for non-leading bytes. (In the case at hand, I think the underlying problem is that it imagines what it's given is a Unicode code point, not a byte of a UTF8 string. There apparently aren't any code points in the range U+00C0 - U+00FF for which isspace is true, but that's not true for isalpha for example.)

If we were going to try to code around this, we'd need to change all these loops to look something like

    while ((isascii((unsigned char) *p) ||
            pg_database_encoding_max_length() == 1) &&
           isspace((unsigned char) *p))
        p += pg_mblen(p);    // or p++, it wouldn't matter

However, given the limited number of platforms where this is an issue and the fact that it is an acknowledged bug on those platforms, I'm not eager to go there.

In any case, no matter whether we changed that or not, we'd still have the problem that it's a bad idea to have any locale-dependent behavior in array_in; and the behavior *would* still be locale-dependent, at least in single-byte encodings.

regards, tom lane
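For readers who want to try the guarded loop outside the PostgreSQL tree, here is a minimal standalone sketch. It is not the server's code: `demo_encoding_max_length()` and `demo_utf8_mblen()` are hypothetical stand-ins for `pg_database_encoding_max_length()` and `pg_mblen()`, hard-wired to UTF-8, and `is_ascii_byte()` stands in for `isascii()`, which is not part of ISO C.

```c
#include <ctype.h>

/* Hypothetical stand-in for pg_database_encoding_max_length(): assume UTF-8. */
int demo_encoding_max_length(void)
{
    return 4;
}

/* Hypothetical stand-in for pg_mblen(): length of a UTF-8 character,
 * judged from its lead byte only. */
int demo_utf8_mblen(const char *s)
{
    unsigned char c = (unsigned char) *s;

    if (c < 0x80)
        return 1;
    if ((c & 0xE0) == 0xC0)
        return 2;
    if ((c & 0xF0) == 0xE0)
        return 3;
    if ((c & 0xF8) == 0xF0)
        return 4;
    return 1;                   /* invalid lead byte: advance one byte */
}

/* Portable replacement for isascii(). */
int is_ascii_byte(unsigned char c)
{
    return c < 0x80;
}

/* Skip leading whitespace, consulting isspace() only for bytes that are
 * ASCII, or for any byte when the encoding is single-byte. */
const char *skip_leading_space(const char *p)
{
    while ((is_ascii_byte((unsigned char) *p) ||
            demo_encoding_max_length() == 1) &&
           isspace((unsigned char) *p))
        p += demo_utf8_mblen(p);
    return p;
}
```

With this guard, a byte such as 0x85 never reaches isspace() in a multibyte encoding, so the locale's answer for it no longer matters. As noted above, the result still depends on the locale when the encoding is single-byte.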
Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Steven Schlansker ste...@trumpet.io writes:
> On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
>> We generally assume that in server-safe encodings, the ctype.h functions will behave sanely on any single-byte value. You can argue the wisdom of that, but deciding to change that policy would be a rather massive code change; I'm not excited about going that direction.
>
> Fair enough. I presume there are no server-safe encodings for which a multibyte sequence 0x XX20 would be valid - which would break anyway (as the second byte looks like a real space)

Right: our definition of a server-safe encoding is precisely that no byte of a multibyte character looks like ASCII, ie all bytes have their high bit set. We're essentially assuming that the ctype.h functions will all return false for any byte with the high bit set, if the selected encoding is multibyte.

> Anyway, it looks like this is actually a BSD bug which got copy + pasted into Apple's Darwin source - http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

Interesting. So the BSD people did fix it upstream?

regards, tom lane
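The server-safe rule described above can be checked mechanically for any one multibyte character. The following is only an illustrative sketch; `all_high_bit_set()` is a hypothetical name, not a PostgreSQL function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if every byte of the sequence has its high bit set, i.e. no
 * byte could be mistaken for an ASCII character such as a space (0x20). */
bool all_high_bit_set(const unsigned char *bytes, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        if ((bytes[i] & 0x80) == 0)
            return false;
    }
    return true;
}
```

The UTF-8 encoding of U+03C5 from this thread, the bytes 0xCF 0x85, passes the check. A sequence of the form 0xXX 0x20 would fail it, which is why such an encoding would not count as server-safe.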
Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Steven Schlansker ste...@trumpet.io writes:
> I'm having a rather annoying problem - a particular string is causing the Postgres COPY functionality to lose a byte, causing data corruption in backups and transferred data.

I was able to reproduce this on my own Mac. Some tracing shows that the problem is that isspace(0x85) returns true when in locale en_US.utf-8. This causes array_in to drop the final byte of the array element string, thinking that it's insignificant whitespace.

I believe that you must not have produced the data file data.copy on a Mac, or at least not in that locale setting, because array_out should have double-quoted the array element given that behavior of isspace().

Now, it's probably less than sane for isspace() to be behaving that way, since in a UTF8-based locale 0x85 can't be regarded as a valid character code at all. But I'm not hopeful about the results of filing a bug with Apple, because their UTF8-based locales have a lot of other bu^H^Hdubious behaviors too, which they appear not to care much about.

In any case, stepping back and taking a larger viewpoint, it seems unsafe for array_in/array_out to have any locale-sensitive behavior anyhow. As an example, in a LATINx locale it is entirely sane for isspace() to return true for 0xA0, while it should certainly not do so in C locale. This means we are at risk of data corruption, ie dropping a valid data character, when an array value starting or ending with 0xA0 is dumped from a C-locale database and loaded into a LATINx-locale one.

So it seems like the safest answer is to modify array_in/array_out to use an ASCII-only definition of isspace(). I believe this is traditionally defined as space, tab, CR, LF, VT, FF. We could perhaps trim that further, like just space and tab, but there might be some risk of breaking client code that expects the other traditional whitespace to be ignored.

I'm not sure if there are any other places with similar risks.
hstore's I/O routines contain isspace calls, but I haven't analyzed the implications. There is an isspace call in record_out but it is just there for cosmetic purposes and doesn't protect any decisions in record_in, so I think it's okay if it makes platform/locale-dependent choices.

Comments?

regards, tom lane
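The ASCII-only definition of whitespace proposed above (space, tab, CR, LF, VT, FF) is small enough to sketch directly. The code below is an illustration under assumptions, not the committed patch: `ascii_isspace()` and `trimmed_length()` are hypothetical names chosen for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Locale-independent whitespace test: only the six traditional ASCII
 * whitespace characters count, whatever setlocale() has selected. */
bool ascii_isspace(char ch)
{
    return ch == ' ' || ch == '\t' || ch == '\n' ||
           ch == '\r' || ch == '\v' || ch == '\f';
}

/* Length of the first len bytes of s once trailing ASCII whitespace is
 * ignored. Bytes with the high bit set are never stripped. */
size_t trimmed_length(const char *s, size_t len)
{
    while (len > 0 && ascii_isspace(s[len - 1]))
        len--;
    return len;
}
```

Because the test never consults the locale, an element ending in 0x85 (the tail of a UTF-8 sequence) or 0xA0 (a LATINx no-break space) keeps its final byte on every platform, which is the property a dump and reload needs.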
Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence
Steven Schlansker ste...@trumpet.io writes:
> On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
>> I was able to reproduce this on my own Mac. Some tracing shows that the problem is that isspace(0x85) returns true when in locale en_US.utf-8. This causes array_in to drop the final byte of the array element string, thinking that it's insignificant whitespace.
>
> The 0x85 seems to be the second byte of a multibyte UTF-8 sequence.

Check.

> I'm not at all experienced with character encodings so I could be totally off base, but isn't it wrong to ever call isspace(0x85), whatever the result may be, given that the actual character is 0xCF85? (U+03C5, GREEK SMALL LETTER UPSILON)

We generally assume that in server-safe encodings, the ctype.h functions will behave sanely on any single-byte value. You can argue the wisdom of that, but deciding to change that policy would be a rather massive code change; I'm not excited about going that direction.

>> I believe that you must not have produced the data file data.copy on a Mac, or at least not in that locale setting, because array_out should have double-quoted the array element given that behavior of isspace().
>
> Correct, it was produced on a Linux machine. That said, the charset there was also UTF-8.

Right ... but you had an isspace function that meets our expectations.

> I actually can't reproduce that behavior here:

You need a setlocale() call, else the program acts as though it's in C locale regardless of environment.
My test case looks like this:

    $ cat isspace.c
    #include <stdio.h>
    #include <ctype.h>
    #include <locale.h>

    int main()
    {
        int c;

        setlocale(LC_ALL, "");

        for (c = 1; c < 256; c++)
        {
            if (isspace(c))
                printf("%3o is space\n", c);
        }

        return 0;
    }
    $ gcc -O -Wall isspace.c
    $ LANG=C ./a.out
     11 is space
     12 is space
     13 is space
     14 is space
     15 is space
     40 is space
    $ LANG=en_US.utf-8 ./a.out
     11 is space
     12 is space
     13 is space
     14 is space
     15 is space
     40 is space
    205 is space
    240 is space
    $

regards, tom lane