Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-21 Thread Tom Lane
Steven Schlansker writes:
> Anyway, it looks like this is actually a BSD bug which got copy +
> pasted into Apple's Darwin source -
> http://lists.freebsd.org/pipermail/freebsd-i18n/2007-September/000157.html

I've applied a patch for this to HEAD & 9.0: http://archives.postgresql.org/pgsql-commit

Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-20 Thread Tom Lane
Steven Schlansker writes:
> On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
>> We generally assume that in server-safe encodings, the ctype.h functions
>> will behave sanely on any single-byte value. You can argue the wisdom
>> of that, but deciding to change that policy would be a rather massive
>>

Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-20 Thread Tom Lane
Tatsuo Ishii writes:
>> We generally assume that in server-safe encodings, the ctype.h functions
>> will behave sanely on any single-byte value.
> I think this "wisedom" is only true for C locale. I'm not surprised
> all that it does not work with non C locales.
> From array_funcs.c:
>

Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-19 Thread Steven Schlansker
On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
> Steven Schlansker writes:
>> I'm having a rather annoying problem - a particular string is causing the
>> Postgres COPY functionality to lose a byte, causing data corruption in
>> backups and transferred data.
>
> I was able to reproduce this on m

Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-19 Thread Steven Schlansker
On Aug 19, 2010, at 3:24 PM, Tom Lane wrote:
> Steven Schlansker writes:
>>
>> I'm not at all experienced with character encodings so I could
>> be totally off base, but isn't it wrong to ever call isspace(0x85),
>> whatever the result may be, given that the actual character is 0xCF85?
>> (U+03

Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-19 Thread Tom Lane
Steven Schlansker writes:
> On Aug 19, 2010, at 2:35 PM, Tom Lane wrote:
>> I was able to reproduce this on my own Mac. Some tracing shows that the
>> problem is that isspace(0x85) returns true when in locale en_US.utf-8.
>> This causes array_in to drop the final byte of the array element string,

Re: [HACKERS] [BUGS] COPY FROM/TO losing a single byte of a multibyte UTF-8 sequence

2010-08-19 Thread Tom Lane
Steven Schlansker writes:
> I'm having a rather annoying problem - a particular string is causing the
> Postgres COPY functionality to lose a byte, causing data corruption in
> backups and transferred data.

I was able to reproduce this on my own Mac. Some tracing shows that the problem is that
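
The diagnosis quoted in this thread is that isspace(0x85) returns true in locale en_US.utf-8, so a trailing-whitespace trim drops the last byte of a multibyte character. A minimal sketch of that failure mode follows, in Python rather than C so it behaves the same on any platform. It is not PostgreSQL's array_in code: the names ascii_isspace, misbehaving_isspace, and trim_trailing are hypothetical, and misbehaving_isspace only simulates the reported platform behaviour by hard-coding 0x85 as whitespace.

```python
# Byte-wise trailing-whitespace trimming, with the "is this byte a space?"
# test passed in so that two behaviours can be compared side by side.

ASCII_SPACE = b" \t\n\r\v\f"


def ascii_isspace(byte: int) -> bool:
    """Accept only the six ASCII whitespace bytes."""
    return byte in ASCII_SPACE


def misbehaving_isspace(byte: int) -> bool:
    """Hypothetical stand-in for the reported behaviour: 0x85 counts as space."""
    return byte in ASCII_SPACE or byte == 0x85


def trim_trailing(data: bytes, isspace) -> bytes:
    """Drop trailing bytes for which isspace() is true, one byte at a time."""
    end = len(data)
    while end > 0 and isspace(data[end - 1]):
        end -= 1
    return data[:end]


# 0xCF 0x85 is the two-byte UTF-8 sequence discussed in the thread.
element = b"\xcf\x85"

kept = trim_trailing(element, ascii_isspace)           # both bytes survive
damaged = trim_trailing(element, misbehaving_isspace)  # the 0x85 byte is lost
```

With the ASCII-only test the element is returned unchanged. With the simulated test only the lead byte 0xCF remains, which is no longer valid UTF-8 on its own, matching the single lost byte described in the report.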