On Sun, Nov 6, 2011 at 1:18 PM, Florian Pflug <f...@phlo.org> wrote:
>
> What's the locale of the database you're seeing this in, and which charset
> does it use?
>
> I think scanf() uses isspace() and friends, and last time I looked the
> locale definitions where all pretty bogus on OSX. So maybe scanf() somehow
> decides that 0xA0 is whitespace.
>

Ah, that does it: the locale I was using in the test code was just the plain ol' C locale, while in the database it was en_CA.UTF-8. Changing the locale in my test code reproduced the wonky results. I should have figured it was a locale problem.

Sure enough, in a UTF-8 locale, sscanf treats both 0xa0 and 0x85 as spaces. Pretty wonky behaviour indeed. Apparently this is a known OSX issue that has its roots in an older version of FreeBSD's libc, I guess, eh? I've found various bug reports that allude to the problem and they all seem to point that way.

I've attached a patch against master for unaccent.c that uses swscanf along with char2wchar and wchar2char, instead of calling sscanf directly, to initialize the unaccent extension. It appears to fix the problem in both the master and 9.1 branches.

I haven't added any tests to the expected output file 'cause I'm not exactly sure what I should be testing against, but I could take a crack at that, too, if the patch looks reasonable and is usable.

Cheers.
0001-Fix-weirdness-when-dealing-with-UTF-8-in-buggy-libc-.patch
Description: Binary data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers