On Nov 6, 2011, at 18:43, J Smith wrote:
> I put some elog debugging lines into unaccent.c and found that sscanf
> sometimes reads the scanned line by finding only one byte for the
> source character rather than the two required for the complete
> UTF-8 code point. It appears that the following characters are causing
> the problem, along with the code points and such:
>
> 'Å' => 'A' | c3,85 => 41
> 'à' => 'a' | c3,a0 => 61
> 'ą' => 'a' | c4,85 => 61
> 'Ġ' => 'G' | c4,a0 => 47
> 'Ņ' => 'N' | c5,85 => 4e
> 'Š' => 'S' | c5,a0 => 53
>
> In each case, one byte was being read in the source string rather than
> two, leading to the "duplicate TO" warnings above. This later leads to
> the characters that produced the warning being ignored when unaccent
> is called and left in the output.

What's the locale of the database you're seeing this in, and which charset
does it use? I think scanf() uses isspace() and friends, and last time I
looked the locale definitions were all pretty bogus on OS X. So maybe scanf()
somehow decides that 0xA0 is whitespace.

> I haven't been able to reproduce in a smaller example, and haven't
> been able to reproduce on a CentOS server, so at this point I'm at a
> loss as to the problem.

Have you tried to set the same locale as postgres (using setlocale()) in your
tests?

best regards,
Florian Pflug
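
A minimal C sketch of the test suggested above, assuming the rules file is parsed with a %s-style sscanf() conversion, which stops at whatever isspace() reports as whitespace in the current LC_CTYPE locale. The helper names byte_is_space_in() and first_token_len() are hypothetical and not part of unaccent.c. The second byte of every character in the list is either 0x85 or 0xA0, so those are the bytes worth probing.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns 1 if isspace() classifies byte b as
 * whitespace under the given LC_CTYPE locale, 0 if it does not, and
 * -1 if the locale cannot be set.
 * Note: this changes the process-wide LC_CTYPE locale and does not
 * restore it.
 */
int byte_is_space_in(const char *locale_name, unsigned char b)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return isspace(b) ? 1 : 0;
}

/*
 * Hypothetical helper: returns the length in bytes of the first token
 * that sscanf("%s") reads from line under the given locale, or -1 if
 * the locale cannot be set or no token was read. A result of 1 for a
 * line starting with a two-byte UTF-8 character means the scan stopped
 * in the middle of that character.
 */
int first_token_len(const char *locale_name, const char *line)
{
    char src[64];

    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    if (sscanf(line, "%63s", src) != 1)
        return -1;
    return (int) strlen(src);
}
```

Calling these with the locale name the database actually uses (the name is platform-specific, so it has to be filled in by hand) and comparing against "C" should show whether the locale definition is responsible. In the "C" locale neither 0x85 nor 0xA0 counts as whitespace, so first_token_len("C", "\xc3\xa0\ta") reads the two-byte source character whole.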