Re: Remaining dependency on setlocale()

Peter Eisentraut Wed, 17 Dec 2025 02:39:25 -0800

On 12.12.25 21:11, Jeff Davis wrote:

case '\xc7':        /* C with cedilla */


so the premise that "fuzzystrmatch is designed for ASCII" does not
appear to be correct.  Needs more analysis.

(But apparently it's not multibyte aware at all, so I don't know what
to do about that.)

I didn't notice that, thank you. Agreed, we need a bit more discussion
around this case as well as soundex().

Soundex is an ASCII-only algorithm; there is no expectation that the algorithm does anything useful with non-ASCII characters, and it doesn't do so now. So I think using pg_ascii_toupper() is ok. (Users could for example use unaccent to preprocess text.)

One might wonder if the presence of non-ASCII characters should be an error, but that doesn't have to be the subject of this thread. I noticed that the Wikipedia page for Soundex even calls out PostgreSQL for doing things slightly differently than everyone else, but I haven't studied the details.

For Metaphone, I found the reference implementation linked from its Wikipedia page, and it looks like our implementation is pretty closely aligned to that. That reference implementation also contains the C-with-cedilla case explicitly. The correct fix here would probably be to change the implementation to work on wide characters. But I think for the moment you could try a shortcut: use pg_ascii_toupper(), but if the encoding is LATIN1 (or LATIN9 or whichever other encodings also contain C-with-cedilla at that code point), then explicitly uppercase that one character as well. This would preserve the existing behavior.
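
To make the shortcut concrete, here is a rough standalone sketch, not a patch. ascii_toupper() is a local stand-in for pg_ascii_toupper(), and metaphone_toupper() and its cedilla_at_c7 flag are hypothetical names; in the real code the flag would presumably be derived from the database encoding. It assumes the lowercase c-with-cedilla sits at 0xE7 and the uppercase one at 0xC7, as in LATIN1 and LATIN9.

```c
#include <stdbool.h>

/*
 * Local stand-in for pg_ascii_toupper(): folds only a-z and leaves
 * every other byte, including all high-bit bytes, unchanged.
 */
static unsigned char
ascii_toupper(unsigned char ch)
{
    if (ch >= 'a' && ch <= 'z')
        ch = (unsigned char) (ch - 'a' + 'A');
    return ch;
}

/*
 * Hypothetical helper: ASCII-only uppercasing, plus the single
 * C-with-cedilla special case.  The caller sets cedilla_at_c7 only
 * when the encoding is a single-byte one that places the lowercase
 * c-with-cedilla at 0xE7 and the uppercase one at 0xC7.
 */
static unsigned char
metaphone_toupper(unsigned char ch, bool cedilla_at_c7)
{
    if (cedilla_at_c7 && ch == 0xE7)
        return 0xC7;
    return ascii_toupper(ch);
}
```

With the flag set, a 0xE7 byte still reaches the existing case '\xc7' branch; with it unset, high-bit bytes pass through untouched, so multibyte sequences are not altered by the uppercasing step.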

Note that the documentation calls out: "At present, the soundex, metaphone, dmetaphone, and dmetaphone_alt functions do not work well with multibyte encodings (such as UTF-8)."
