On 23.12.25 21:09, Jeff Davis wrote:
On Wed, 2025-12-17 at 11:39 +0100, Peter Eisentraut wrote:
For Metaphone, I found the reference implementation linked from its
Wikipedia page, and it looks like our implementation is pretty closely
aligned to that.  That reference implementation also contains the
C-with-cedilla case explicitly.  The correct fix here would probably be
to change the implementation to work on wide characters.  But I think
for the moment you could try a shortcut like, use pg_ascii_toupper(),
but if the encoding is LATIN1 (or LATIN9 or whichever other encodings
also contain C-with-cedilla at that code point), then explicitly
uppercase that one as well.  This would preserve the existing behavior.

Done, attached new patches.

Interestingly, WIN1256 encodes only the SMALL LETTER C WITH CEDILLA. I
think, for the purposes here, we can still consider it to "uppercase"
to \xc7, so that it can still be treated as the same sound. Technically
I think that would be an improvement over the current code in this edge
case, and suggests that case folding would be a better approach than
uppercasing.
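
For illustration, the shortcut discussed above (ASCII-only uppercasing
plus an explicit mapping of \xe7 to \xc7 in encodings that carry
C-with-cedilla at those code points) might look roughly like the sketch
below. This is not the code from the attached patches. The names
upper_ascii_plus_cedilla() and make_upper_sketch() and the
has_c_cedilla flag are hypothetical, and the ASCII branch only stands
in for what pg_ascii_toupper() would do in the real tree.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: uppercase one byte using ASCII rules only, and
 * additionally map \xe7 to \xc7 when the caller says the encoding has
 * C-with-cedilla at that code point (an assumption the caller must verify).
 */
static unsigned char
upper_ascii_plus_cedilla(unsigned char ch, bool has_c_cedilla)
{
    /* plain ASCII letters: locale-independent uppercasing */
    if (ch >= 'a' && ch <= 'z')
        return (unsigned char) (ch - ('a' - 'A'));

    /* treat small c-with-cedilla as the same sound as the capital */
    if (has_c_cedilla && ch == 0xE7)
        return 0xC7;

    /* everything else is left untouched */
    return ch;
}

/* Hypothetical in-place variant of MakeUpper() for a NUL-terminated string. */
static void
make_upper_sketch(char *s, bool has_c_cedilla)
{
    for (; *s; s++)
        *s = (char) upper_ascii_plus_cedilla((unsigned char) *s,
                                             has_c_cedilla);
}
```

With has_c_cedilla set, a byte string such as "gar\xe7on" would become
"GAR\xc7ON"; with it unset, the \xe7 byte passes through unchanged.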

On further reflection, it seems just as easy to have dmetaphone() take
the input collation and use that to do a proper collation-aware
upper-casing.  This has the same effect (that is, it will still only
support certain single-byte encodings), but it avoids elaborately
hard-coding a bunch of things, and if we ever want to make this
multibyte-aware, then we'll have to go this way anyway, I think.  See
attached patch.

From cd7fa005a286c9c4f4a27e8c61ca15787ee80bbd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <[email protected]>
Date: Tue, 6 Jan 2026 20:45:33 +0100
Subject: [PATCH] Make dmetaphone collation-aware

---
 contrib/fuzzystrmatch/dmetaphone.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/contrib/fuzzystrmatch/dmetaphone.c b/contrib/fuzzystrmatch/dmetaphone.c
index 227d8b11ddc..062667527c2 100644
--- a/contrib/fuzzystrmatch/dmetaphone.c
+++ b/contrib/fuzzystrmatch/dmetaphone.c
@@ -99,6 +99,7 @@ The remaining code is authored by Andrew Dunstan <[email protected]> and
 #include "postgres.h"
 
 #include "utils/builtins.h"
+#include "utils/formatting.h"
 
 /* turn off assertions for embedded function */
 #define NDEBUG
@@ -117,7 +118,7 @@ The remaining code is authored by Andrew Dunstan <[email protected]> and
 #include <ctype.h>
 
 /* prototype for the main function we got from the perl module */
-static void DoubleMetaphone(char *str, char **codes);
+static void DoubleMetaphone(const char *str, Oid collid, char **codes);
 
 #ifndef DMETAPHONE_MAIN
 
@@ -142,7 +143,7 @@ dmetaphone(PG_FUNCTION_ARGS)
        arg = PG_GETARG_TEXT_PP(0);
        aptr = text_to_cstring(arg);
 
-       DoubleMetaphone(aptr, codes);
+       DoubleMetaphone(aptr, PG_GET_COLLATION(), codes);
        code = codes[0];
        if (!code)
                code = "";
@@ -171,7 +172,7 @@ dmetaphone_alt(PG_FUNCTION_ARGS)
        arg = PG_GETARG_TEXT_PP(0);
        aptr = text_to_cstring(arg);
 
-       DoubleMetaphone(aptr, codes);
+       DoubleMetaphone(aptr, PG_GET_COLLATION(), codes);
        code = codes[1];
        if (!code)
                code = "";
@@ -278,13 +279,17 @@ IncreaseBuffer(metastring *s, int chars_needed)
 }
 
 
-static void
-MakeUpper(metastring *s)
+static metastring *
+MakeUpper(metastring *s, Oid collid)
 {
-       char       *i;
+       char       *newstr;
+       metastring *newms;
+
+       newstr = str_toupper(s->str, s->length, collid);
+       newms = NewMetaString(newstr);
+       DestroyMetaString(s);
 
-       for (i = s->str; *i; i++)
-               *i = toupper((unsigned char) *i);
+       return newms;
 }
 
 
@@ -392,7 +397,7 @@ MetaphAdd(metastring *s, const char *new_str)
 
 
 static void
-DoubleMetaphone(char *str, char **codes)
+DoubleMetaphone(const char *str, Oid collid, char **codes)
 {
        int                     length;
        metastring *original;
@@ -414,7 +419,7 @@ DoubleMetaphone(char *str, char **codes)
        primary->free_string_on_destroy = 0;
        secondary->free_string_on_destroy = 0;
 
-       MakeUpper(original);
+       original = MakeUpper(original, collid);
 
        /* skip these when at start of word */
        if (StringAt(original, 0, 2, "GN", "KN", "PN", "WR", "PS", ""))
@@ -1430,7 +1435,7 @@ main(int argc, char **argv)
 
        if (argc > 1)
        {
-               DoubleMetaphone(argv[1], codes);
+               DoubleMetaphone(argv[1], DEFAULT_COLLATION_OID, codes);
                printf("%s|%s\n", codes[0], codes[1]);
        }
 }
-- 
2.52.0
