On 23.12.25 21:09, Jeff Davis wrote:
> On Wed, 2025-12-17 at 11:39 +0100, Peter Eisentraut wrote:
>> For Metaphone, I found the reference implementation linked from its
>> Wikipedia page, and it looks like our implementation is pretty closely
>> aligned to that. That reference implementation also contains the
>> C-with-cedilla case explicitly. The correct fix here would probably be
>> to change the implementation to work on wide characters. But I think
>> for the moment you could try a shortcut like, use pg_ascii_toupper(),
>> but if the encoding is LATIN1 (or LATIN9 or whichever other encodings
>> also contain C-with-cedilla at that code point), then explicitly
>> uppercase that one as well. This would preserve the existing
>> behavior.
>
> Done, attached new patches.
>
> Interestingly, WIN1256 encodes only the SMALL LETTER C WITH CEDILLA. I
> think, for the purposes here, we can still consider it to "uppercase"
> to \xc7, so that it can still be treated as the same sound. Technically
> I think that would be an improvement over the current code in this edge
> case, and suggests that case folding would be a better approach than
> uppercasing.

On further reflection, it seems just as easy to have dmetaphone() take
the input collation and use that to do a proper collation-aware
upper-casing. This has the same effect (that is, it will still only
support certain single-byte encodings), but it avoids elaborately
hard-coding a bunch of things, and if we ever want to make this
multibyte-aware, then we'll have to go this way anyway, I think. See
attached patch.
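
For comparison, here is a minimal standalone sketch of the shortcut
discussed upthread (ASCII-only uppercasing plus an explicit
C-with-cedilla case). It is not part of the patch:
make_upper_sketch(), ascii_toupper_sketch(), and the has_c_cedilla flag
are hypothetical names, and the flag stands in for an encoding check
that is assumed here rather than shown.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: uppercase ASCII letters only and leave every
 * other byte untouched, independent of the C library locale.
 */
static unsigned char
ascii_toupper_sketch(unsigned char ch)
{
	if (ch >= 'a' && ch <= 'z')
		ch -= 'a' - 'A';
	return ch;
}

/*
 * Hypothetical in-place uppercasing for a single-byte encoded string.
 *
 * has_c_cedilla is an assumption of this sketch: the caller sets it when
 * the encoding places SMALL LETTER C WITH CEDILLA at byte 0xE7 (LATIN1,
 * for example).  That byte is then mapped to 0xC7 so that both cases are
 * treated as the same sound.
 */
static void
make_upper_sketch(char *s, bool has_c_cedilla)
{
	unsigned char *p;

	for (p = (unsigned char *) s; *p; p++)
	{
		if (has_c_cedilla && *p == 0xE7)
			*p = 0xC7;
		else
			*p = ascii_toupper_sketch(*p);
	}
}
```

With the flag set, the bytes of "gar\xe7on" become "GAR\xc7ON"; with it
unset, the 0xE7 byte is left alone. Every additional encoding needs its
own hard-coded case in this style, which is what the collation-aware
version avoids.
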
From cd7fa005a286c9c4f4a27e8c61ca15787ee80bbd Mon Sep 17 00:00:00 2001
From: Peter Eisentraut <[email protected]>
Date: Tue, 6 Jan 2026 20:45:33 +0100
Subject: [PATCH] Make dmetaphone collation-aware
---
contrib/fuzzystrmatch/dmetaphone.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/contrib/fuzzystrmatch/dmetaphone.c b/contrib/fuzzystrmatch/dmetaphone.c
index 227d8b11ddc..062667527c2 100644
--- a/contrib/fuzzystrmatch/dmetaphone.c
+++ b/contrib/fuzzystrmatch/dmetaphone.c
@@ -99,6 +99,7 @@ The remaining code is authored by Andrew Dunstan <[email protected]> and
#include "postgres.h"
#include "utils/builtins.h"
+#include "utils/formatting.h"
/* turn off assertions for embedded function */
#define NDEBUG
@@ -117,7 +118,7 @@ The remaining code is authored by Andrew Dunstan <[email protected]> and
#include <ctype.h>
/* prototype for the main function we got from the perl module */
-static void DoubleMetaphone(char *str, char **codes);
+static void DoubleMetaphone(const char *str, Oid collid, char **codes);
#ifndef DMETAPHONE_MAIN
@@ -142,7 +143,7 @@ dmetaphone(PG_FUNCTION_ARGS)
arg = PG_GETARG_TEXT_PP(0);
aptr = text_to_cstring(arg);
- DoubleMetaphone(aptr, codes);
+ DoubleMetaphone(aptr, PG_GET_COLLATION(), codes);
code = codes[0];
if (!code)
code = "";
@@ -171,7 +172,7 @@ dmetaphone_alt(PG_FUNCTION_ARGS)
arg = PG_GETARG_TEXT_PP(0);
aptr = text_to_cstring(arg);
- DoubleMetaphone(aptr, codes);
+ DoubleMetaphone(aptr, PG_GET_COLLATION(), codes);
code = codes[1];
if (!code)
code = "";
@@ -278,13 +279,17 @@ IncreaseBuffer(metastring *s, int chars_needed)
}
-static void
-MakeUpper(metastring *s)
+static metastring *
+MakeUpper(metastring *s, Oid collid)
{
- char *i;
+ char *newstr;
+ metastring *newms;
+
+ newstr = str_toupper(s->str, s->length, collid);
+ newms = NewMetaString(newstr);
+ DestroyMetaString(s);
- for (i = s->str; *i; i++)
- *i = toupper((unsigned char) *i);
+ return newms;
}
@@ -392,7 +397,7 @@ MetaphAdd(metastring *s, const char *new_str)
static void
-DoubleMetaphone(char *str, char **codes)
+DoubleMetaphone(const char *str, Oid collid, char **codes)
{
int length;
metastring *original;
@@ -414,7 +419,7 @@ DoubleMetaphone(char *str, char **codes)
primary->free_string_on_destroy = 0;
secondary->free_string_on_destroy = 0;
- MakeUpper(original);
+ original = MakeUpper(original, collid);
/* skip these when at start of word */
if (StringAt(original, 0, 2, "GN", "KN", "PN", "WR", "PS", ""))
@@ -1430,7 +1435,7 @@ main(int argc, char **argv)
if (argc > 1)
{
- DoubleMetaphone(argv[1], codes);
+ DoubleMetaphone(argv[1], DEFAULT_COLLATION_OID, codes);
printf("%s|%s\n", codes[0], codes[1]);
}
}
--
2.52.0