Hi,
Currently, the unaccent extension only allows replacing one source character
with one or more target characters. In Arabic, Hebrew and possibly other
languages, diacritics are standalone characters that are added to
base letters. To use the unaccent dictionary for these languages, we need
to allow empty targets, so that diacritics are removed rather than replaced.
The attached patch modifies unaccent.c so that the dictionary parser uses
a zero-length target when a line has no target.
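To illustrate the intended rule format outside of the patch itself, here is a
minimal standalone sketch of parsing a "src [trg]" line. It is not the code in
unaccent.c: the Rule struct and parse_rule() are hypothetical names used only
for this example, and it uses plain isspace() on bytes instead of the
multibyte-aware helpers the real parser relies on.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper type, not part of unaccent.c. */
typedef struct
{
	const char *src;		/* start of the source token inside the line */
	size_t		srclen;		/* length of the source token in bytes */
	const char *trg;		/* start of the target token, NULL if absent */
	size_t		trglen;		/* 0 when the line has no target */
} Rule;

/*
 * Parse one rule line of the form "src [trg]".
 * Returns 1 on success, 0 if the line is blank or has more than two tokens.
 * A line with only a source token yields a zero-length target.
 */
static int
parse_rule(const char *line, Rule *rule)
{
	const char *p = line;

	assert(line != NULL && rule != NULL);
	memset(rule, 0, sizeof(*rule));

	/* skip leading whitespace */
	while (*p && isspace((unsigned char) *p))
		p++;
	if (*p == '\0')
		return 0;				/* blank line: nothing to parse */

	/* source token */
	rule->src = p;
	while (*p && !isspace((unsigned char) *p))
		p++;
	rule->srclen = (size_t) (p - rule->src);

	/* whitespace between source and optional target */
	while (*p && isspace((unsigned char) *p))
		p++;
	if (*p == '\0')
		return 1;				/* no target: keep trglen == 0 */

	/* target token */
	rule->trg = p;
	while (*p && !isspace((unsigned char) *p))
		p++;
	rule->trglen = (size_t) (p - rule->trg);

	/* only trailing whitespace may follow */
	while (*p && isspace((unsigned char) *p))
		p++;
	return *p == '\0';
}
```

With this sketch, a line holding only a combining mark (for example the two
bytes of U+064E ARABIC FATHA) parses with srclen = 2 and trglen = 0, which is
the "remove this character" case the patch is meant to enable.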
Best Regards,
Mohammad Alhashash
diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
old mode 100644
new mode 100755
index a337df6..4e72829
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r
{
curnode->replacelen = replacelen;
curnode->replaceTo = palloc(replacelen);
- memcpy(curnode->replaceTo, replaceTo, replacelen);
+ /* palloc(0) returns a valid address, not NULL */
+ if (replaceTo) /* memcpy() is undefined for NULL pointers */
+ memcpy(curnode->replaceTo, replaceTo, replacelen);
}
}
else
@@ -105,10 +107,10 @@ initTrie(char *filename)
while ((line = tsearch_readline(&trst)) != NULL)
{
/*
- * The format of each line must be "src trg" where src and trg
+ * The format of each line must be "src [trg]" where src and trg
* are sequences of one or more non-whitespace characters,
* separated by whitespace. Whitespace at start or end of
- * line is ignored.
+ * line is ignored. If no trg added, a zero-length string is used.
*/
int state;
char *ptr;
@@ -160,6 +162,13 @@ initTrie(char *filename)
}
}
+ /* if no trg (loop stops at state 1 or 2), use zero-length target */
+ if (state == 1 || state == 2)
+ {
+ trglen = 0;
+ state = 5;
+ }
+
if (state >= 3)
rootTrie = placeChar(rootTrie,
(unsigned char *) src, srclen,
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers