Hi,

Currently, the unaccent extension only allows replacing one source character with one or more target characters. In Arabic, Hebrew and possibly other languages, diacritics are standalone characters that are added to base letters. To use the unaccent dictionary for these languages, we need to allow empty targets, so that diacritics are removed instead of replaced.

The attached patch modifies unaccent.c so that the dictionary parser uses a zero-length target when a line has no target.
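
To illustrate the intended parsing rule outside the PostgreSQL source tree, here is a minimal standalone sketch. The function parse_rule_line() is hypothetical and is not part of unaccent.c; it only mirrors the "src [trg]" format described in the patch comment, and it assumes that a line with a third token is rejected.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of unaccent.c: split one rules line of the
 * form "src [trg]" into its tokens.  Returns 1 for a usable rule and 0 for
 * a blank line or a line with extra tokens.  When trg is missing, *trglen
 * is set to 0, which stands for "remove the source character".
 */
int
parse_rule_line(const char *line,
                const char **src, int *srclen,
                const char **trg, int *trglen)
{
    const char *p = line;

    assert(line != NULL);

    /* skip leading whitespace */
    while (*p && isspace((unsigned char) *p))
        p++;
    if (*p == '\0')
        return 0;               /* blank line: no rule */

    /* source token */
    *src = p;
    while (*p && !isspace((unsigned char) *p))
        p++;
    *srclen = (int) (p - *src);

    /* whitespace between src and the optional trg */
    while (*p && isspace((unsigned char) *p))
        p++;

    /* target token; zero-length if the line ends here */
    *trg = p;
    while (*p && !isspace((unsigned char) *p))
        p++;
    *trglen = (int) (p - *trg);

    /* trailing whitespace is ignored; anything else is a syntax error */
    while (*p && isspace((unsigned char) *p))
        p++;
    return *p == '\0';
}
```

With this sketch, a line holding only a combining diacritic yields srclen > 0 and trglen == 0, which is the case the patch maps to an empty replacement.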

Best Regards,

Mohammad Alhashash

diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
old mode 100644
new mode 100755
index a337df6..4e72829
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r
                {
                        curnode->replacelen = replacelen;
                        curnode->replaceTo = palloc(replacelen);
-                       memcpy(curnode->replaceTo, replaceTo, replacelen);
+                       /* palloc(0) returns a valid address, not NULL */
+                       if (replaceTo) /* memcpy() is undefined for NULL pointers*/
+                               memcpy(curnode->replaceTo, replaceTo, replacelen);
                }
        }
        else
@@ -105,10 +107,10 @@ initTrie(char *filename)
                        while ((line = tsearch_readline(&trst)) != NULL)
                        {
                                /*
-                                * The format of each line must be "src trg" where src and trg
+                                * The format of each line must be "src [trg]" where src and trg
                                 * are sequences of one or more non-whitespace characters,
                                 * separated by whitespace.  Whitespace at start or end of
-                                * line is ignored.
+                                * line is ignored. If no trg added, a zero-length string is used.
                                 */
                                int                     state;
                                char       *ptr;
@@ -160,6 +162,13 @@ initTrie(char *filename)
                                        }
                                }
 
+                               /* if no trg (loop stops at state 1 or 2), use zero-length target */
+                               if (state == 1 || state == 2)
+                               {
+                                       trglen = 0;
+                                       state = 5;
+                               }
+                               
                                if (state >= 3)
                                        rootTrie = placeChar(rootTrie,
                                                                                 (unsigned char *) src, srclen,