[HACKERS] PATCH: Allow empty targets in unaccent dictionary

Mohammad Alhashash Sat, 19 Apr 2014 16:08:18 -0700

Hi,

Currently, the unaccent extension only allows replacing one source character with one or more target characters. In Arabic, Hebrew, and possibly other languages, diacritics are standalone characters that are added to normal letters. To use the unaccent dictionary for these languages, we need to allow empty targets so that diacritics are removed instead of replaced.

The attached patch modifies unaccent.c so that the dictionary parser uses a zero-length target when a line has no target.
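
To illustrate the intended parsing behavior outside the PostgreSQL tree, here is a minimal standalone sketch of the state machine, assuming byte-wise scanning with isspace() in place of the multibyte-aware helpers that unaccent.c uses. The names Rule and parse_rule are hypothetical and exist only for this example; the state numbers mirror the ones referenced in the patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for one parsed rule line (not part of unaccent.c). */
typedef struct
{
    const char *src;     /* start of the source token inside the line */
    size_t      srclen;  /* length of the source token in bytes */
    const char *trg;     /* start of the target token, NULL if absent */
    size_t      trglen;  /* length of the target token, 0 if absent */
} Rule;

/*
 * Parse a line of the form "src [trg]".
 * Returns 1 if the line yields a usable rule, 0 otherwise.
 *
 * States: 0 = before src, 1 = in src, 2 = after src,
 *         3 = in trg, 4 = after trg, -1 = bogus line.
 * Assumption: whitespace is detected per byte, which is a
 * simplification of the multibyte handling in the real parser.
 */
int
parse_rule(const char *line, Rule *rule)
{
    int         state = 0;
    const char *ptr;

    assert(line != NULL && rule != NULL);
    memset(rule, 0, sizeof(*rule));

    for (ptr = line; *ptr; ptr++)
    {
        /* whitespace is skipped, but it ends src or trg */
        if (isspace((unsigned char) *ptr))
        {
            if (state == 1)
                state = 2;
            else if (state == 3)
                state = 4;
            continue;
        }

        switch (state)
        {
            case 0:             /* start of src */
                rule->src = ptr;
                rule->srclen = 1;
                state = 1;
                break;
            case 1:             /* continue src */
                rule->srclen++;
                break;
            case 2:             /* start of trg */
                rule->trg = ptr;
                rule->trglen = 1;
                state = 3;
                break;
            case 3:             /* continue trg */
                rule->trglen++;
                break;
            default:            /* a third token makes the line bogus */
                state = -1;
                break;
        }
    }

    /* no trg: the loop stopped at state 1 or 2, so use a zero-length target */
    if (state == 1 || state == 2)
    {
        rule->trg = NULL;
        rule->trglen = 0;
        state = 5;
    }

    return state >= 3;
}
```

With this sketch, a line holding only a combining mark (for example the two UTF-8 bytes of ARABIC FATHA, U+064E) is accepted with trglen == 0, while an empty line or a line with three tokens is still rejected.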


Best Regards,

Mohammad Alhashash

diff --git a/contrib/unaccent/unaccent.c b/contrib/unaccent/unaccent.c
old mode 100644
new mode 100755
index a337df6..4e72829
--- a/contrib/unaccent/unaccent.c
+++ b/contrib/unaccent/unaccent.c
@@ -58,7 +58,9 @@ placeChar(TrieChar *node, unsigned char *str, int lenstr, char *replaceTo, int r
                {
                        curnode->replacelen = replacelen;
                        curnode->replaceTo = palloc(replacelen);
-                       memcpy(curnode->replaceTo, replaceTo, replacelen);
+                       /* palloc(0) returns a valid address, not NULL */
+                       if (replaceTo) /* memcpy() is undefined for NULL pointers*/
+                               memcpy(curnode->replaceTo, replaceTo, replacelen);
                }
        }
        else
@@ -105,10 +107,10 @@ initTrie(char *filename)
                        while ((line = tsearch_readline(&trst)) != NULL)
                        {
                                /*
-                                * The format of each line must be "src trg" where src and trg
+                                * The format of each line must be "src [trg]" where src and trg
                                 * are sequences of one or more non-whitespace characters,
                                 * separated by whitespace.  Whitespace at start or end of
-                                * line is ignored.
+                                * line is ignored. If no trg added, a zero-length string is used.
                                 */
                                int                     state;
                                char       *ptr;
@@ -160,6 +162,13 @@ initTrie(char *filename)
                                        }
                                }
 
+                               /* if no trg (loop stops at state 1 or 2), use zero-length target */
+                               if (state == 1 || state == 2)
+                               {
+                                       trglen = 0;
+                                       state = 5;
+                               }
+                               
                                if (state >= 3)
                                        rootTrie = placeChar(rootTrie,
                                                                                 (unsigned char *) src, srclen,

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
