Just to recap the reason for the change to the endings algorithm in
version 3.1.3 back in 1999, Steve Arlow wrote:
> I do consulting for a number of law firms, and quickly discovered a
> problem with htfuzzy matching on the word "witness".  (There are 
> three root words in the distribution dictionary that end in "-ness"
> and also certainly exhibit this problem; the other two are
> "highness" and "likeness".  Other words can also be argued about.)
> 
> The fix (which does not appear to break anything else AFAICT, but 
> may have a small effect on performance) is to add a preliminary check
> on root2word before trying word2root.  The code is below (from the
> file htdig-3.1.2/htfuzzy/Endings.cc), optimize it to your taste.

[The code Steve posted was the getWords() function as it currently stands.]

Later, in response to Geoff's request for more information, Steve posted
this:
> Words of the form XXXness which are not a form of the word XXX.  If I
> enter "witness" into htdig with matching for alternate endings enabled,
> it will look for "wit", "wits", or "witness".  What it should really be
> looking for is "witness", "witnessed", "witnessing", or "witnesses".
> 
> A similar problem might occur with other suffixes, but I can't think of
> an example off the top of my head.
> 
> The fix is to try to interpret each term as a root word before trying
> to interpret it as an alternate form.

The details are on the old bug tracking database at
http://cgi.htdig.org/cgi-bin/htdig3/resolved?id=560;user=guest;selectid=560

Now, on to the current thread, I wrote:
> According to Alexander I. Lebedev:
> > Gilles,
> > 
> > Thank you for your answer.
> 
> It was Geoff who answered you last time, but I'm sure you're welcome.  :-)
> 
> > >To quote from the documentation: (attrs.html#search_algorithm)
> > > Each word is first reduced to its word root and then all known legal
> > > endings are used for the matching.
> > >
> > >I think the bug basically comes up because there are some subset of
> > >permuations that are also root words. In Endings::getWords, if a word is
> > >already a root word, then it doesn't bother to check if it's also a
> > >permutation.
> > 
> > I'm afraid, the origin of the bug is different.  I tested your idea on
> > one indexed Russian site (26,000 documents) and found the same bug
> > in the case when the word I'm searching for is not a root itself (but
> > have two different roots).  So I guess, the program stops searching
> > when it finds the first occurence of the word, not all of them. (Indeed,
> > in Endings::getWords I don't see the loop that tests if there are other
> > roots.)
> > 
> > - Alexander
> 
> What Geoff is describing as a bug in the Endings algorithm is actually a
> deliberate change, submitted by Steve Arlow back in June 1999.  It was to
> prevent the -ness suffix from being stripped on words like witness, and then
> having the the word "wit" expanded with a number of inappropriate suffixes.
> That change was incorporated in version 3.1.3.  However, it does indeed
> appear to be the cause of the problem with the whole "skate" vs. "skater"
> test.
> 
> I checked htsearch from 3.1.2, and its unpatched endings algorithm produced
> the same results with "skate" or "skater", i.e.
> 
>    (skater or skate or skated or skating or skates or skaters)
> 
> So, either the problem you've run into with Russian words is different than
> the skater test, or going back to 3.1.2 will solve the problem for you.
> 
> In either case, I think Steve Arlow's patch is more far-reaching than any
> of us thought before.  It seems to me that this should be optional, unless
> we can find a smarter way of doing this.

Geoff wrote:
> I've been trying to come up with a smarter way, and I can't. So I'm
> suggesting something like endings_lookup_roots that will also look up
> root words. Even so, I don't know how useful it is--it seems like
> there have to be better ways of constructing the affix rules!

And finally, Alexander I. Lebedev wrote:
> Gilles and Geoff,
> 
> Thank you for your expalnation.  I've made the changes that "downgrade"
> the version (I've commented out the part of code that tests if the word
> is the root), and recompiled htdig.  The error with "scater" disappeared
> indeed, but the problem with Russian endings still exists.  I suppose the
> problem is in that the Endings::getWords function stops searching after
> the first occurrence of the word is found and doesn't check if the end
> of the database is reached.  (So the solution may be quite easy: to add
> a loop that looks for a word until the end of the database is reached,
> and if the word is not found, suppose that it's a root itself.)
> 
> I'd like to add a few words about the changes made in version 3.1.3.
> Don't mix two problems: the quality of the search engine (search algorithm)
> and the quality of ispell dictionaries.  Ispell dictionaries were created
> to solve _another_ problem, and the munchlist approach used in it is the way
> to optimize the dictionary structure for that purpose.
> 
> The dictionary we need for htdig is a different thing as it should be
> organized as a list of records that correspond to the grammar rules.
> (BTW, my Russian ispell dictionaries don't use munchlist approach at all
> and strictly follow the grammar.)  I think it would be better to leave
> the correctness of the search algorithm untouched (as in version 3.1.2)
> and make changes to the English word dictionary to tailor the htdig needs.
> So, if you find that the -ness ending is harmful, it's better to create
> a patch that corrects the english.0 file (the original file should be
> distributed in its original form) and do not change the correct search
> algorithm.

Geoff's comment about better ways of constructing the affix rules
got me exploring deeper into what was causing Steve's problem.
Alexander is right in that this is a problem with the affix file,
and not the algorithm.  Ispell doesn't care that wit isn't the root
word of witness, as long as its rules don't lead to misspelled words.
The solution, of course, it merely to remove the "P" flag from wit/MPS
entry in english.0 and rebuild the endings database.  You could do the
same for "high" and "like", if those are a problem too.  The change to
the algorithm was far too broad, and it led to the problem of words like
"skater" not giving us words derived from its root, "skate".

Alexander is also right that getWords() only handles one root for any
given word, but adding a loop to it isn't quite enough.  The createRoot()
function must also be fixed, because right now if two roots can yield
the same word, only the last root encountered is kept as a record for
the derived word in the word2root database.  This code needs to be
patched to append to existing records, or the added loop in getWords()
will do nothing.

However, Steve did correctly point out that the old getWords() algorithm
should try to expand a word as a root, for words derived from it, even
if the word has a root.  The mistake was in failing also to expand the
root if it had one, in cases where words are derived from the root.
To use the "witness" example, if "wit" is taken as the root of witness,
then only wit is expanded, and witness isn't.  Care has to be taken,
though, to avoid putting duplicates in the list because different roots
can lead to the same derived word.

So, here is a moderately tested patch for htfuzzy/htsearch 3.1.5 to
restore pre-3.1.3 functionality to the endings algorithm, so that other
derivations of a word's root are looked up, and additionally it will
handle words with more than one root and look up all the words derived
from these roots, as well as the word itself.  It also fixes the 3 suffix
problems Steve reported without cutting out derivations we want to keep,
by correcting the english.0 definitions for these words.

Apply it in the main htdig-3.1.5 source directory using the command
"patch -p0 < this-message".  Then make and make install.  You may need
to manually copy the installdir/english.0 file to your common directory,
before you do a "htfuzzy endings".

Please give this a shot and let me know how it works out.

--- htfuzzy/Endings.cc.orig     Wed Sep  1 14:48:32 1999
+++ htfuzzy/Endings.cc  Wed Jun 13 16:47:49 2001
@@ -20,6 +20,7 @@
 static char RCSid[] = "$Id: Endings.cc,v 1.2.2.1 1999/09/01 19:50:59 grdetil Exp $";
 #endif
 
+#include "StringList.h"
 #include "Endings.h"
 #include "htfuzzy.h"
 #include <Configuration.h>
@@ -72,53 +73,55 @@ Endings::getWords(char *w, List &words)
     String     word = w;
     word.lowercase();
        
-    if (root2word->Get(word, data) == OK)
+    //
+    // Look for word's root(s).  Some words may have more than one root,
+    // so handle them all.  Whether or not a word has a root, it's assumed
+    // to be root in itself.
+    //
+    if (word2root->Get(word, data) == OK)
     {
-       //
-       // Found the root's permutations
-       //
-       char    *token = strtok(data.get(), " ");
-       while (token)
-       {
-           if (mystrcasecmp(token, w) != 0)
-           {
-               words.Add(new String(token));
-           }
-           token = strtok(0, " ");
-       }
+       word << ' ' << data;
     }
-    else
+    StringList roots(word, " ");
+    Object     *root;
+    roots.Start_Get();
+    while ((root = roots.Get_Next()) != 0)
     {
-       if (word2root->Get(word, data) == OK)
-       {
-           //
-           // Found the root of the word.  We'll add it to the list already
-           //
-           word = data;
-           words.Add(new String(word));
-       }
-       else
+       //
+       // Found a root.  Look for new words that have this root.
+       //
+       word = ((String *)root)->get();
+       if (root2word->Get(word, data) == OK)
        {
-           //
-           // The root wasn't found.  This could mean that the word
-           // is already the root.
-           //
+           word << ' ' << data;
        }
-
-       if (root2word->Get(word, data) == OK)
+       //
+       // Iterate through the root's permutations
+       //
+       char    *token = strtok(word.get(), " ");
+       while (token)
        {
-           //
-           // Found the root's permutations
-           //
-           char        *token = strtok(data.get(), " ");
-           while (token)
+           if (mystrcasecmp(token, w) != 0)
            {
-               if (mystrcasecmp(token, w) != 0)
+               //
+               // This permutation isn't the original word, so we add it
+               // to the list if it's not already there.
+               //
+               Object  *obj;
+               words.Start_Get();
+               while((obj = words.Get_Next()) != 0)
+               {
+                   if (mystrcasecmp(token, ((String *)obj)->get()) == 0)
+                   {
+                       break;
+                   }
+               }
+               if (obj == 0)
                {
                    words.Add(new String(token));
                }
-               token = strtok(0, " ");
            }
+           token = strtok(0, " ");
        }
     }
 }
--- htfuzzy/EndingsDB.cc.orig   Mon Mar 29 09:59:43 1999
+++ htfuzzy/EndingsDB.cc        Wed Jun 13 13:57:54 2001
@@ -189,7 +189,14 @@ Endings::createRoot(Dictionary &rules, c
        //
        for (int i = 0; i < wordList.Count(); i++)
        {
-           w2r->Put(((String *)wordList[i])->get(), word, strlen(input));
+           //
+           // Append to existing record if there is one.
+           //
+           data = "";
+           if (w2r->Get(((String *)wordList[i])->get(), data) == OK)
+               data << ' ';
+           data << word;
+           w2r->Put(((String *)wordList[i])->get(), data);
        }
     }
 
--- installdir/english.0.orig   Tue Feb 16 23:03:57 1999
+++ installdir/english.0        Wed Jun 13 14:40:07 2001
@@ -41246,7 +41246,7 @@ hierology
 hierophant
 hifalutin
 higgle/DGRS
-high/PRTY
+high/RTY
 highball
 highbinder
 highborn
@@ -48349,7 +48349,7 @@ lii
 likability
 likable/P
 likasi
-like/DGJPRSTY
+like/DGJRSTY
 likeable
 likelihood/SU
 likely/PRT
@@ -86638,7 +86638,7 @@ wist
 wistaria
 wisteria
 wistful/PY
-wit/MPS
+wit/MS
 witan
 witch/GS
 witchcraft

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

Reply via email to