Gilles and Geoff,

Thank you for your explanation.  I've made the changes that "downgrade"
the version (I've commented out the part of code that tests if the word
is the root), and recompiled htdig.  The error with "skater" did indeed
disappear, but the problem with Russian endings still exists.  I suspect
that the Endings::getWords function stops searching after the first
occurrence of the word is found, instead of continuing until the end of
the database is reached.  (So the solution may be quite easy: add a loop
that keeps looking for the word until the end of the database is reached,
and if the word is not found at all, assume that it is a root itself.)

I'd like to add a few words about the changes made in version 3.1.3.
We shouldn't mix up two problems: the quality of the search engine (the
search algorithm) and the quality of the ispell dictionaries.  Ispell
dictionaries were created to solve _another_ problem, and the munchlist
approach used in them is a way to optimize the dictionary structure for
that purpose.

The dictionary we need for htdig is a different thing, as it should be
organized as a list of records that correspond to the grammar rules.
(BTW, my Russian ispell dictionaries don't use the munchlist approach at
all and strictly follow the grammar.)  I think it would be better to leave
the correct search algorithm untouched (as in version 3.1.2) and to change
the English word dictionary to suit htdig's needs.  So, if you find that
the -ness ending is harmful, it's better to create a patch that corrects
the english.0 file (the original file should still be distributed in its
original form) than to change the correct search algorithm.

- Alexander

-----------------------------

Gilles Detillieux <[EMAIL PROTECTED]> writes:

>According to Alexander I. Lebedev:
>> Gilles,
>> 
>> Thank you for your answer.
>
>It was Geoff who answered you last time, but I'm sure you're welcome.  :-)
>
>> >To quote from the documentation: (attrs.html#search_algorithm)
>> > Each word is first reduced to its word root and then all known legal
>> > endings are used for the matching.
>> >
>> >I think the bug basically comes up because there is some subset of
>> >permutations that are also root words. In Endings::getWords, if a word is
>> >already a root word, then it doesn't bother to check if it's also a
>> >permutation.
>> 
>> I'm afraid the origin of the bug is different.  I tested your idea on
>> one indexed Russian site (26,000 documents) and found the same bug
>> in the case when the word I'm searching for is not a root itself (but
>> has two different roots).  So I guess the program stops searching
>> when it finds the first occurrence of the word, not all of them. (Indeed,
>> in Endings::getWords I don't see a loop that tests if there are other
>> roots.)
>> 
>> - Alexander
>
>What Geoff is describing as a bug in the Endings algorithm is actually a
>deliberate change, submitted by Steve Arlow back in June 1999.  It was to
>prevent the -ness suffix from being stripped on words like witness, and then
>having the word "wit" expanded with a number of inappropriate suffixes.
>That change was incorporated in version 3.1.3.  However, it does indeed
>appear to be the cause of the problem with the whole "skate" vs. "skater"
>test.
>
>I checked htsearch from 3.1.2, and its unpatched endings algorithm produced
>the same results with "skate" or "skater", i.e.
>
>   (skater or skate or skated or skating or skates or skaters)
>
>So, either the problem you've run into with Russian words is different than
>the skater test, or going back to 3.1.2 will solve the problem for you.
>
>In either case, I think Steve Arlow's patch is more far-reaching than any
>of us thought before.  It seems to me that this should be optional, unless
>we can find a smarter way of doing this.


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
