On 11/13/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
Another approach is to implement protected phrases, similar to the protected words in stemming. These would be protected from stopword processing.
One could use the synonym filter (which can handle multi-token synonyms) to get this effect.

  WordDelimiterFilter => SynonymFilter => StopwordFilter => Stemmer

The SynonymFilter could have the following config:

  hepatitis a, hepatitis_a

Do expand="true" on the indexing analyzer, and expand="false" on the query analyzer.

Then, a doc with "hepatitis a" will end up indexing "hepatitis" and "hepatitis_a".

And at query time all the following searches will find the doc:

  text:hepatitis
  text:"hepatitis a"
  text:"hepatitis-a"
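To make the token flow concrete, here is a rough toy model of that chain in Python. It is only a sketch, not Solr's code: the function names and the stopword list are made up for illustration, the stemmer is left out, token positions are flattened into a plain list, and it assumes expand="false" reduces every member of a synonym group to the first entry on the config line.

```python
# Toy model of the analysis chain; names and stopword list are hypothetical.
STOPWORDS = {"a", "an", "the"}

# One synonym group, mirroring the config line "hepatitis a, hepatitis_a".
SYNONYM_GROUPS = [[("hepatitis", "a"), ("hepatitis_a",)]]


def synonym_filter(tokens, expand):
    """Replace a matched phrase with every member of its group (expand=True)
    or with only the first member (expand=False, an assumption about Solr)."""
    out, i = [], 0
    while i < len(tokens):
        for group in SYNONYM_GROUPS:
            match = next(
                (p for p in group if tuple(tokens[i:i + len(p)]) == p), None
            )
            if match:
                members = group if expand else group[:1]
                for phrase in members:
                    out.extend(phrase)
                i += len(match)
                break
        else:
            # No group matched at this position; pass the token through.
            out.append(tokens[i])
            i += 1
    return out


def stop_filter(tokens):
    """Drop stopwords; runs after the synonym filter, as in the chain above."""
    return [t for t in tokens if t not in STOPWORDS]


def analyze(text, expand):
    # Lowercasing and splitting on hyphens is a crude stand-in for
    # WordDelimiterFilter; the stemmer step is omitted.
    tokens = text.lower().replace("-", " ").split()
    return stop_filter(synonym_filter(tokens, expand))


indexed = analyze("hepatitis a", expand=True)
queries = ["hepatitis", "hepatitis a", "hepatitis-a"]
query_terms = {q: analyze(q, expand=False) for q in queries}
```

In this model the document indexes both "hepatitis" and "hepatitis_a", and each of the three queries reduces to terms that are present in the indexed list.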
A list of exception words and phrases is a pretty common trick in other engines. Otherwise, you go nuts trying to get your analyzer to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi did this.
That's not a bad idea... most of the code from the multi-token SynonymFilter could be reused to efficiently recognize multi-token matches.

-Yonik
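For what it's worth, the protected-phrase idea might look roughly like this in Python. This is a hypothetical sketch, not an existing Solr filter: the phrase list, the stopword list, and the function name are all invented for illustration, and a real filter would presumably reuse the SynonymFilter's matching instead of this linear scan.

```python
# Hypothetical stopword and protected-phrase lists, for illustration only.
STOPWORDS = {"a", "an", "the", "of"}
PROTECTED_PHRASES = [("vitamin", "a"), ("hepatitis", "a")]


def stop_filter_with_protected_phrases(tokens):
    """Remove stopwords, except where they fall inside a protected phrase."""
    out, i = [], 0
    while i < len(tokens):
        # Look for a protected phrase starting at the current position.
        phrase = next(
            (p for p in PROTECTED_PHRASES if tuple(tokens[i:i + len(p)]) == p),
            None,
        )
        if phrase:
            # Emit the whole phrase untouched, stopwords included.
            out.extend(phrase)
            i += len(phrase)
        elif tokens[i] in STOPWORDS:
            i += 1  # ordinary stopword: drop it
        else:
            out.append(tokens[i])
            i += 1
    return out


result = stop_filter_with_protected_phrases("a source of vitamin a".split())
```

Here the leading "a" and the "of" are dropped as usual, while the "a" in "vitamin a" survives because it sits inside a protected phrase.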