On 11/13/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
> Another approach is to implement protected phrases, similar to the
> protected words in stemming. These would be protected from stopword
> processing.

One could use the synonym filter (which can handle multi-token
synonyms) to get this effect.

WordDelimiterFilter => SynonymFilter => StopwordFilter => Stemmer

The SynonymFilter could have the following config:
hepatitis a, hepatitis_a

Set expand="true" on the indexing analyzer and expand="false" on the
query analyzer.

Then a doc containing "hepatitis a" will end up indexing "hepatitis" and
"hepatitis_a".
At query time, all of the following searches will find the doc:
  text:hepatitis
  text:"hepatitis a"
  text:"hepatitis-a"
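
To make the token flow concrete, here is a toy Python simulation of that
chain. It is only a sketch, not Solr code: the function names are
hypothetical, the stemmer is left out, the stopword list is made up, and
term positions are ignored, so a phrase query is treated as "all terms
present".

```python
import re

# Assumed toy stopword list; a real stopword file would be longer.
STOPWORDS = {"a", "an", "the"}
# One rule in the style of the synonyms line above: a list of equivalent entries.
SYNONYM_RULES = [["hepatitis a", "hepatitis_a"]]

def word_delimiter(text):
    """Toy stand-in for WordDelimiterFilter: lowercase, split on spaces and hyphens."""
    return [t for t in re.split(r"[\s\-]+", text.lower()) if t]

def synonym_filter(tokens, expand):
    """Replace a matched entry with every entry of its rule (expand=True)
    or with only the first entry of the rule (expand=False)."""
    out, i = [], 0
    while i < len(tokens):
        matched = False
        for rule in SYNONYM_RULES:
            for entry in rule:
                words = entry.split()
                if tokens[i:i + len(words)] == words:
                    chosen = rule if expand else rule[:1]
                    for e in chosen:
                        out.extend(e.split())
                    i += len(words)
                    matched = True
                    break
            if matched:
                break
        if not matched:
            out.append(tokens[i])
            i += 1
    return out

def analyze(text, expand):
    """WordDelimiter => Synonym => Stopword (stemming omitted in this sketch)."""
    tokens = synonym_filter(word_delimiter(text), expand)
    return [t for t in tokens if t not in STOPWORDS]

def matches(doc, query):
    """Index with expand=True, query with expand=False, ignore positions."""
    doc_terms = set(analyze(doc, expand=True))
    query_terms = analyze(query, expand=False)
    return bool(query_terms) and all(t in doc_terms for t in query_terms)

print(sorted(analyze("hepatitis a", expand=True)))
print(matches("hepatitis a", "hepatitis-a"))
```

In this sketch the stopword "a" is dropped at index time, but the joined
token "hepatitis_a" survives because it is not a stopword.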

> A list of exception words and phrases is a pretty common trick in
> other engines. Otherwise, you go nuts trying to get your analyzer
> to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
> did this.

That's not a bad idea... most of the code from the multi-token
SynonymFilter could be reused to efficiently recognize multi-token
matches.
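
As a rough illustration of what such a multi-token recognizer might look
like, the Python sketch below keeps the protected phrases in a word-level
trie and does a greedy longest match over the token stream. This is an
assumption about one possible design, not the actual SynonymFilter code;
all names here are hypothetical, and joining matched tokens with "_" is
just one way to shield them from the stopword step.

```python
END = None  # sentinel key marking the end of a protected phrase in the trie
STOPWORDS = {"a", "an", "the", "of"}  # assumed toy stopword list

def build_trie(phrases):
    """Build a nested-dict trie keyed by word, one path per protected phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[END] = True
    return root

def protect_phrases(tokens, trie):
    """Greedy longest match: join each protected phrase into a single token."""
    out, i = [], 0
    while i < len(tokens):
        node, j, last = trie, i, None
        # Walk the trie as far as the tokens allow, remembering the
        # end of the longest complete phrase seen so far.
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if END in node:
                last = j
        if last is not None:
            out.append("_".join(tokens[i:last]))
            i = last
        else:
            out.append(tokens[i])
            i += 1
    return out

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

trie = build_trie(["vitamin a", "hepatitis a"])
tokens = protect_phrases("take vitamin a daily".split(), trie)
print(remove_stopwords(tokens))
```

Because the protected phrase is emitted as one token before stopword
removal, the trailing "a" is kept only where it is part of a listed phrase.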

-Yonik
