: > Sadly I can't rely on users smartness for this :) I have concerns that
: > for stuff like Hepatitis A, it will match just about every document
: > containing hepatitis and the very common 'a' word, anywhere in the
: > document. I can't stopword single letters, cause then there would be no
: > way to find documents about 'hepatitis c' and not about 'hepatitis b'

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

That reminds me ... i seem to remember someone saying once that Nutch also
builds word based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a document contains "the dog in the house" it would
match a search on "in the" because the Analyzer would treat that as a
single token "in_the".

something like that might work as well.
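
to make the idea concrete, here is a rough sketch in plain Python. it is
not Nutch's actual code and not a Lucene Analyzer; the stop word list and
the names gram_tokens and matches are made up for illustration, and i'm
assuming the same analysis is applied at both index and query time.

```python
# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "the", "in", "on", "of"}


def gram_tokens(text, stop_words=STOP_WORDS):
    """Emit non-stop words as plain tokens, and emit a joined "x_y" token
    for every adjacent pair in which at least one word is a stop word.
    Stop words are never emitted on their own."""
    words = text.lower().split()
    out = []
    for i, word in enumerate(words):
        if word not in stop_words:
            out.append(word)
        if i + 1 < len(words):
            nxt = words[i + 1]
            if word in stop_words or nxt in stop_words:
                out.append(word + "_" + nxt)
    return out


def matches(query, document):
    """True if every token produced from the query also occurs in the
    document. A query that analyzes to nothing (e.g. just "the") matches
    no documents."""
    query_tokens = gram_tokens(query)
    return bool(query_tokens) and set(query_tokens) <= set(gram_tokens(document))
```

with that, a search for "hepatitis a" turns into the tokens "hepatitis"
and "hepatitis_a", so a document only matches if the two words actually
sit next to each other, and a search on a bare "a" matches nothing.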



-Hoss
