: > Sadly I can't rely on users' smartness for this :) I have concerns that
: > for stuff like Hepatitis A, it will match just about every document
: > containing hepatitis and the very common 'a' word, anywhere in the
: > document. I can't stopword single letters, because then there would be no
: > way to find documents about 'hepatitis c' and not about 'hepatitis b'
: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

That reminds me ... I seem to remember someone saying once that Nutch also builds word-based n-grams out of its stop words, so searches on "the" or "on" won't match anything because those words are never indexed as single tokens, but if a document contains "the dog in the house" it would match a search on "in the" because the Analyzer would treat that as a single token "in_the".

Something like that might work as well.

-Hoss
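
For illustration only, here is a rough Python sketch of the stop-word n-gram idea described above. It is not Nutch's actual Analyzer code. The stop-word list, the function name `stopword_gram_tokens`, and the choice to pair a stop word with both its left and right neighbour are all assumptions made for the example.

```python
# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"a", "in", "on", "the"}


def stopword_gram_tokens(text, stop_words=STOP_WORDS):
    """Tokenize text so stop words never appear as single tokens.

    Non-stop words are emitted as-is. Any adjacent pair of words in
    which at least one word is a stop word is also emitted as a
    joined token such as "in_the".
    """
    words = text.lower().split()
    tokens = []
    for i, word in enumerate(words):
        # Stop words are never indexed on their own.
        if word not in stop_words:
            tokens.append(word)
        # Emit a bigram whenever either side of the pair is a stop word.
        if i + 1 < len(words):
            nxt = words[i + 1]
            if word in stop_words or nxt in stop_words:
                tokens.append(f"{word}_{nxt}")
    return tokens


if __name__ == "__main__":
    print(stopword_gram_tokens("the dog in the house"))
    print(stopword_gram_tokens("Hepatitis A"))
    print(stopword_gram_tokens("Hepatitis C"))
```

If the same function is applied at query time, a search for "hepatitis a" looks up the single term `hepatitis_a`, so a document that mentions hepatitis and has a stray "a" somewhere else would not match on that term. Under the assumed list, "hepatitis c" still tokenizes to two ordinary terms.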