Hello everyone,

Thanks for all your answers. Synonym-based approaches won't work because the medical / research field is evolving far too fast; the list would be huge and would become unmaintainable very quickly. Also, I can't rely on score because I'm sorting by date, so I need to completely eliminate the problem of 'hiv' matching in one part of the doc and '1' in another. I want docs that match HIV-1, or Polymyxin B, or hepatitis A; I don't want a doc that says 'A patient was cured of hepatitis C' when I search for 'hepatitis a'.
: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.
Is this a filter that I could implement easily in Solr? I have never written Java, but I guess it can't be that complicated. Any help would be appreciated.

: That reminds me ... i seem to remember someone saying once that Nutch also
: builds word based n-grams out of its stop words, so searches on "the"
: or "on" won't match anything because those words are never indexed as
: single tokens, but if a document contains "the dog in the house" it would
: match a search on "in the" because the Analyzer would treat that as a
: single token "in_the".

This looks like exactly what I'm looking for. Is it related to the 'Nutch pre-filtering' above? This way, if I stopword single letters and numbers, it would still index 'hepatitis_a' as a single token and match a search on 'hepatitis a' (non-phrase search) without hitting 'a patient has hepatitis'? I guess I'd have to apply the filter to the query too, so that it turns the query into hepatitis_a?
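
To make sure I understand the idea, here is a rough sketch in Python (not Solr or Nutch code, just my reading of the description above). The stop list, the rule that single characters and pure numbers count as stop words, and the function names are all my own assumptions for illustration; I don't know how Nutch actually implements it.

```python
import re

# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "in", "on", "of"}


def is_stop(token):
    """Assumption: listed words, single characters and pure numbers are stop words."""
    return token in STOP_WORDS or len(token) == 1 or token.isdigit()


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stop_bigrams(text):
    """Emit non-stop words as-is, plus a joined token for every adjacent
    pair in which at least one word is a stop word. Stop words are never
    emitted on their own."""
    tokens = tokenize(text)
    out = []
    for i, tok in enumerate(tokens):
        if not is_stop(tok):
            out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if is_stop(tok) or is_stop(nxt):
                out.append(tok + "_" + nxt)
    return out


def matches(query, doc):
    """Non-phrase AND search: every query token must appear in the document."""
    return set(stop_bigrams(query)) <= set(stop_bigrams(doc))


print(stop_bigrams("the dog in the house"))
print(stop_bigrams("hepatitis a"))
print(matches("hepatitis a", "A patient was cured of hepatitis C"))
```

Because the same function is applied to both the document and the query, 'hepatitis a' becomes the tokens hepatitis and hepatitis_a, and a document that only has 'a' and 'hepatitis' in different places never produces hepatitis_a.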

Basically, it's an alternative to what I proposed as a solution: rewrite the query to include phrase queries when you find a stopword, provided you index them anyway. Still, this solution looks better, as the index would probably be smaller than if I didn't stopword single letters at all? For reference, what I proposed was:

My thought is to parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers. If a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?
Any chance at all that this kind of filter gets implemented in Solr? If not, pointers on how to do it myself would be appreciated. I can't say I have a clue right now (I have never written Java, and the only Lucene programming I did was via a PHP bridge).
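
For comparison, the query rewrite I proposed could be sketched like this, again in Python and only as an illustration. The helper names are hypothetical, and I'm assuming that "short" means a single character or a pure number, and that every short term must end up inside a two-word phrase with one of its neighbours.

```python
def is_short(token):
    """Assumption: single characters and pure numbers need a phrase around them."""
    return len(token) == 1 or token.isdigit()


def segmentations(tokens):
    """List every way to split the tokens into single terms and two-word
    phrases such that each short token sits inside a phrase and each
    phrase contains at least one short token."""
    if not tokens:
        return [[]]
    results = []
    head = tokens[0]
    # Option 1: join the first two tokens into a phrase.
    if len(tokens) >= 2 and (is_short(head) or is_short(tokens[1])):
        phrase = '"%s %s"' % (head, tokens[1])
        for rest in segmentations(tokens[2:]):
            results.append([phrase] + rest)
    # Option 2: keep the first token on its own (not allowed for short tokens).
    if not is_short(head):
        for rest in segmentations(tokens[1:]):
            results.append([head] + rest)
    return results


def rewrite(query):
    """AND the units of each segmentation, then OR the alternatives."""
    alternatives = segmentations(query.split())
    if not alternatives:
        return query  # e.g. a lone short term; leave the query untouched
    clauses = [" AND ".join(units) for units in alternatives]
    if len(clauses) == 1:
        return clauses[0]
    return " OR ".join("(%s)" % clause for clause in clauses)


print(rewrite("HIV 1 hepatitis"))
print(rewrite("hepatitis a"))
```

The drawback compared to the n-gram approach is that the single letters and numbers still have to be in the index for the phrase queries to work, and the number of alternatives grows when a query contains several short terms.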

Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212


