Hello everyone,
Thanks for all your answers. Synonym-based approaches won't work
because the medical / research field is evolving far too fast; the list
would be huge and would become unmaintainable very quickly. Also, I
can't rely on score because I'm sorting by date, so I need to completely
eliminate the problem of 'hiv' matching in one part of the doc and '1'
in another. I want docs that match HIV-1, Polymyxin B, or hepatitis A,
and I don't want docs that match 'A patient was cured of hepatitis C'
when I search for 'hepatitis a'.
: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.
Is this a filter that I could implement easily in Solr? I have never
written Java, but I guess it can't be that complicated. Any help would
be appreciated.
That reminds me ... I seem to remember someone once saying that Nutch
also builds word-based n-grams out of its stop words, so searches on
"the" or "on" won't match anything because those words are never indexed
as single tokens, but if a document contains "the dog in the house" it
would match a search on "in the" because the Analyzer would treat that
as a single token "in_the".
This looks like exactly what I'm looking for. Is it related to the Nutch
phrase pre-filtering mentioned above? This way, if I stopword single
letters and numbers, it would still index 'hepatitis_a' as a single
token and match a search on 'hepatitis a' (non-phrase search) without
hitting 'a patient has hepatitis'? I guess I'd have to apply the filter
to the query too, so that it turns the query into hepatitis_a?
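To make sure I understand the idea, here is a minimal Python sketch of
it. It is not Nutch's or Solr's actual code, and it only illustrates the
logic, not a real Java filter. The names STOP_WORDS, is_stop,
gram_tokens and matches are my own, and the stop list is an assumption.
Stop words (here also single letters and numbers) are never emitted
alone, only glued to a neighbouring word, and the same function is
applied to both the document and the query.

```python
import re

# Illustrative stop list only; a real one would be much longer.
STOP_WORDS = {"the", "in", "on", "of", "was"}


def is_stop(word):
    """Treat listed stop words, single characters and numbers as stop words."""
    return len(word) == 1 or word.isdigit() or word in STOP_WORDS


def gram_tokens(text):
    """Emit non-stop words as-is, plus a joined pair for every adjacent
    word pair in which at least one word is a stop word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = []
    for i, word in enumerate(words):
        if not is_stop(word):
            tokens.append(word)
        if i + 1 < len(words) and (is_stop(word) or is_stop(words[i + 1])):
            tokens.append(word + "_" + words[i + 1])
    return tokens


def matches(query, document):
    """Non-phrase AND search: every query token must occur in the document.
    A query made only of a lone stop word yields no tokens and matches nothing."""
    query_tokens = gram_tokens(query)
    doc_tokens = set(gram_tokens(document))
    return bool(query_tokens) and all(t in doc_tokens for t in query_tokens)


print(gram_tokens("the dog in the house"))
print(matches("hepatitis a", "A patient was cured of hepatitis C"))
```

One caveat with this sketch: a stop word is glued to both of its
neighbours, so a query like 'hiv 1 hepatitis' also produces
'1_hepatitis' and would require that pair in the document. A query-side
variant would probably need to keep only one of the two pairings.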
Basically, it's another way of doing what I proposed as a solution:
rewrite the query to include phrase queries whenever it contains a
stopword, provided the stopwords are indexed anyway. Still, this
solution looks better, as the index would probably be smaller than if I
didn't stopword single letters at all? For reference, what I proposed
was:
My thought is to parse the user query and rephrase it to do phrase
searches on nearby terms containing single letters / numbers. If a
user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND
hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?
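A rough sketch of that rewrite in Python, just to show the logic (the
function names are hypothetical, and the output is plain Lucene-style
query syntax that would still have to be sent to Solr). It assumes
whitespace-separated terms, keeps them as typed, and builds one
alternative per neighbour of each single letter / number. Queries with
several such terms are not handled fully: each alternative only turns
one pair into a phrase.

```python
def is_single_letter_or_number(term):
    """True for terms like 'A', 'b' or '1' that are meaningless on their own."""
    return len(term) == 1 or term.isdigit()


def rewrite_query(query):
    """Rewrite a query so each short term is searched as a phrase with a neighbour."""
    terms = query.split()
    alternatives = []
    for i, term in enumerate(terms):
        if not is_single_letter_or_number(term):
            continue
        # Pair the short term with its left neighbour, then with its right one.
        for start in (i - 1, i):
            if start < 0 or start + 1 >= len(terms):
                continue
            phrase = f'"{terms[start]} {terms[start + 1]}"'
            rest = terms[:start] + terms[start + 2:]
            clause = " AND ".join([phrase] + rest)
            if clause not in alternatives:
                alternatives.append(clause)
    if not alternatives:
        # No single letters or numbers: plain AND of all terms.
        return " AND ".join(terms)
    if len(alternatives) == 1:
        return alternatives[0]
    return " OR ".join(f"({clause})" for clause in alternatives)


print(rewrite_query("HIV 1 hepatitis"))
```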
Is there any chance that this kind of filter gets implemented in Solr?
If not, pointers on how to do it myself would be appreciated. I can't
say I have a clue right now (I have never written Java; the only Lucene
programming I have done was via a PHP bridge).
Thanks for the help,
Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212