Re: Index & search questions; special cases

Michael Imbeault Mon, 13 Nov 2006 20:35:46 -0800

Hello everyone,

Thanks for all your answers. Synonym-based approaches won't work because the medical / research field is evolving way too fast; the list would be huge and would become unmaintainable very quickly. Anyway, I can't rely on score because I'm sorting by date, so I need to completely eliminate the problem of 'hiv' matching in one part of the doc and '1' in another part. If I want docs that match HIV-1, or Polymyxin B, or hepatitis A, I don't want docs that contain 'A patient was cured of hepatitis C' when I search for 'hepatitis a'.

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

Is this a filter that I could easily implement in Solr? I've never written Java, but I guess it can't be that complicated. Any help would be appreciated.

: That reminds me ... i seem to remember someone saying once that Nutch also
: builds word based n-grams out of its stop words, so searches on "the"
: or "on" won't match anything because those words are never indexed as
: single tokens, but if a document contains "the dog in the house" it would
: match a search on "in the" because the Analyzer would treat that as a
: single token "in_the".

This looks like exactly what I'm looking for. Is it related to the 'Nutch pre-filtering' above? This way, if I stopword single letters and numbers, it would still index 'hepatitis_a' as a single token and match a search on 'hepatitis a' (non-phrase search) without hitting 'a patient has hepatitis'? I guess I'd have to apply the filter to the query too, so that it turns the query into hepatitis_a?
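
To make the idea concrete, here is a minimal sketch in plain Python. It is not Nutch or Lucene code and does not claim to reproduce what Nutch actually does; the function names are made up for illustration. It assumes that "stop words" means single letters and single digits, that tokens are lowercased runs of ASCII letters and digits, and that a stop token is only ever glued onto the token before it. The same function is applied to documents and to queries.

```python
import re


def is_stop(token):
    """Assumption: a stop word is any single letter or single digit."""
    return len(token) == 1


def gram_tokens(text):
    """Lowercase, split on non-alphanumerics, and glue each stop token
    onto the token before it. Stop tokens are never emitted alone."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    out = []
    for i, word in enumerate(words):
        if not is_stop(word):
            out.append(word)
        elif i > 0:
            # e.g. "hepatitis", "a" -> "hepatitis_a"
            out.append(words[i - 1] + "_" + word)
    return out


def matches(query, doc):
    """Non-phrase AND search: every query token must occur in the doc."""
    return set(gram_tokens(query)) <= set(gram_tokens(doc))
```

With this, gram_tokens("hepatitis a") gives ['hepatitis', 'hepatitis_a'], so matches("hepatitis a", "A patient was cured of hepatitis C") is False (that doc only has 'hepatitis_c'), while a doc containing "hepatitis A" matches. A leading stop token with nothing before it is simply dropped in this sketch.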

Basically, it's another way to do what I proposed as a solution: rewrite the query to include phrase queries when you find a stopword, if you index them anyway. Still, this solution looks better, as the index would probably be smaller than if I didn't stopword single letters at all? For reference, what I proposed was:

: My thought is to parse the user query and rephrase it to do phrase searches on nearby terms containing single letters / numbers. If a user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?
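
As a rough illustration of that rewrite, here is a small Python sketch that works on the query string only (no Solr or Lucene calls). The function name rewrite_query is hypothetical, terms are assumed to be whitespace-separated, and a "single letter / number" is taken to mean any one-character term. It is only meant for queries with one such term; with several, the leftover terms in each clause can still contain bare single characters.

```python
def rewrite_query(query):
    """Pair every one-character term with its left and right neighbour
    as a phrase, AND the phrase with the remaining terms, and OR the
    resulting alternatives together."""
    terms = query.split()
    clauses = []
    for i, term in enumerate(terms):
        if len(term) != 1:
            continue
        # lo is the start of the two-term phrase: left pairing, then right.
        for lo in (i - 1, i):
            hi = lo + 1
            if lo < 0 or hi >= len(terms):
                continue  # no neighbour on that side
            phrase = '"%s %s"' % (terms[lo], terms[hi])
            rest = terms[:lo] + terms[hi + 1:]
            if rest:
                clause = "(" + " AND ".join([phrase] + rest) + ")"
            else:
                clause = phrase
            if clause not in clauses:
                clauses.append(clause)
    if not clauses:
        return query  # nothing to rewrite
    return " OR ".join(clauses)
```

For example, rewrite_query("HIV 1 hepatitis") returns ("HIV 1" AND hepatitis) OR ("1 hepatitis" AND HIV), and rewrite_query("hepatitis a") returns just the phrase "hepatitis a".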

Is there any chance at all that this kind of filter gets implemented in Solr? If not, pointers on how to do it myself would be appreciated. I can't say I have a clue right now (I've never written Java; the only Lucene programming I did was via a PHP bridge).


Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212
