Re: Index & search questions; special cases

2006-11-13 Thread Walter Underwood
On 11/12/06 8:52 PM, "Michael Imbeault" <[EMAIL PROTECTED]>
wrote:

> Sadly I can't rely on users smartness for this :) I have concerns that
> for stuff like Hepatitis A, it will match just about every document
> containing hepatitis and the very common 'a' word, anywhere in the
> document. I can't stopword single letters, cause then there would be no
> way to find documents about 'hepatitis c' and not about 'hepatitis b'
> for example. I will test my solution and report; if you have any other
> ideas, just tell me.

Nutch has phrase pre-filtering which helps with this. It indexes the
phrase fragments as separate terms and uses that set of matches to
filter the set of matching documents.

Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.

A list of exception words and phrases is a pretty common trick in
other engines. Otherwise, you go nuts trying to get your analyzer
to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
did this.

wunder
-- 
Walter Underwood
Search Guru, Netflix

 



Re: Index & search questions; special cases

2006-11-13 Thread Yonik Seeley

On 11/13/06, Walter Underwood <[EMAIL PROTECTED]> wrote:

> Another approach is to implement protected phrases, similar to the
> protected words in stemming. These would be protected from stopword
> processing.


One could use the synonym filter (which can handle multi-token
synonyms) to get this effect.

WordDelimiterFilter => SynonymFilter => StopwordFilter => Stemmer

The SynonymFilter could have the following config:
hepatitis a, hepatitis_a

Do expand="true" on the indexing analyzer, and expand="false" on the
query analyzer

Then, a doc with "hepatitis a" will end up indexing "hepatitis" and
"hepatitis_a".
And at query time all the following searches will find the doc:
  text:hepatitis
  text:"hepatitis a"
  text:"hepatitis-a"


> A list of exception words and phrases is a pretty common trick in
> other engines. Otherwise, you go nuts trying to get your analyzer
> to handle ".NET" and "vitamin a". I know that AltaVista and Inktomi
> did this.


That's not a bad idea... most of the code from the multi-token
SynonymFilter could be reused to efficiently recognize multi-token
matches.

-Yonik


Re: Index & search questions; special cases

2006-11-13 Thread Chris Hostetter

: > Sadly I can't rely on users smartness for this :) I have concerns that
: > for stuff like Hepatitis A, it will match just about every document
: > containing hepatitis and the very common 'a' word, anywhere in the
: > document. I can't stopword single letters, cause then there would be no
: > way to find documents about 'hepatitis c' and not about 'hepatitis b'

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.

That reminds me ... I seem to remember someone saying once that Nutch also
builds word-based n-grams out of its stop words, so searches on "the"
or "on" won't match anything because those words are never indexed as
single tokens, but if a document contains "the dog in the house" it would
match a search on "in the" because the Analyzer would treat that as a
single token "in_the".

Something like that might work as well.
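
A minimal sketch of that stop-word n-gram idea in plain Python, following the
description in this message rather than Nutch's actual code. The stop list,
the underscore joiner, and the names common_grams and matches are assumptions
made up for illustration. Stop words are never emitted on their own; each one
is only emitted joined to a neighbouring token.

```python
# Illustrative stop list; a real deployment would pick its own (assumption).
STOP_WORDS = frozenset({"a", "in", "on", "the"})


def common_grams(text, stop_words=STOP_WORDS):
    """Return index terms: plain tokens for non-stop words, plus a joined
    bigram for every adjacent pair that contains at least one stop word."""
    tokens = text.lower().split()
    terms = []
    for i, token in enumerate(tokens):
        # Stop words are never indexed as single tokens.
        if token not in stop_words:
            terms.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in stop_words or following in stop_words:
                terms.append(token + "_" + following)
    return terms


def matches(query, document):
    """True if every term produced from the query is also produced from
    the document. A query made only of a lone stop word yields no terms
    and therefore matches nothing."""
    query_terms = common_grams(query)
    doc_terms = set(common_grams(document))
    return bool(query_terms) and all(t in doc_terms for t in query_terms)
```

With this toy, matches("hepatitis a", ...) only succeeds for documents where
"hepatitis" is immediately followed by "a", because the query is analyzed
into the terms hepatitis and hepatitis_a.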



-Hoss



Re: Index & search questions; special cases

2006-11-13 Thread Yonik Seeley

On 11/12/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

> - Somewhat related: Let's say I index "Polymyxin B". If I stopword
> single letters, would a phrase search ("Polymyxin B") still find the
> right documents (I don't think so, but still)? If not, I'll have to
> index single letters; how do I prevent the same problem as in the first
> question (i.e., a search on Polymyxin B yielding documents with
> Polymyxin and B, but not close to one another)?


The general problem seems to be that you can't tell what should be in a
phrase search and what shouldn't.

You could try throwing everything in a sloppy phrase query, so at
least scores will go up when terms are closer together (in general).

You could also try an exact phrase query, and if you don't get enough
results, follow it up with another strategy (like what you have
below).


> My thought is to parse the user query and rephrase it to do phrase
> searches on nearby terms containing single letters / numbers. If a user
> searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND hepatitis) OR
> ("1 hepatitis" AND hiv). Is it a sensible solution?


That might work.
Whatever general strategy you end up trying, you can probably boost
relevancy with some domain specific knowledge injected with something
like the SynonymFilter.
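
A rough sketch of the proposed query rewrite in plain Python, under stated
assumptions: is_short and rewrite_query are hypothetical names, a "short"
term means a single character or an all-digit token, and the output is just
a query string using the standard quoted-phrase / AND / OR syntax. Each
alternative phrases one short term with one neighbour; any other terms,
including other short ones, are ANDed in unchanged.

```python
def is_short(token):
    """A single letter/character or a pure number (assumed definition)."""
    return len(token) == 1 or token.isdigit()


def rewrite_query(query):
    """Rewrite a whitespace-separated query so that each short term is
    tried as a phrase with its left and with its right neighbour."""
    tokens = query.split()
    alternatives = []
    for i, token in enumerate(tokens):
        if not is_short(token):
            continue
        # start is the index of the first word of the two-word phrase.
        for start in (i - 1, i):
            if start < 0 or start + 1 >= len(tokens):
                continue
            phrase = '"%s %s"' % (tokens[start], tokens[start + 1])
            rest = tokens[:start] + tokens[start + 2:]
            clause = " AND ".join([phrase] + rest)
            if rest:
                clause = "(" + clause + ")"
            if clause not in alternatives:
                alternatives.append(clause)
    if not alternatives:
        # No short terms: plain conjunction of all terms.
        return " AND ".join(tokens)
    return " OR ".join(alternatives)
```

For example, rewrite_query("HIV 1 hepatitis") gives
("HIV 1" AND hepatitis) OR ("1 hepatitis" AND HIV), and
rewrite_query("Polymyxin B") gives the single phrase "Polymyxin B".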

-Yonik


Re: Index & search questions; special cases

2006-11-13 Thread Yonik Seeley

On 11/13/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:

> The SynonymFilter could have the following config:
> hepatitis a, hepatitis_a


Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a
so that when expand="false" for querying, hepatitis a is mapped to hepatitis_a
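
To make the expand behaviour concrete, here is a toy model in plain Python.
It is not Solr's SynonymFilter: it ignores token positions, and SYNONYMS,
synonym_filter, stop_filter and analyze are names invented for this sketch.
It assumes that with expand on, a match is replaced by every variant in the
entry, and with expand off, by the first variant only, and that 'a' is a
stop word removed after the synonym step.

```python
STOP_WORDS = frozenset({"a"})  # assumed stop list for the example

# Models the synonym line "hepatitis_a, hepatitis a" as token tuples.
SYNONYMS = [("hepatitis_a",), ("hepatitis", "a")]


def synonym_filter(tokens, variants, expand):
    """Replace any variant found in tokens by all variants (expand=True)
    or by the first variant only (expand=False)."""
    out = []
    i = 0
    while i < len(tokens):
        match = next(
            (v for v in variants if tuple(tokens[i:i + len(v)]) == v), None
        )
        if match is None:
            out.append(tokens[i])
            i += 1
            continue
        for variant in (variants if expand else variants[:1]):
            out.extend(variant)
        i += len(match)
    return out


def stop_filter(tokens):
    """Drop stop words; runs after the synonym step, as in the chain above."""
    return [t for t in tokens if t not in STOP_WORDS]


def analyze(text, expand):
    """Synonym step followed by stop-word removal."""
    return stop_filter(synonym_filter(text.lower().split(), SYNONYMS, expand))
```

Here analyze("hepatitis a outbreak", expand=True) yields hepatitis_a,
hepatitis, outbreak on the index side, while analyze("hepatitis a",
expand=False) yields only hepatitis_a on the query side, so a document that
merely contains "a" somewhere and "hepatitis" elsewhere is not matched.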

-Yonik


Re: Index & search questions; special cases

2006-11-13 Thread Erik Hatcher


On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
> That reminds me ... I seem to remember someone saying once that Nutch also
> builds word-based n-grams out of its stop words, so searches on "the"
> or "on" won't match anything because those words are never indexed as
> single tokens, but if a document contains "the dog in the house" it would
> match a search on "in the" because the Analyzer would treat that as a
> single token "in_the".



Yup we covered this in LIA:






Re: Index & search questions; special cases

2006-11-13 Thread Otis Gospodnetic
Indeed.  CommonGrams.java in Nutch is the place to look.

Otis

----- Original Message -----
From: Erik Hatcher <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 13, 2006 2:08:51 PM
Subject: Re: Index & search questions; special cases


On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
> That reminds me ... I seem to remember someone saying once that Nutch also
> builds word-based n-grams out of its stop words, so searches on "the"
> or "on" won't match anything because those words are never indexed as
> single tokens, but if a document contains "the dog in the house" it would
> match a search on "in the" because the Analyzer would treat that as a
> single token "in_the".


Yup we covered this in LIA:









Re: Index & search questions; special cases

2006-11-13 Thread Michael Imbeault

Hello everyone,

Thanks for all your answers. Synonym-based approaches won't work
because the medical / research field is evolving way too fast; the list
would be huge and would become unmaintainable very quickly. Anyway,
I can't rely on score because I'm sorting by date, so I need to
completely eliminate the problem of 'hiv' appearing in one part of the
doc and '1' in another. If I want docs that fit HIV-1, or Polymyxin B,
or hepatitis A, I don't want docs that fit 'A patient was cured of
hepatitis C' when I search for 'hepatitis a'.

: Nutch has phrase pre-filtering which helps with this. It indexes the
: phrase fragments as separate terms and uses that set of matches to
: filter the set of matching documents.
  
Is this a filter that I could easily implement in Solr? I have never
written Java, but it can't be that complicated, I guess. Any help would be
appreciated.



> That reminds me ... I seem to remember someone saying once that Nutch also
> builds word-based n-grams out of its stop words, so searches on "the"
> or "on" won't match anything because those words are never indexed as
> single tokens, but if a document contains "the dog in the house" it would
> match a search on "in the" because the Analyzer would treat that as a
> single token "in_the".
  


This looks like exactly what I'm looking for. Is it related to the above 
'nutch pre-filtering'? This way if I stopword single letters and 
numbers, it would still index 'hepatitis_a' as a single token, and match 
a search on 'hepatitis a' (non-phrase search) without hitting 'a patient 
has hepatitis'? I guess I'd have to apply the filter to the query too,
so it turns the query into hepatitis_a?


Basically, it's another way to do what I proposed as a solution - rewrite
the query to include phrase queries when you find a stopword, if you 
index them anyway. Still, this solution looks better, as the size of the 
index would probably be smaller than if I didn't stopword single letters 
at all? For reference, what I proposed was:


> My thought is to parse the user query and rephrase it to do phrase
> searches on nearby terms containing single letters / numbers. If a
> user searches for HIV 1 hepatitis, I'd rewrite it as ("HIV 1" AND
> hepatitis) OR ("1 hepatitis" AND hiv). Is it a sensible solution?

Any chance at all this kind of filter gets implemented in Solr? If
not, pointers on how to do it myself would be appreciated - I can't
say I have a clue right now (I have never written Java; the only Lucene
programming I did was via a PHP bridge).


Thanks for the help,

Michael Imbeault
CHUL Research Center (CHUQ)
2705 boul. Laurier
Ste-Foy, QC, Canada, G1V 4G2
Tel: (418) 654-2705, Fax: (418) 654-2212





Re: Re: Index & search questions; special cases

2006-11-13 Thread Mike Klaas

On 11/13/06, Michael Imbeault <[EMAIL PROTECTED]> wrote:

> Hello everyone,
>
> Thanks for all your answers; synonyms based approaches won't work
> because the medical / research field is evolving way too fast; it would


Another approach is to extract the term explicitly.  An
easy-to-implement approach is the C/NC ATR algorithm.

-Mike