Re: Index search questions; special cases

2006-11-19 Thread Chris Hostetter
: Chris, thanks for the tips (or should I say, detailed explanation!). I : actually got it working! It was a pain at first (never did any java, and good to know .. glad it worked out for you. : If Solr is interested in the filter, just tell me (and how should I do : to contribute it). The full

Re: Index search questions; special cases

2006-11-18 Thread Michael Imbeault
CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested private static class Filter extends TokenFilter which doesn't really have any external dependencies. If you extract that

Re: Index search questions; special cases

2006-11-15 Thread Sami Siren
Erik Hatcher wrote: Yeah, the Nutch code is highly intertwined with its unique configuration infrastructure and makes it hard to pull pieces of it out like this. This is a critique that has been heard a lot (mainly because it's true :) It would be really cool if different camps of Lucene could

Re: Index search questions; special cases

2006-11-15 Thread Chris Hostetter
: Yeah, the Nutch code is highly intertwined with its unique configuration : infrastructure and makes it hard to pull pieces of it out like this. that CommonGrams inner Filter class seemed like it could be extracted easily enough. : This is a critique that has been heard a lot (mainly because

Re: Index search questions; special cases

2006-11-14 Thread Chris Hostetter
: : Nutch has phrase pre-filtering which helps with this. It indexes the : : phrase fragments as separate terms and uses that set of matches to : : filter the set of matching documents. : That reminds me ... i seem to remember someone saying once that Nutch also : builds word based n-grams

Re: Index search questions; special cases

2006-11-14 Thread Erik Hatcher
On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote: CommonGrams itself seems to have some other dependencies on nutch because of other utilities in the same class, but based on a quick skim, what you really want is the nested private static class Filter extends TokenFilter which doesn't

Re: Index search questions; special cases

2006-11-13 Thread Walter Underwood
On 11/12/06 8:52 PM, Michael Imbeault [EMAIL PROTECTED] wrote: Sadly I can't rely on users' smartness for this :) I have concerns that for stuff like Hepatitis A, it will match just about every document containing hepatitis and the very common 'a' word, anywhere in the document. I can't

Re: Index search questions; special cases

2006-11-13 Thread Yonik Seeley
On 11/13/06, Walter Underwood [EMAIL PROTECTED] wrote: Another approach is to implement protected phrases, similar to the protected words in stemming. These would be protected from stopword processing. One could use the synonym filter (which can handle multi-token synonyms) to get this effect.

Re: Index search questions; special cases

2006-11-13 Thread Chris Hostetter
: Sadly I can't rely on users' smartness for this :) I have concerns that : for stuff like Hepatitis A, it will match just about every document : containing hepatitis and the very common 'a' word, anywhere in the : document. I can't stopword single letters, cause then there would be no : way

Re: Index search questions; special cases

2006-11-13 Thread Yonik Seeley
On 11/12/06, Michael Imbeault [EMAIL PROTECTED] wrote: - Somewhat related : Let's say I index Polymyxin B. If I stopword single letters, would a phrase search (Polymyxin B) still find the right documents (I don't think so, but still)? If not, I'll have to index single letters; how do I prevent

Re: Index search questions; special cases

2006-11-13 Thread Yonik Seeley
On 11/13/06, Yonik Seeley [EMAIL PROTECTED] wrote: The SynonymFilter could have the following config: hepatitis a, hepatitis_a Oops, the synonyms should be reversed like so: hepatitis_a, hepatitis a so that when expand=false for querying, hepatitis a is mapped to hepatitis_a -Yonik
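
The effect Yonik describes (mapping the two query tokens "hepatitis a" to the single token hepatitis_a) can be illustrated with a minimal Python sketch. This is not Solr's SynonymFilter and does not use any Solr API; the function name `protect_phrases` and its arguments are hypothetical, and it assumes the text has already been split into tokens.

```python
def protect_phrases(tokens, phrases):
    """Collapse each protected multi-token phrase into one underscore-joined token.

    Illustration only: `phrases` plays the role of the synonym list in the
    thread (e.g. "hepatitis a" -> "hepatitis_a"). Matching is case-insensitive
    and the longest phrase starting at a position wins.
    """
    # Map each phrase, as a tuple of lowercase tokens, to its joined form.
    table = {tuple(p.lower().split()): "_".join(p.lower().split()) for p in phrases}
    max_len = max((len(key) for key in table), default=0)

    out = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate window first, down to two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in table:
                out.append(table[key])
                i += n
                break
        else:
            # No protected phrase starts here; keep the token as-is (lowercased).
            out.append(tokens[i].lower())
            i += 1
    return out


print(protect_phrases("Hepatitis A vaccine".split(), ["hepatitis a"]))
```

Applying the same function to both the indexed text and the query keeps the lone "a" from ever being matched on its own, which is the point of the protected-phrase idea; the cost, as noted elsewhere in the thread, is that the phrase list has to be maintained.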

Re: Index search questions; special cases

2006-11-13 Thread Erik Hatcher
On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote: That reminds me ... i seem to remember someone saying once that Nutch also builds word based n-grams out of its stop words, so searches on the or on won't match anything because those words are never indexed as single tokens, but if a
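
The word-based n-gram idea recalled above can be sketched roughly as follows. This is a simplified Python illustration of the behaviour as described in this thread, not a port of Nutch's CommonGrams.java; the function name `common_grams`, the underscore separator, and the choice to drop common words as standalone tokens are all assumptions made for the example.

```python
def common_grams(tokens, common_words):
    """Emit unigrams for ordinary words and bigrams around common words.

    Assumption (from the thread's description, not from the Nutch source):
    a common word is never emitted on its own, only glued to its neighbours.
    """
    common = {w.lower() for w in common_words}
    toks = [t.lower() for t in tokens]

    out = []
    for i, tok in enumerate(toks):
        # Ordinary words are still indexed as single tokens.
        if tok not in common:
            out.append(tok)
        # Any adjacent pair that involves a common word becomes one gram.
        if i + 1 < len(toks) and (tok in common or toks[i + 1] in common):
            out.append(tok + "_" + toks[i + 1])
    return out


print(common_grams("Hepatitis A virus".split(), {"a"}))
```

With this scheme a query for the phrase "hepatitis a" can be answered from the single gram hepatitis_a, while a query for "a" alone finds nothing, which matches the behaviour described for stop words in the message above.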

Re: Index search questions; special cases

2006-11-13 Thread Otis Gospodnetic
Indeed. CommonGrams.java in Nutch is the place to look. Otis - Original Message From: Erik Hatcher [EMAIL PROTECTED] To: solr-user@lucene.apache.org Sent: Monday, November 13, 2006 2:08:51 PM Subject: Re: Index search questions; special cases On Nov 13, 2006, at 1:51 PM, Chris

Re: Index search questions; special cases

2006-11-13 Thread Michael Imbeault
Hello everyone, Thanks for all your answers; synonym-based approaches won't work because the medical / research field is evolving way too fast; it would become unmaintainable very quickly, and the list would be huge. Anyway, I can't rely on score because I'm sorting by date, so I need to

Index search questions; special cases

2006-11-12 Thread Michael Imbeault
Hello again, - Let's say I index HIV-1 with <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which after parsing by the above filter would yield HIV1 or HIV 1)

Re: Index search questions; special cases

2006-11-12 Thread Chris Hostetter
: - Let's say I index HIV-1 with <filter : class="solr.WordDelimiterFilterFactory" generateWordParts="1" : generateNumberParts="1" catenateWords="1" catenateNumbers="1" : catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which : after parsing by the above filter would yield HIV1 or HIV 1) also

Re: Index search questions; special cases

2006-11-12 Thread Michael Imbeault
Chris Hostetter wrote: A couple of things make your question really hard to answer ... first off, you can specify different analyzer chains for index time and query time -- when dealing with the WordDelim filter (or the synonym filter) this is frequently necessary -- so the answers to your