: Chris, thanks for the tips (or should I say, detailed explanation!). I
: actually got it working! It was a pain at first (never did any java, and
good to know .. glad it worked out for you.
: If Solr is interested in the filter, just tell me (and how I should go
: about contributing it).
The full
CommonGrams itself seems to have some other dependencies on nutch because
of other utilities in the same class, but based on a quick skim, what you
really want is the nested private static class Filter extends
TokenFilter which doesn't really have any external dependencies. If you
extract that
Erik Hatcher wrote:
Yeah, the Nutch code is highly intertwined with its unique configuration
infrastructure and makes it hard to pull pieces of it out like this.
This is a critique that has been heard a lot (mainly because it's true :)
It would be really cool if different camps of lucene could
: Yeah, the Nutch code is highly intertwined with its unique configuration
: infrastructure and makes it hard to pull pieces of it out like this.
that CommonGrams inner Filter class seemed like it could be extracted
easily enough.
: This is a critique that has been heard a lot (mainly because
: : Nutch has phrase pre-filtering which helps with this. It indexes the
: : phrase fragments as separate terms and uses that set of matches to
: : filter the set of matching documents.
: That reminds me ... I seem to remember someone saying once that Nutch also
: builds word based n-grams
On Nov 14, 2006, at 2:00 PM, Chris Hostetter wrote:
CommonGrams itself seems to have some other dependencies on nutch
because
of other utilities in the same class, but based on a quick skim,
what you
really want is the nested private static class Filter extends
TokenFilter which doesn't
On 11/12/06 8:52 PM, Michael Imbeault [EMAIL PROTECTED]
wrote:
Sadly I can't rely on users' smartness for this :) I have concerns that
for stuff like Hepatitis A, it will match just about every document
containing hepatitis and the very common 'a' word, anywhere in the
document. I can't
On 11/13/06, Walter Underwood [EMAIL PROTECTED] wrote:
Another approach is to implement protected phrases, similar to the
protected words in stemming. These would be protected from stopword
processing.
One could use the synonym filter (which can handle multi-token
synonyms) to get this effect.
: Sadly I can't rely on users' smartness for this :) I have concerns that
: for stuff like Hepatitis A, it will match just about every document
: containing hepatitis and the very common 'a' word, anywhere in the
: document. I can't stopword single letters, cause then there would be no
: way
On 11/12/06, Michael Imbeault [EMAIL PROTECTED] wrote:
- Somewhat related : Let's say I index Polymyxin B. If I stopword
single letters, would a phrase search (Polymyxin B) still find the
right documents (I don't think so, but still)? If not, I'll have to
index single letters; how do I prevent
On 11/13/06, Yonik Seeley [EMAIL PROTECTED] wrote:
The SynonymFilter could have the following config:
hepatitis a, hepatitis_a
Oops, the synonyms should be reversed like so:
hepatitis_a, hepatitis a
so that when expand=false for querying, hepatitis a is mapped to hepatitis_a
-Yonik
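
The mapping described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Solr's SynonymFilter: the function names parse_rule and apply_rules are made up for this example, and it assumes that the first entry on a rule line is the target every other entry collapses to (the expand=false behaviour described above) and that tokens are already lowercased.

```python
def parse_rule(line):
    """Parse one rule such as 'hepatitis_a, hepatitis a'.

    Assumption: the first entry is the target token; every later entry
    is a (possibly multi-word) phrase that should be replaced by it.
    """
    entries = [entry.strip() for entry in line.split(",")]
    target = entries[0]
    return {tuple(entry.split()): target for entry in entries[1:]}


def apply_rules(tokens, rules):
    """Replace each matching token sequence with its single target token."""
    # Try longer phrases first so the longest match wins.
    phrases = sorted(rules, key=len, reverse=True)
    out = []
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                out.append(rules[phrase])
                i += len(phrase)
                break
        else:
            # No phrase starts at this position; keep the token as is.
            out.append(tokens[i])
            i += 1
    return out


rules = parse_rule("hepatitis_a, hepatitis a")
print(apply_rules("chronic hepatitis a infection".split(), rules))
print(apply_rules("hepatitis in a patient".split(), rules))
```

Only the adjacent pair is collapsed into hepatitis_a, so a document that merely contains both words somewhere would not produce that token.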
On Nov 13, 2006, at 1:51 PM, Chris Hostetter wrote:
That reminds me ... I seem to remember someone saying once that
Nutch also builds word-based n-grams out of its stop words, so
searches on "the" or "on" won't match anything because those words
are never indexed as single tokens, but if a
Indeed. CommonGrams.java in Nutch is the place to look.
Otis
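
For readers who have not looked at that class, the idea as described in this thread can be sketched roughly as follows. This is a simplified plain-Python illustration and not a port of CommonGrams.java: the function name common_grams, the "_" separator, and the rule that a stop word is never emitted on its own are assumptions taken from the description above, and the real Nutch code may differ.

```python
def common_grams(tokens, stop_words):
    """Emit ordinary words as-is and glue stop words to their neighbours.

    Assumption (from the description in this thread): a stop word is never
    emitted as a single token, only as part of a two-word gram.
    """
    out = []
    for i, tok in enumerate(tokens):
        if tok not in stop_words:
            out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Build a gram whenever either word of the pair is a stop word.
            if tok in stop_words or nxt in stop_words:
                out.append(tok + "_" + nxt)
    return out


stops = {"a", "the", "on"}
print(common_grams(["hepatitis", "a", "virus"], stops))
# A query consisting of a lone stop word yields no tokens at all.
print(common_grams(["the"], stops))
```

Under these assumptions a phrase query for hepatitis a can be matched through the hepatitis_a gram, while the bare word a never reaches the index.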
----- Original Message -----
From: Erik Hatcher [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Monday, November 13, 2006 2:08:51 PM
Subject: Re: Index search questions; special cases
On Nov 13, 2006, at 1:51 PM, Chris
Hello everyone,
Thanks for all your answers; synonyms based approaches won't work
because the medical / research field is evolving way too fast; it would
become unmaintainable very quickly, and the list would be huge. Anyway,
I can't rely on score because I'm sorting by date, so I need to
Hello again,
- Let's say I index HIV-1 with <filter
class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1" catenateNumbers="1"
catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
after parsing by the above filter would yield HIV1 or HIV 1)
: - Let's say I index HIV-1 with <filter
: class="solr.WordDelimiterFilterFactory" generateWordParts="1"
: generateNumberParts="1" catenateWords="1" catenateNumbers="1"
: catenateAll="1"/>. Would a search on HIV AND 1 (or even HIV-1, which
: after parsing by the above filter would yield HIV1 or HIV 1) also
Chris Hostetter wrote:
A couple of things make your question really hard to answer ... first off,
you can specify different analyser chains for index time and query time --
when dealing with the WordDelim filter (or the synonym filter) this is
frequently necessary -- so the answers to your