Re: Multiple passes with WordDelimiterFilterFactory

Erick Erickson Sun, 29 Aug 2010 13:17:51 -0700

There's nothing built into SOLR that I know of that'll deal with
auto-detecting multiple languages and "doing the right thing". I know
there's been discussion of that, so searching the users' list might help...
You may have to write your own analyzer that tries to do this, but I have
no clue how you'd go about it.


<<<charFilters are applied even before the tokenizer>>>
Try putting this after any instances of, say, WhitespaceTokenizerFactory
in your analyzer definition, and I believe you'll see that this is not
true. At least, looking at this in the analysis page from SOLR admin sure
doesn't seem to support that assertion.

This last doesn't help much with the different character sets, though.
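
For what it's worth, here is a rough sketch of the "tokenize first, then
strip leading/trailing punctuation from each token" idea. It is plain
Python for illustration only, not Solr code, and the function names
(is_punct, strip_edge_punct, analyze) are made up for this example. It
assumes that "punctuation" means any character whose Unicode general
category starts with "P", which is one way to avoid listing ASCII
characters by hand and so might carry over to Arabic and Cyrillic text.
Inside SOLR the closest token-level equivalent I'm aware of is
PatternReplaceFilterFactory, but I haven't checked how its regex would
behave on your data.

```python
import unicodedata


def is_punct(ch):
    # Assumption: treat every Unicode "P*" category (Po, Pd, Ps, Pe, Pi, Pf, Pc)
    # as punctuation, regardless of script.
    return unicodedata.category(ch).startswith("P")


def strip_edge_punct(token):
    # Trim punctuation from both ends only; interior characters such as the
    # hyphen in "wolf-biederman" or the apostrophe in "gremlin's" are kept.
    start, end = 0, len(token)
    while start < end and is_punct(token[start]):
        start += 1
    while end > start and is_punct(token[end - 1]):
        end -= 1
    return token[start:end]


def analyze(text):
    # Whitespace tokenization first, then per-token cleanup.
    # Tokens that were nothing but punctuation are dropped.
    tokens = (strip_edge_punct(t) for t in text.split())
    return [t for t in tokens if t]


if __name__ == "__main__":
    print(analyze('"wolf-biederman," (gremlin\'s) -- «привет»'))
```

Because the trimming runs per token after splitting, a pattern that is
anchored to the start and end of its input can't swallow the whole field
the way it might when applied to the raw text before tokenization.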

I'll have to leave any other insights to wiser heads than mine.

Best
Erick

On Sun, Aug 29, 2010 at 12:44 PM, Shawn Heisey <s...@elyograg.org> wrote:

> Thank you for taking the time to help.  The way I've got the word
> delimiter index filter set up with only one pass, "wolf-biederman" will
> result in wolf, biederman, wolfbiederman, and wolf-biederman.  With two
> passes, the last one is not present.  One pass changes "gremlin's" to
> gremlin and gremlin's.  Two passes result in gremlin and gremlins.
>
> I was trying to use the PatternReplaceCharFilterFactory to strip leading
> and trailing punctuation, but it didn't work.  It seems that charFilters are
> applied even before the tokenizer, which will not produce the results I
> want, and the filter I'd come up with was eating everything, producing no
> results.  I later realized that it would not work with radically different
> character sets like Arabic and Cyrillic, even if I solved those problems.
> Is there a regular filter that could strip leading/trailing punctuation?
>
> As for stemming, we have no effective way to separate the languages.  Most
> of the content is English, but we also have Spanish, Arabic, Russian,
> German, French, and possibly a few others.  For that reason, I'm not using
> stemming.  I've been thinking that I might want to use an English stemmer
> anyway to improve results on most of the content, but I haven't done any
> testing yet.
>
> Thanks,
> Shawn
>
>
>
> On 8/29/2010 12:28 PM, Erick Erickson wrote:
>
>> Look at the tokenizer/filter chain that makes up your analyzers, and see:
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> for other tokenizer/analyzer/filter options.
>>
>> You're on the right track looking at the various choices provided, and
>> I suspect you'll find what you need...
>>
>> Be a little cautious about preserving things. Your users will often be
>> more confused than helped if you require hyphens for a match. Ditto with
>> possessives, plurals, etc. You might want to look at stemmers...
>>
>
>
