Re: Fuzzy searching, tildes and solr

Yonik Seeley Fri, 26 Jan 2007 07:58:42 -0800

On 1/26/07, Walter Lewis <[EMAIL PROTECTED]> wrote:

Yonik Seeley wrote:
> +(+text:jame +text:sutherland) +searchSet:testSet
>> +(+text:james~0.75 +text:sutherland~0.75) +searchSet:testSet
>
> I can tell from the first that this is a stemmed field... "james" is
> transformed to "jame"
"James" being the plural of "Jame" according to the stemmer.  I guess my
mind hadn't run in that direction. :)


I guess I wasn't expecting the fuzzy query logic to bypass the
stemming.


I would expect there to be at least as many problems trying to do
stemming on partial or misspelled words.

For a simpler example, consider prefix queries...
If you tried title:a* or title:an* to find titles including anaconda,
and you did full "analysis" of the terms first, they would be removed
as stop words and you would find nothing.
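
To make the prefix case concrete, here is a rough Python sketch. It is a
toy model, not Solr's analysis code: the stop word list and the analyze()
and prefix_search() helpers are made up for illustration.

```python
# Toy illustration only: a made-up stop word list and analyzer,
# not Solr's actual analysis chain.
STOP_WORDS = {"a", "an", "the", "of"}


def analyze(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]


def prefix_search(prefix, indexed_terms, analyze_prefix):
    """Return the indexed terms that start with the prefix.

    If analyze_prefix is true, the prefix goes through analyze() first,
    so a prefix that happens to be a stop word matches nothing.
    """
    if analyze_prefix:
        tokens = analyze(prefix)
        if not tokens:
            return []
        prefix = tokens[0]
    prefix = prefix.lower()
    return sorted(t for t in indexed_terms if t.startswith(prefix))


indexed = set(analyze("The Anaconda of the Amazon"))
print(prefix_search("an", indexed, analyze_prefix=True))   # []
print(prefix_search("an", indexed, analyze_prefix=False))  # ['anaconda']
```

With full analysis the prefix "an" is thrown away before the lookup, which
is why partial terms are better left out of the analysis chain.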

 Would it be correct that if I were to add "james" to the
protwords.txt file that this *specific* problem would go away?


Yes, it should.
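
As a rough sketch of what a protected-words list does, assuming a toy
stemmer that only strips a trailing "s" (this is not the Porter algorithm,
and PROTECTED merely stands in for the contents of protwords.txt):

```python
# Toy stemmer for illustration: strips one trailing "s".
# This is NOT the Porter algorithm, and PROTECTED is a hypothetical
# stand-in for the entries of protwords.txt.
PROTECTED = {"james"}


def toy_stem(term, protected=frozenset()):
    """Return a crude stem, leaving protected terms untouched."""
    term = term.lower()
    if term in protected:
        return term
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


print(toy_stem("James"))             # jame
print(toy_stem("James", PROTECTED))  # james
```

Once the term is protected, the indexed form and the unstemmed query term
agree again for that one word.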

Obviously there is a significant quantity of proper names where this
would have an impact, so a more generic solution is preferable.
> So, you could
> - index the field twice using copyField, and then do fuzzy queries on
> the non-stemmed version. [plus two other good suggestions]
As I look at the field types in the example schema, would you recommend
something like text_lu without the EnglishPorterFilterFactory, or are
there other issues I'm overlooking?


text_lu also has stemming.

The text field types are examples, and you should be customizing your own.
It depends on how you want to "normalize" text.

You could make a new field type by starting with your current
text type and removing the stemmer.

-Yonik
