Re: solr wildcard queries and analyzers

Matti Oinas Tue, 11 Jan 2011 04:20:06 -0800

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers


On wildcard and fuzzy searches, no text analysis is performed on the
search word.
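
One possible workaround, as a minimal sketch rather than a definitive fix,
is to normalize the wildcard term in the client before sending it to Solr.
The Python below lowercases the term and folds accented characters, roughly
imitating what LowerCaseFilterFactory and ASCIIFoldingFilterFactory do at
index time. The function name fold_wildcard_term and the small mapping table
are assumptions made for illustration. The table covers only a few letters
that have no Unicode decomposition and is not a full reimplementation of
Solr's folding rules.

```python
import unicodedata

# Letters that NFKD cannot decompose into base letter + accent.
# Assumed mapping, chosen to match the "sjálfsögðu" -> "sjalfsogdu"
# example from the original question; extend it as needed.
_EXTRA_FOLDS = str.maketrans({"ð": "d", "þ": "th", "æ": "ae"})


def fold_wildcard_term(term: str) -> str:
    """Lowercase a query term and fold accents, leaving * and ? untouched."""
    lowered = term.strip().lower()
    # Split accented letters into base letter + combining mark ...
    decomposed = unicodedata.normalize("NFKD", lowered)
    # ... then drop the combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.translate(_EXTRA_FOLDS)


if __name__ == "__main__":
    print(fold_wildcard_term("Sjálf*"))      # sjalf*
    print(fold_wildcard_term("sjálfsögðu"))  # sjalfsogdu
```

This duplicates a little of the analyzer logic in the application, which is
the trade-off raised in the question below, so it is best kept in one small
helper.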

2011/1/11 Kári Hreinsson <k...@gagnavarslan.is>:
> Hi,
>
> I am having a problem with the fact that no text analysis is performed on
> wildcard queries.  I have the following field type (a bit simplified):
>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer>
>        <tokenizer class="solr.WhitespaceTokenizerFactory" />
>        <filter class="solr.TrimFilterFactory" />
>        <filter class="solr.LowerCaseFilterFactory" />
>        <filter class="solr.ASCIIFoldingFilterFactory" />
>      </analyzer>
>    </fieldType>
>
> My problem has to do with Icelandic characters.  When I index a document with
> a text field including the word "sjálfsögðu" it gets indexed as "sjalfsogdu" 
> (because of the ASCIIFoldingFilterFactory which replaces the Icelandic 
> characters with their English equivalents).  Then, when I search (without a 
> wildcard) for "sjálfsögðu" or "sjalfsogdu" I get that document as a result.  
> This is convenient since it enables people to search without using accented 
> characters and yet get the results they want (e.g. if they are working on 
> computers with English keyboards).
>
> However, this all falls apart with wildcard searches: the search string
> isn't passed through the filters, so when I search for "sjálf*" I don't get
> any results, because the index doesn't contain the original words (I do get
> results if I search for "sjalf*").  I know people have been having a
> similar problem with the case sensitivity of wildcard queries and most often 
> the solution seems to be to lowercase the string before passing it on to 
> solr, which is not exactly an optimal solution (yet a simple one in that 
> case).  The Icelandic characters complicate things a bit and applying the 
> same solution (doing the lowercasing and character mapping) in my application 
> seems like unnecessary duplication of code already part of solr, not to 
> mention complication of my application and possible maintenance down the road.
>
> Is there any way around this?  How are people solving this?  Is there a way 
> to apply the filters to wildcard queries?  I guess removing the 
> ASCIIFoldingFilterFactory is the simplest "solution" but this "normalization" 
> (of the text done by the filter) is often very useful.
>
> I hope I'm not overlooking some obvious explanation. :/
>
> Thanks in advance,
> Kári Hreinsson
>
