Re: solr wildcard queries and analyzers

Matti Oinas Tue, 11 Jan 2011 04:26:18 -0800

Sorry, the message was not meant to be sent here. We are struggling
with the same problem here.
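
As a stopgap, the lowercasing and character mapping described in the quoted
message below can be approximated in the client before the wildcard term is
sent to Solr. Here is a minimal Python sketch of that idea. The names
fold_term and EXTRA_FOLDS are hypothetical, and the table covers only a few
letters, so treat it as an assumption that the output matches what
ASCIIFoldingFilterFactory produces for any given input.

```python
import unicodedata

# Letters with no Unicode decomposition, so NFKD leaves them unchanged.
# Illustrative subset only (an assumption, not the filter's full table).
EXTRA_FOLDS = {"ð": "d", "þ": "th", "æ": "ae"}


def fold_term(term: str) -> str:
    """Lowercase a single query term and fold accented letters to ASCII.

    Wildcard characters such as '*' and '?' pass through untouched.
    """
    lowered = term.lower()
    # Split accented letters into base letter + combining mark (á -> a + U+0301).
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map the remaining special letters that NFKD cannot decompose.
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)


if __name__ == "__main__":
    print(fold_term("sjálf*"))
    print(fold_term("sjálfsögðu"))
```

This should be applied to individual terms rather than to a whole query
string, since lowercasing would also change operators such as AND and OR.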


2011/1/11 Matti Oinas <matti.oi...@gmail.com>:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
>
> On wildcard and fuzzy searches, no text analysis is performed on the
> search word.
>
> 2011/1/11 Kári Hreinsson <k...@gagnavarslan.is>:
>> Hi,
>>
>> I am having a problem with the fact that no text analysis is performed on
>> wildcard queries.  I have the following field type (a bit simplified):
>>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>      <analyzer>
>>        <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>        <filter class="solr.TrimFilterFactory" />
>>        <filter class="solr.LowerCaseFilterFactory" />
>>        <filter class="solr.ASCIIFoldingFilterFactory" />
>>      </analyzer>
>>    </fieldType>
>>
>> My problem has to do with Icelandic characters.  When I index a document with
>> a text field including the word "sjálfsögðu", it gets indexed as "sjalfsogdu"
>> (because of the ASCIIFoldingFilterFactory, which replaces the Icelandic
>> characters with their ASCII equivalents).  Then, when I search (without a
>> wildcard) for "sjálfsögðu" or "sjalfsogdu", I get that document as a result.
>> This is convenient since it enables people to search without using accented
>> characters and yet get the results they want (e.g. if they are working on
>> computers with English keyboards).
>>
>> However, this all falls apart with wildcard searches: the search string
>> isn't passed through the filters, so even if I search for "sjálf*" I
>> don't get any results, because the index doesn't contain the original words
>> (I do get results if I search for "sjalf*").  I know people have been having
>> a similar problem with the case sensitivity of wildcard queries, and most
>> often the solution seems to be to lowercase the string before passing it on
>> to Solr, which is not exactly an optimal solution (though a simple one in
>> that case).  The Icelandic characters complicate things a bit, and applying
>> the same solution (doing the lowercasing and character mapping) in my
>> application seems like unnecessary duplication of code that is already part
>> of Solr, not to mention the added complexity in my application and possible
>> maintenance down the road.
>>
>> Is there any way around this?  How are people solving this?  Is there a way 
>> to apply the filters to wildcard queries?  I guess removing the 
>> ASCIIFoldingFilterFactory is the simplest "solution" but this 
>> "normalization" (of the text done by the filter) is often very useful.
>>
>> I hope I'm not overlooking some obvious explanation. :/
>>
>> Thanks in advance,
>> Kári Hreinsson
>>
>
