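This might be the solution: the contrib AnalyzingQueryParser, which also
passes wildcard, prefix, and fuzzy query terms through the analyzer.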

http://lucene.apache.org/java/3_0_2/api/contrib-misc/org/apache/lucene/queryParser/analyzing/AnalyzingQueryParser.html
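
If switching the query parser is not an option, the client-side workaround
mentioned in the original message (lowercasing and character mapping before
the wildcard term is sent to Solr) could look roughly like the sketch below.
It uses only the JDK. The class name WildcardFolding and the method fold are
made up for illustration, and the mappings for thorn and the ae ligature are
my assumption about what ASCIIFoldingFilter produces; only the eth-to-d
mapping is confirmed by the "sjalfsogdu" example in the thread.

```java
import java.text.Normalizer;
import java.util.Locale;

public class WildcardFolding {

    // Hypothetical helper: lowercases a wildcard term and strips accents so
    // that it matches terms indexed through LowerCaseFilter + ASCIIFoldingFilter.
    // Wildcard characters such as '*' and '?' pass through unchanged.
    static String fold(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        StringBuilder mapped = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            switch (c) {
                // These letters have no Unicode decomposition, so they are
                // mapped by hand (assumed to mirror ASCIIFoldingFilter).
                case '\u00F0': mapped.append('d');  break; // eth
                case '\u00FE': mapped.append("th"); break; // thorn
                case '\u00E6': mapped.append("ae"); break; // ae ligature
                default:       mapped.append(c);
            }
        }
        // Decompose accented letters into base letter + combining mark,
        // then drop the combining marks.
        String decomposed = Normalizer.normalize(mapped, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "sj\u00E1lf*" is the wildcard query from the original message.
        System.out.println(fold("sj\u00E1lf*")); // prints sjalf*
    }
}
```

This covers only a handful of characters, whereas ASCIIFoldingFilter handles
many more, so it duplicates the analysis chain only approximately. That is the
maintenance cost the original message is worried about.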

2011/1/11 Matti Oinas <matti.oi...@gmail.com>:
> Sorry, the message was not meant to be sent here. We are struggling
> with the same problem here.
>
> 2011/1/11 Matti Oinas <matti.oi...@gmail.com>:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
>>
>> On wildcard and fuzzy searches, no text analysis is performed on the
>> search word.
>>
>> 2011/1/11 Kári Hreinsson <k...@gagnavarslan.is>:
>>> Hi,
>>>
>>> I am having a problem with the fact that no text analysis is performed on
>>> wildcard queries.  I have the following field type (a bit simplified):
>>>    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>>>      <analyzer>
>>>        <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>        <filter class="solr.TrimFilterFactory" />
>>>        <filter class="solr.LowerCaseFilterFactory" />
>>>        <filter class="solr.ASCIIFoldingFilterFactory" />
>>>      </analyzer>
>>>    </fieldType>
>>>
>>> My problem has to do with Icelandic characters.  When I index a document
>>> with a text field including the word "sjálfsögðu" it gets indexed as 
>>> "sjalfsogdu" (because of the ASCIIFoldingFilterFactory which replaces the 
>>> Icelandic characters with their English equivalents).  Then, when I search 
>>> (without a wildcard) for "sjálfsögðu" or "sjalfsogdu" I get that document 
>>> as a result.  This is convenient since it enables people to search without 
>>> using accented characters and yet get the results they want (e.g. if they 
>>> are working on computers with English keyboards).
>>>
>>> However, this all falls apart with wildcard searches: the search string
>>> isn't passed through the filters, so a search for "sjálf*" returns no
>>> results because the index contains only the folded terms (I do get results
>>> if I search for "sjalf*").  I know people have been
>>> having a similar problem with the case sensitivity of wildcard queries and 
>>> most often the solution seems to be to lowercase the string before passing 
>>> it on to Solr, which is not exactly an optimal solution (yet a simple one
>>> in that case).  The Icelandic characters complicate things a bit, and
>>> applying the same solution (doing the lowercasing and character mapping) in
>>> my application seems like unnecessary duplication of code already part of
>>> Solr, not to mention the added complexity in my application and possible
>>> maintenance down the road.
>>>
>>> Is there any way around this?  How are people solving this?  Is there a way 
>>> to apply the filters to wildcard queries?  I guess removing the 
>>> ASCIIFoldingFilterFactory is the simplest "solution" but this 
>>> "normalization" (of the text done by the filter) is often very useful.
>>>
>>> I hope I'm not overlooking some obvious explanation. :/
>>>
>>> Thanks in advance,
>>> Kári Hreinsson
>>>
>>
>
