http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#Analyzers
On wildcard and fuzzy searches, no text analysis is performed on the search word.

2011/1/11 Kári Hreinsson <k...@gagnavarslan.is>:
> Hi,
>
> I am having a problem with the fact that no text analysis is performed on
> wildcard queries. I have the following field type (a bit simplified):
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.TrimFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ASCIIFoldingFilterFactory" />
>   </analyzer>
> </fieldType>
>
> My problem has to do with Icelandic characters. When I index a document with
> a text field including the word "sjálfsögðu" it gets indexed as "sjalfsogdu"
> (because of the ASCIIFoldingFilterFactory, which replaces the Icelandic
> characters with their English equivalents). Then, when I search (without a
> wildcard) for "sjálfsögðu" or "sjalfsogdu" I get that document as a result.
> This is convenient since it enables people to search without using accented
> characters and yet get the results they want (e.g. if they are working on
> computers with English keyboards).
>
> However, this all falls apart with wildcard searches: the search string
> isn't passed through the filters, so even if I search for "sjálf*" I don't
> get any results, because the index doesn't contain the original words (I do
> get results if I search for "sjalf*"). I know people have been having a
> similar problem with the case sensitivity of wildcard queries, and most often
> the solution seems to be to lowercase the string before passing it on to
> Solr, which is not exactly an optimal solution (yet a simple one in that
> case). The Icelandic characters complicate things a bit, and applying the
> same solution (doing the lowercasing and character mapping) in my application
> seems like unnecessary duplication of code that is already part of Solr, not
> to mention the added complexity in my application and possible maintenance
> down the road.
>
> Is there any way around this? How are people solving this? Is there a way
> to apply the filters to wildcard queries? I guess removing the
> ASCIIFoldingFilterFactory is the simplest "solution", but this "normalization"
> (of the text done by the filter) is often very useful.
>
> I hope I'm not overlooking some obvious explanation. :/
>
> Thanks in advance,
> Kári Hreinsson
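For reference, the application-side workaround described in the quoted message (lowercasing and folding the wildcard term before it is sent to Solr) could look roughly like the Python sketch below. It only approximates what LowerCaseFilterFactory and ASCIIFoldingFilterFactory do at index time. The function name fold_wildcard_term and the EXTRA_FOLDS table are hypothetical, and the table covers just a few Icelandic letters, not the filter's full mapping, so check the results against your own index before relying on it.

```python
import unicodedata

# Hypothetical table (an assumption, not taken from Solr): letters that have
# no Unicode decomposition, so stripping combining marks alone leaves them
# unchanged. The targets are what ASCIIFoldingFilter is understood to emit;
# verify them against your Solr version.
EXTRA_FOLDS = {"ð": "d", "þ": "th", "æ": "ae"}


def fold_wildcard_term(term: str) -> str:
    """Lowercase a query term and fold accented letters to ASCII.

    Wildcard characters such as '*' and '?' pass through untouched.
    """
    lowered = term.lower()
    # NFKD splits e.g. 'á' into 'a' plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Map the remaining letters that did not decompose.
    return "".join(EXTRA_FOLDS.get(ch, ch) for ch in stripped)


print(fold_wildcard_term("sjálf*"))      # sjalf*
print(fold_wildcard_term("Sjálfsögðu"))  # sjalfsogdu
```

The folded term is then used in the wildcard query in place of the user's input. This is exactly the duplication the original poster wants to avoid, so it is a stopgap rather than a fix.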