> > 1) hyphens - if user types "ema" or "e-ma" I want to
> > suggest "email"
> >
> > 2) accents - if user types "herme" want to suggest
> > "Hermès"
>
> Accents can be removed by using MappingCharFilterFactory
> before the tokenizer (at both index and query time):
>
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
>
> I am not sure if this is the most elegant solution, but you can
> replace - with "" using MappingCharFilterFactory too. It
> satisfies what you describe in 1.
>
> But generally NGramFilterFactory produces a lot of tokens.
> I mean the query "er" can return "hermes". Maybe
> EdgeNGramFilterFactory is more suitable for an
> auto-complete task. At least it guarantees that some word
> starts with that character sequence.

Thanks. I agree with the issues with NGramFilterFactory you pointed out and I really want to avoid using it. But the problem is that I have Chinese tags like "电吉他" and multi-lingual tags like "electric吉他". For tags like that WhitespaceTokenizerFactory wouldn't work. And if I use ChineseFilterFactory, would it recognize that the "electric" in "electric吉他" isn't Chinese and shouldn't be split into individual characters? Any ideas here are greatly appreciated.

On a related matter, I checked out http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-tree.html and saw that there are:

EdgeNGramFilterFactory & EdgeNGramTokenizerFactory
NGramFilterFactory & NGramTokenizerFactory

What are the differences between *FilterFactory and *TokenizerFactory? In my case, which one should I be using?

Thanks.
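
P.S. To make sure I understand the suggestion, here is a rough sketch of the field type I think it describes. The field type name, the gram sizes, and the custom mapping file name are my own placeholders, not something from this thread, and I have not tested this. It also does nothing yet for the Chinese and mixed-language tags.

```xml
<!-- Sketch only: "text_autocomplete", the gram sizes and
     "mapping-autocomplete.txt" are hypothetical placeholders. -->
<fieldType name="text_autocomplete" class="solr.TextField">
  <analyzer type="index">
    <!-- Strip accents before tokenizing, e.g. "Hermès" -> "Hermes" -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Assumed custom mapping file containing the line:  "-" => ""
         so that "e-mail" is indexed as "email" -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-autocomplete.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Prefix grams only: "email" -> e, em, ema, emai, email -->
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same char filters at query time, but no n-gramming of the query -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-autocomplete.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

If I read this right, typing "ema", "e-ma" or "herme" would match the prefix grams of "email" and "Hermès", while "er" would no longer match "hermes".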