Re: NGramFilterFactory for auto-complete that matches the middle of multi-lingual tags?

Andy Mon, 04 Oct 2010 06:44:52 -0700

> > 1) hyphens - if user types "ema" or "e-ma" I want to
> > suggest "email"
> > 
> > 2) accents - if user types "herme" I want to suggest
> > "Hermès"
> 
> Accents can be removed by using MappingCharFilterFactory
> before the tokenizer (at both index and query time).
> 
> <charFilter class="solr.MappingCharFilterFactory"
> mapping="mapping-ISOLatin1Accent.txt"/>
> 
> I am not sure if this is the most elegant solution, but you can
> replace - with "" using MappingCharFilterFactory too. It
> satisfies what you describe in 1.
> 
> But generally NGramFilterFactory produces a lot of tokens.
> I mean the query "er" can return "hermes". Maybe
> EdgeNGramFilterFactory would be more suitable for an
> auto-complete task. At least it guarantees that some word
> starts with that character sequence.
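
If I read the suggestion correctly, it amounts to something like the sketch
below. This is only my guess at the setup, not a tested config: the field type
name "tag_prefix" and the file mapping-hyphen.txt are made up for illustration,
and the gram sizes are arbitrary.

```xml
<!-- Sketch only. "tag_prefix" and mapping-hyphen.txt are hypothetical names. -->
<fieldType name="tag_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Strip accents before tokenizing (è becomes e) -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <!-- Hypothetical file holding the single rule: "-" => "" -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-hyphen.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every prefix of each word: e, em, ema, emai, email -->
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="1" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <!-- Same normalization, but no grams: the typed text is matched as-is -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-hyphen.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With that, "ema" and "e-ma" should both match "email", and "herme" should match
"Hermès", since the hyphen and the accent are gone before the prefixes are
generated.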


Thanks.

I agree with the issues with NGramFilterFactory you pointed out and I really 
want to avoid using it. But the problem is that I have Chinese tags like "电吉他" 
and multi-lingual tags like "electric吉他".

For tags like that, WhitespaceTokenizerFactory wouldn't work. And if I use
ChineseFilterFactory, would it recognize that the "electric" in "electric吉他"
isn't Chinese and shouldn't be split into individual characters?
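
For reference, the only setup I can think of that handles both kinds of tags is
the one I was hoping to avoid: keep the whole tag as a single token and index
all of its substrings. A rough sketch, where the field type name "tag_infix"
and the gram sizes are just placeholders I picked:

```xml
<!-- Sketch only. "tag_infix" is a hypothetical name; gram sizes are arbitrary. -->
<fieldType name="tag_infix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Keep the whole tag as one token, so no language-specific splitting -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index every substring of 2 to 15 characters -->
    <filter class="solr.NGramFilterFactory"
            minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- The typed text is lowercased and matched against the indexed grams -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As far as I can tell, this would let "吉他" match both "电吉他" and
"electric吉他", but it has exactly the noise problem you described: "tr" would
match "electric吉他" as well.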

Any ideas here are greatly appreciated.

On a related matter, I checked out
http://lucene.apache.org/solr/api/org/apache/solr/analysis/package-tree.html
and saw that there are:

EdgeNGramFilterFactory & EdgeNGramTokenizerFactory
NGramFilterFactory & NGramTokenizerFactory

What are the differences between *FilterFactory and *TokenizerFactory? In my 
case which one should I be using?

Thanks.
