I want to provide autocomplete to users when they're entering tags. The 
autocomplete suggestions would be based on tags that are already in the 
system.

Multiple tags are separated by commas. A single tag could contain multiple 
words such as "Apple computer".

One issue is that a tag could be in multiple languages, including both 
languages that use whitespace as a word separator (e.g. English, French) and 
languages that don't (e.g. CJK).

An example of such a multi-lingual tag is "Apple 电脑".

If a user types "apple", I'd like the autocomplete suggestions to include both 
"Apple computer" (i.e. matches are case-insensitive) and "green apple" (i.e. 
matches aren't restricted to prefixes). Similarly, a user typing "电脑" should 
get "Apple 电脑" as a suggestion.

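To make the requirement concrete, here is a minimal Python sketch of the 
matching behaviour I'm after. It is not Solr code and says nothing about how 
an analyzer would tokenize the tags; the function name suggest and the sample 
tag list are made up purely for illustration.

```python
def suggest(fragment, tags):
    """Return every tag that contains `fragment`, ignoring case.

    Hypothetical helper: a plain substring scan, used only to show
    which suggestions are expected for a given input.
    """
    needle = fragment.strip().casefold()
    if not needle:
        return []
    return [tag for tag in tags if needle in tag.casefold()]


# Made-up sample of tags already in the system.
existing_tags = ["Apple computer", "green apple", "Apple 电脑", "Banana"]

# Tags are comma-separated, so only the text after the last comma
# is the tag currently being completed.
typed = "fruit, apple"
current = typed.split(",")[-1]

print(suggest(current, existing_tags))
print(suggest("电脑", existing_tags))
```

A linear scan like this obviously won't scale to a large tag set, which is 
why I'm looking at doing the equivalent with analyzers in the index.
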
Is it possible to do that? I read the article:
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

In that article KeywordTokenizerFactory is used. If I changed it to 
CJKTokenizer, would that work?

With an input of "Apple 电脑", what would CJKTokenizer produce?

- Is it "Apple", "电", "脑"?
or
- Is it "A", "p", "p", "l", "e", "电", "脑"?

Any help would be greatly appreciated.

Andy


