I want to provide auto-complete to users when they're inputting tags. The auto-complete tag suggestions would be based on tags that are already in the system.
Multiple tags are separated by commas, and a single tag can contain multiple words, such as "Apple computer". One complication is that a tag can mix languages, including languages that use whitespace as a word separator (e.g. English, French) and languages that don't (e.g. CJK). An example of such a multilingual tag is "Apple 电脑".

If a user types "apple", I'd like the autocomplete suggestions to include both "Apple computer" (i.e. matches are case-insensitive) and "green apple" (i.e. matches aren't restricted to prefixes). A user typing "电脑" should match "Apple 电脑". Is it possible to do that?

I read this article: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/

In that article, KeywordTokenizerFactory is used. If I changed it to CJKTokenizer, would that work? With an input of "Apple 电脑", what would CJKTokenizer produce?
- Is it "Apple", "电", "脑"?
- Or is it "A", "p", "p", "l", "e", "电", "脑"?

Any help would be greatly appreciated.

Andy
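
P.S. To make the matching behaviour I'm after concrete, here is a minimal sketch in plain Python. It is not Solr code and says nothing about how any tokenizer works. The function name `suggest` and the sample tag list are made up for illustration, and I'm assuming simple case-insensitive substring matching, which is looser than matching only at word starts.

```python
def suggest(typed, tags):
    """Return the tags that contain the typed text, ignoring case.

    Hypothetical helper: it only illustrates the desired results,
    not how an index would compute them.
    """
    needle = typed.strip().casefold()
    if not needle:
        return []
    # Plain substring test, so a match may start anywhere in the tag.
    return [tag for tag in tags if needle in tag.casefold()]


# Made-up sample of tags already in the system.
tags = ["Apple computer", "green apple", "Apple 电脑", "banana"]

print(suggest("apple", tags))  # ['Apple computer', 'green apple', 'Apple 电脑']
print(suggest("电脑", tags))    # ['Apple 电脑']
```

Note that a substring test like this would also match a tag such as "pineapple" for the input "apple". Matching only at the start of each word would be fine for my purposes too, as long as the CJK case still works.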