So, this is the normal N-gram one? NGramTokenizerFactory

Digging deeper - there are actually CJK and Chinese tokenizers in the Solr codebase:
http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html

The CJK one uses the Lucene CJKTokenizer:

http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html

and there seems to be yet another one that no one has wrapped into Solr:

http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html

So it seems the existing options are a little better than I thought, though it would be nice to have some docs on properly configuring these.

-Peter

On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic <otis_gospodne...@yahoo.com> wrote:
> Peter,
>
> For CJK and n-grams, I think you don't want the *Edge* n-grams, but just
> n-grams.
> Before you take the n-gram route, you may want to look at the smart Chinese
> analyzer in Lucene contrib (I think it works only for Simplified Chinese) and
> Sen (on java.net). I also spotted a Korean analyzer in the wild a few months
> back.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Peter Wolanin <peter.wola...@acquia.com>
>> To: solr-user@lucene.apache.org
>> Sent: Tue, November 10, 2009 4:06:52 PM
>> Subject: any docs on solr.EdgeNGramFilterFactory?
>>
>> This fairly recent blog post:
>>
>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>
>> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
>> for the index. I don't see any mention of that tokenizer on the Solr
>> wiki - is it just waiting to be added, or is there any other
>> documentation in addition to the blog post?
>> In particular, there was a thread last year about using an N-gram
>> tokenizer to enable reasonable (if not ideal) searching of CJK text,
>> so I'd be curious to know how people are configuring their schema
>> (with this tokenizer?) for that use case.
>>
>> Thanks,
>>
>> Peter
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist, Acquia. Inc.
>> peter.wola...@acquia.com

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
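[Editor's note] To make the CJK discussion above concrete, a schema.xml field type using the tokenizers mentioned might look like the sketch below. This is illustrative only and untested; the type names (text_cjk, text_cjk_ngram) are invented for the example, and the bigram sizes are just a common starting point, following Otis's advice to use plain (not Edge) n-grams for CJK.

```xml
<!-- Sketch: CJK field type using the contrib CJKTokenizer, which emits
     overlapping character bigrams for CJK input. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Sketch: alternative using the plain NGramTokenizerFactory; bigrams
     (min = max = 2) are a typical choice for CJK text. -->
<fieldType name="text_cjk_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="2"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Either type would then be referenced from a `<field ... type="text_cjk"/>` declaration; which analyzer gives better precision/recall for a given corpus is exactly the kind of thing the thread notes is underdocumented.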
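[Editor's note] On the original auto-suggest question: solr.EdgeNGramFilterFactory is a token filter, not a tokenizer, and is typically applied at index time only, roughly as in the Lucid Imagination blog post linked above. A hedged sketch (the field type name and gram sizes are invented for illustration):

```xml
<!-- Sketch: auto-suggest field. Edge n-grams are generated at index time
     so that a raw prefix typed by the user matches at query time. -->
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note the asymmetry: the query analyzer deliberately omits the edge n-gram filter, so "lu" matches the indexed grams of "lucene" without the query itself being expanded.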