Peter, here is a project that does this: http://issues.apache.org/jira/browse/LUCENE-1488
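For the CJK case specifically, a minimal schema.xml field type using the CJKTokenizerFactory mentioned below might look something like this (an untested sketch - the field type name and the extra lowercase filter are just illustrative, not a recommendation):

```xml
<!-- Sketch of a CJK-friendly field type for Solr 1.4-era factories.
     CJKTokenizer emits overlapping bigrams for runs of CJK characters
     and whole tokens for Latin text. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```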
> That's kind of interesting - in general can I build a custom tokenizer
> from existing tokenizers that treats different parts of the input
> differently based on the utf-8 range of the characters? E.g. use a
> porter stemmer for stretches of Latin text and n-gram or something
> else for CJK?
>
> -Peter
>
> On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
> > Yes, that's the n-gram one. I believe the existing CJK one in Lucene is
> > really just an n-gram tokenizer, so no different than the normal
> > n-gram tokenizer.
> >
> > Otis
> > --
> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >
> > ----- Original Message ----
> >> From: Peter Wolanin <peter.wola...@acquia.com>
> >> To: solr-user@lucene.apache.org
> >> Sent: Tue, November 10, 2009 7:34:37 PM
> >> Subject: Re: any docs on solr.EdgeNGramFilterFactory?
> >>
> >> So, this is the normal N-gram one? NGramTokenizerFactory
> >>
> >> Digging deeper - there are actually CJK and Chinese tokenizers in the
> >> Solr codebase:
> >>
> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html
> >>
> >> The CJK one uses the Lucene CJKTokenizer:
> >> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> >>
> >> and there seems to be another one that no one has wrapped into Solr yet:
> >> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html
> >>
> >> So it seems like the existing options are a little better than I thought,
> >> though it would be nice to have some docs on properly configuring these.
> >>
> >> -Peter
> >>
> >> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic wrote:
> >> > Peter,
> >> >
> >> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but
> >> > just n-grams.
> >> > Before you take the n-gram route, you may want to look at the smart
> >> > Chinese analyzer in Lucene contrib (I think it works only for
> >> > Simplified Chinese) and Sen (on java.net). I also spotted a Korean
> >> > analyzer in the wild a few months back.
> >> >
> >> > Otis
> >> > --
> >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >> >
> >> > ----- Original Message ----
> >> >> From: Peter Wolanin
> >> >> To: solr-user@lucene.apache.org
> >> >> Sent: Tue, November 10, 2009 4:06:52 PM
> >> >> Subject: any docs on solr.EdgeNGramFilterFactory?
> >> >>
> >> >> This fairly recent blog post:
> >> >>
> >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> >> >>
> >> >> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
> >> >> for the index. I don't see any mention of that tokenizer on the Solr
> >> >> wiki - is it just waiting to be added, or is there any other
> >> >> documentation in addition to the blog post? In particular, there was
> >> >> a thread last year about using an N-gram tokenizer to enable
> >> >> reasonable (if not ideal) searching of CJK text, so I'd be curious to
> >> >> know how people are configuring their schema (with this tokenizer?)
> >> >> for that use case.
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Peter
> >> >>
> >> >> --
> >> >> Peter M. Wolanin, Ph.D.
> >> >> Momentum Specialist, Acquia. Inc.
> >> >> peter.wola...@acquia.com
> >> >
> >>
> >> --
> >> Peter M. Wolanin, Ph.D.
> >> Momentum Specialist, Acquia. Inc.
> >> peter.wola...@acquia.com
> >
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia. Inc.
> peter.wola...@acquia.com

--
Robert Muir
rcm...@gmail.com
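For the auto-suggest use case raised at the start of the quoted thread, a minimal field type along the lines of the blog post might look like this (again an untested sketch - the gram sizes and field type name are illustrative, not taken from the post):

```xml
<!-- Sketch of an auto-suggest field type using EdgeNGramFilterFactory.
     Prefixes are generated only at index time; queries match as typed. -->
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- indexes "s", "so", "sol", ... up to 25 characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```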