Peter, here is a project that does this:
http://issues.apache.org/jira/browse/LUCENE-1488
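
To illustrate the idea in your question below (split the input into script
runs first, then hand each run to a different analysis chain), here is a
minimal standalone sketch using only Java's built-in Character.UnicodeBlock.
To be clear: this is just an illustration of the run-splitting step, not the
actual LUCENE-1488 implementation, and the class/method names are made up:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split input into maximal runs of "CJK" vs "other" characters,
// so each run could be handed to a different tokenizer (e.g. a stemmer
// for Latin runs, an n-gram tokenizer for CJK runs).
public class ScriptRuns {

    // Rough CJK test based on a few common Unicode blocks; a real
    // implementation would cover many more blocks (and use scripts,
    // not blocks).
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    /** Splits text into maximal runs that are uniformly CJK or uniformly non-CJK. */
    static List<String> runs(String text) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= text.length(); i++) {
            // Close the current run at end-of-input or when the script class flips.
            if (i == text.length() || isCjk(text.charAt(i)) != isCjk(text.charAt(start))) {
                out.add(text.substring(start, i));
                start = i;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runs("hello世界world")); // prints [hello, 世界, world]
    }
}
```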


> That's kind of interesting - in general can I build a custom tokenizer
> from existing tokenizers that treats different parts of the input
> differently based on the Unicode range of the characters?  E.g. use a
> porter stemmer for stretches of Latin text and n-gram or something
> else for CJK?
>
> -Peter
>
> On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic
> <otis_gospodne...@yahoo.com> wrote:
> > Yes, that's the n-gram one.  I believe the existing CJK one in Lucene is
> > really just an n-gram tokenizer, so no different than the normal n-gram
> > tokenizer.
> >
> > Otis
> > --
> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >
> >
> >
> > ----- Original Message ----
> >> From: Peter Wolanin <peter.wola...@acquia.com>
> >> To: solr-user@lucene.apache.org
> >> Sent: Tue, November 10, 2009 7:34:37 PM
> >> Subject: Re: any docs on solr.EdgeNGramFilterFactory?
> >>
> >> So, this is the normal N-gram one?  NGramTokenizerFactory
> >>
> >> Digging deeper - there are actually CJK and Chinese tokenizers in the
> >> Solr codebase:
> >>
> >>
> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
> >>
> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html
> >>
> >> The CJK one uses the lucene CJKTokenizer
> >>
> >> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html
> >>
> >> and there even seems to be another one that no one has wrapped into
> >> Solr:
> >>
> >> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html
> >>
> >> So seems like the existing options are a little better than I thought,
> >> though it would be nice to have some docs on properly configuring
> >> these.
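
Inline: to make this concrete, a minimal schema.xml field type wiring up the
CJK tokenizer might look roughly like the following (the fieldType name is
just an example):

```xml
<!-- Example field type using the bigram-based CJK tokenizer -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```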
> >>
> >> -Peter
> >>
> >> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic
> >> wrote:
> >> > Peter,
> >> >
> >> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but
> >> > just n-grams.
> >> > Before you take the n-gram route, you may want to look at the smart
> >> > Chinese analyzer in Lucene contrib (I think it works only for
> >> > Simplified Chinese) and Sen (on java.net).  I also spotted a Korean
> >> > analyzer in the wild a few months back.
> >> >
> >> > Otis
> >> > --
> >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
> >> >
> >> >
> >> >
> >> > ----- Original Message ----
> >> >> From: Peter Wolanin
> >> >> To: solr-user@lucene.apache.org
> >> >> Sent: Tue, November 10, 2009 4:06:52 PM
> >> >> Subject: any docs on solr.EdgeNGramFilterFactory?
> >> >>
> >> >> This fairly recent blog post:
> >> >>
> >> >>
> >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
> >> >>
> >> >> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
> >> >> for the index.  I don't see any mention of that tokenizer on the Solr
> >> >> wiki - is it just waiting to be added, or is there any other
> >> >> documentation in addition to the blog post?  In particular, there was
> >> >> a thread last year about using an N-gram tokenizer to enable
> >> >> reasonable (if not ideal) searching of CJK text, so I'd be curious to
> >> >> know how people are configuring their schema (with this tokenizer?)
> >> >> for that use case.
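
Inline: for reference, the auto-suggest setup from that blog post is roughly
along these lines (field type name and gram sizes are illustrative; note that
EdgeNGram is a token *filter* and is applied only at index time, so queries
match against the stored prefixes):

```xml
<!-- Example auto-suggest field: edge n-grams at index time only -->
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```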
> >> >>
> >> >> Thanks,
> >> >>
> >> >> Peter
> >> >>
> >> >> --
> >> >> Peter M. Wolanin, Ph.D.
> >> >> Momentum Specialist,  Acquia. Inc.
> >> >> peter.wola...@acquia.com
> >> >
> >> >
> >>
> >>
> >>
> >> --
> >> Peter M. Wolanin, Ph.D.
> >> Momentum Specialist,  Acquia. Inc.
> >> peter.wola...@acquia.com
> >
> >
>
>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist,  Acquia. Inc.
> peter.wola...@acquia.com
>




-- 
Robert Muir
rcm...@gmail.com
