Thanks for the link - there doesn't seem to be a fix version specified,
so I guess this won't officially ship with Lucene 2.9?

-Peter

On Wed, Nov 11, 2009 at 10:36 PM, Robert Muir <rcm...@gmail.com> wrote:
> Peter, here is a project that does this:
> http://issues.apache.org/jira/browse/LUCENE-1488
>
>
>> That's kind of interesting - in general, can I build a custom tokenizer
>> from existing tokenizers that treats different parts of the input
>> differently based on the Unicode range of the characters?  E.g., use a
>> Porter stemmer for stretches of Latin text and n-grams or something
>> else for CJK?
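>>
>> Roughly, the classification step I have in mind would be something like
>> this (just a sketch; the class and method names are made up, not an
>> existing Lucene or Solr API):

```java
// Sketch only: classify characters by Unicode block so a custom tokenizer
// could route CJK runs to an n-gram strategy and Latin runs to a stemmer.
// ScriptSniffer is a made-up name, not part of Lucene or Solr.
public class ScriptSniffer {
    public static boolean isCjk(char c) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(c);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.HIRAGANA
            || block == Character.UnicodeBlock.KATAKANA
            || block == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    public static void main(String[] args) {
        System.out.println(isCjk('中')); // CJK ideograph -> true
        System.out.println(isCjk('a')); // Basic Latin -> false
    }
}
```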
>>
>> -Peter
>>
>> On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic
>> <otis_gospodne...@yahoo.com> wrote:
>> > Yes, that's the n-gram one.  I believe the existing CJK one in Lucene is
>> really just an n-gram tokenizer, so no different than the normal n-gram
>> tokenizer.
>> >
>> > Otis
>> > --
>> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >
>> >
>> >
>> > ----- Original Message ----
>> >> From: Peter Wolanin <peter.wola...@acquia.com>
>> >> To: solr-user@lucene.apache.org
>> >> Sent: Tue, November 10, 2009 7:34:37 PM
>> >> Subject: Re: any docs on solr.EdgeNGramFilterFactory?
>> >>
>> >> So, this is the normal N-gram one?  NGramTokenizerFactory
>> >>
>> >> Digging deeper - there are actually CJK and Chinese tokenizers in the
>> >> Solr codebase:
>> >>
>> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
>> >>
>> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html
>> >>
>> >> The CJK one uses the Lucene CJKTokenizer:
>> >>
>> >> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html
>> >>
>> >> and there even seems to be another one that no one has wrapped into Solr:
>> >>
>> >> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html
>> >>
>> >> So it seems like the existing options are a little better than I thought,
>> >> though it would be nice to have some docs on properly configuring
>> >> these.
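>> >>
>> >> For anyone else following along, I'd guess wiring the CJK one into a
>> >> schema looks roughly like this (untested sketch; the fieldType name
>> >> is mine):

```xml
<!-- Untested sketch: a Solr fieldType wired to the CJK tokenizer factory.
     The name "text_cjk" is arbitrary. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```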
>> >>
>> >> -Peter
>> >>
>> >> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic
>> >> wrote:
>> >> > Peter,
>> >> >
>> >> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but
>> >> > just n-grams.
>> >> > Before you take the n-gram route, you may want to look at the smart
>> >> > Chinese analyzer in Lucene contrib (I think it works only for
>> >> > Simplified Chinese) and Sen (on java.net).  I also spotted a Korean
>> >> > analyzer in the wild a few months back.
>> >> >
>> >> > Otis
>> >> > --
>> >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>> >> >
>> >> >
>> >> >
>> >> > ----- Original Message ----
>> >> >> From: Peter Wolanin
>> >> >> To: solr-user@lucene.apache.org
>> >> >> Sent: Tue, November 10, 2009 4:06:52 PM
>> >> >> Subject: any docs on solr.EdgeNGramFilterFactory?
>> >> >>
>> >> >> This fairly recent blog post:
>> >> >>
>> >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>> >> >>
>> >> >> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
>> >> >> for the index.  I don't see any mention of that tokenizer on the Solr
>> >> >> wiki - is it just waiting to be added, or is there any other
>> >> >> documentation in addition to the blog post?  In particular, there was
>> >> >> a thread last year about using an N-gram tokenizer to enable
>> >> >> reasonable (if not ideal) searching of CJK text, so I'd be curious to
>> >> >> know how people are configuring their schema (with this tokenizer?)
>> >> >> for that use case.
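>> >> >>
>> >> >> For concreteness, the approach in that post amounts to something
>> >> >> roughly like this (my paraphrase - a sketch with illustrative
>> >> >> gram sizes, not copied from the post):

```xml
<!-- Sketch of an auto-suggest fieldType using EdgeNGramFilterFactory
     at index time only; minGramSize/maxGramSize values are illustrative. -->
<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```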
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Peter
>> >> >>
>> >> >> --
>> >> >> Peter M. Wolanin, Ph.D.
>> >> >> Momentum Specialist,  Acquia. Inc.
>> >> >> peter.wola...@acquia.com
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Peter M. Wolanin, Ph.D.
>> >> Momentum Specialist,  Acquia. Inc.
>> >> peter.wola...@acquia.com
>> >
>> >
>>
>>
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist,  Acquia. Inc.
>> peter.wola...@acquia.com
>>
>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wola...@acquia.com
