So, is this the normal N-gram one: NGramTokenizerFactory?

Digging deeper - there are actually CJK and Chinese tokenizers in the
Solr codebase:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html

The CJK one uses the lucene CJKTokenizer
http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html

and there even seems to be another one that no one has wrapped into Solr:
http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html

So it seems like the existing options are a little better than I thought,
though it would be nice to have some docs on properly configuring
these.
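
For what it's worth, here is a rough sketch of what field types using
these tokenizers might look like in schema.xml. This is untested, and
the field type names ("text_cjk", "text_ngram") are just made up for
illustration; the CJK factory takes no arguments, and the minGramSize /
maxGramSize values shown are one plausible choice, not a recommendation:

```xml
<!-- Untested sketch: field type using the Lucene CJK bigram tokenizer.
     The name "text_cjk" is invented for this example. -->
<fieldType name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Alternative: plain n-gram tokenization. minGramSize="1" and
     maxGramSize="2" would index unigrams plus bigrams, which is one
     common approach for CJK text; tune to taste. -->
<fieldType name="text_ngram" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.NGramTokenizerFactory"
               minGramSize="1" maxGramSize="2"/>
  </analyzer>
</fieldType>
```

If anyone has a known-good configuration for CJK search, it would be
great to get it onto the wiki.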

-Peter

On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic
<otis_gospodne...@yahoo.com> wrote:
> Peter,
>
> For CJK and n-grams, I think you don't want the *Edge* n-grams, but just 
> n-grams.
> Before you take the n-gram route, you may want to look at the smart Chinese 
> analyzer in Lucene contrib (I think it works only for Simplified Chinese) and 
> Sen (on java.net).  I also spotted a Korean analyzer in the wild a few months 
> back.
>
> Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> ----- Original Message ----
>> From: Peter Wolanin <peter.wola...@acquia.com>
>> To: solr-user@lucene.apache.org
>> Sent: Tue, November 10, 2009 4:06:52 PM
>> Subject: any docs on solr.EdgeNGramFilterFactory?
>>
>> This fairly recent blog post:
>>
>> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
>>
>> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer
>> for the index.  I don't see any mention of that tokenizer on the Solr
>> wiki - is it just waiting to be added, or is there any other
>> documentation in addition to the blog post?  In particular, there was
>> a thread last year about using an N-gram tokenizer to enable
>> reasonable (if not ideal) searching of CJK text, so I'd be curious to
>> know how people are configuring their schema (with this tokenizer?)
>> for that use case.
>>
>> Thanks,
>>
>> Peter
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist, Acquia, Inc.
>> peter.wola...@acquia.com
>
>



-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia, Inc.
peter.wola...@acquia.com
