RE: add CJKTokenizer to solr

Xuesong Luo Fri, 22 Jun 2007 10:04:46 -0700

Thanks, Otis. I didn't know CJK is only used for Asian languages. I'll try the 
German Analyzer.


-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 22, 2007 3:18 AM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm jumping in the middle of the thread here.
CJK = Chinese, Japanese, Korean
German = etwas ganz anderes (something else entirely)
Why are you trying to use CJKAnalyzer+Tokenizer for German?  Have you tried 
the German Analyzer from Lucene contrib?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message ----
From: Xuesong Luo <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, June 22, 2007 8:54:37 AM
Subject: RE: add CJKTokenizer to solr

Thanks, Toru and Chris,
I tried both the CJKTokenizer and CJKAnalyzer. Both return some unexpected 
highlight results when I tested with German. The field value I searched is 
"Ein Mann beißt den Hund".  The search term is beißt. 

When using CJKAnalyzer, beißt is treated as 2 single terms (bei and ß); the 
highlight result is: 
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str> 

When using CJKTokenizer, beißt is treated as 3 single terms; the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using the standard tokenizer, beißt is treated as a single word; the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>


I understand why the standard tokenizer treats beißt as a word, but I don't know 
how CJKAnalyzer and CJKTokenizer work. Could anyone explain a little bit?


Thanks
Xuesong
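
The splits reported above are consistent with a tokenizer that keeps runs of
plain ASCII letters and digits together and treats any other letter as a CJK
character, pairing those into overlapping two-character terms. The Java sketch
below models only that idea. It is not the Lucene code: the class name
CjkSplitSketch, its methods, and the stop-word set (assumed to contain "t",
which would explain why the analyzer returned one term fewer than the bare
tokenizer) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Simplified model of the splitting seen in this thread; NOT the Lucene implementation. */
class CjkSplitSketch {

    // Assumption: the analyzer drops a small stop-word set that includes "t".
    static final Set<String> ASSUMED_STOP_WORDS = Set.of("s", "t");

    /** Plain ASCII letters and digits are kept together as one term. */
    public static boolean isAsciiWordChar(char c) {
        return c < 128 && Character.isLetterOrDigit(c);
    }

    /** Any other letter (CJK, but also the German sharp s, U+00DF) is treated as CJK. */
    public static boolean isNonAsciiLetter(char c) {
        return c >= 128 && Character.isLetter(c);
    }

    /** Splits text into ASCII runs and overlapping bigrams of non-ASCII letters. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int start = i;
            if (isAsciiWordChar(c)) {
                while (i < text.length() && isAsciiWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i));
            } else if (isNonAsciiLetter(c)) {
                while (i < text.length() && isNonAsciiLetter(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    // A lone non-ASCII letter has no partner, so it stands alone.
                    tokens.add(text.substring(start, i));
                } else {
                    // Overlapping pairs: C1C2, C2C3, C3C4, ...
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                i++; // whitespace and punctuation only separate terms
            }
        }
        return tokens;
    }

    /** Tokenizes, then removes the assumed stop words. */
    public static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!ASSUMED_STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String word = "bei\u00DFt"; // the search term from this thread
        System.out.println(tokenize(word));
        System.out.println(analyze(word));
    }
}
```

Under these assumptions, tokenize("beißt") gives bei, ß, t and analyze("beißt")
gives bei, ß, which matches the two highlight results above. A real CJK string
such as 日本語 comes out as the pairs 日本 and 本語, which is the case this kind
of tokenizer is meant for.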

-----Original Message-----
From: Toru Matsuzawa [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 18, 2007 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm sorry. The attachment did not go through, 
so I am sending it again. 

> > I got the error below after adding CJKTokenizer to schema.xml.  I
> > checked the constructor of CJKTokenizer; it requires a Reader parameter,
> > and I guess that's why I get this error. I searched the email archive, and
> > it seems to be working for other users. Does anyone know what the problem is?
> 
> 
> The CJKTokenizerFactory that I am using is appended.
> 
--
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 *
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer( input );
  }

}


-- 
Toru Matsuzawa
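
The factory above sidesteps the constructor issue raised in the quoted
question. Here is a minimal sketch of why, assuming Solr creates the class
named in schema.xml reflectively with no arguments (an assumption, not
something confirmed in this thread). All class names below are hypothetical
stand-ins, not Solr or Lucene types.

```java
import java.io.Reader;
import java.io.StringReader;

/** Illustrates why a Reader-only class needs a no-arg factory in front of it. */
class FactorySketch {

    /** Hypothetical stand-in for a tokenizer whose only constructor takes a Reader. */
    static class ReaderOnlyTokenizer {
        final Reader input;

        ReaderOnlyTokenizer(Reader input) {
            this.input = input;
        }
    }

    /** Hypothetical stand-in for a factory: built with no arguments, given the Reader later. */
    static class ReaderOnlyTokenizerFactory {
        ReaderOnlyTokenizerFactory() {
        }

        ReaderOnlyTokenizer create(Reader input) {
            return new ReaderOnlyTokenizer(input);
        }
    }

    /** Returns true if the class declares a constructor that takes no arguments. */
    public static boolean hasNoArgConstructor(Class<?> type) {
        try {
            type.getDeclaredConstructor();
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The tokenizer cannot be created without a Reader; the factory can.
        System.out.println(hasNoArgConstructor(ReaderOnlyTokenizer.class));
        System.out.println(hasNoArgConstructor(ReaderOnlyTokenizerFactory.class));

        // Once the factory exists, the Reader is supplied per document.
        ReaderOnlyTokenizer tokenizer =
                new ReaderOnlyTokenizerFactory().create(new StringReader("text"));
        System.out.println(tokenizer.input != null);
    }
}
```

If that assumption holds, schema.xml would need to name the factory class
rather than the tokenizer itself.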
