Re: CJK Analyzer in lucene 1.3 final

车东 Fri, 27 Feb 2004 06:28:34 -0800

For East Asian languages, which by nature have no spaces to mark word boundaries, the StandardTokenizer is now unigram (single-character) based: C1C2C3 ==> C1 C2 C3, so searching for C1C2 and for C2C1 will return the same results.
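
To make the single-character case concrete, here is a minimal sketch in plain Java. It does not use Lucene at all: UnigramSketch, unigrams and matches are hypothetical names, each letter stands in for one CJK character (A = C1, B = C2, C = C3), and matching is assumed to ignore token order and position.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: one token per character.
final class UnigramSketch {

    // "ABC" -> [A, B, C]; each letter stands in for one CJK character.
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Assumption: a document matches when it contains every query token,
    // regardless of the order in which the tokens appear.
    static boolean matches(String doc, String query) {
        Set<String> docTokens = new HashSet<>(unigrams(doc));
        return docTokens.containsAll(unigrams(query));
    }

    public static void main(String[] args) {
        System.out.println(unigrams("ABC"));       // [A, B, C]
        System.out.println(matches("ABC", "AB"));  // true
        System.out.println(matches("ABC", "BA"));  // true
    }
}
```

Under that assumption the queries AB and BA reduce to the same set of tokens, so they cannot be told apart.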

CJKTokenizer is bigram based: C1C2C3 ==> C1C2 C2C3, so a search for C2C1 will not return a result for text that contains C1C2. In brief: CJKTokenizer is better than StandardTokenizer for CJK, but I don't know how to implement bigram-based tokens in StandardTokenizer.
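
The bigram idea can be sketched the same way in plain Java. This is not the CJKTokenizer source: BigramSketch, bigrams and matches are hypothetical names, each letter stands in for one CJK character, and returning a lone character unchanged is an assumption made only for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not Lucene's CJKTokenizer: overlapping two-character tokens.
final class BigramSketch {

    // "ABC" -> [AB, BC]; a single character is returned as is (sketch assumption).
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    // A document matches when it contains every bigram of the query.
    static boolean matches(String doc, String query) {
        return bigrams(doc).containsAll(bigrams(query));
    }

    public static void main(String[] args) {
        System.out.println(bigrams("ABC"));        // [AB, BC]
        System.out.println(matches("ABC", "AB"));  // true
        System.out.println(matches("ABC", "BA"));  // false
    }
}
```

Because each token keeps two adjacent characters together, character order is preserved inside the token, which is why the reversed query BA no longer matches.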

Che Dong
http://www.chedong.com/tech/lucene.html

From: Erik Hatcher <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Subject: Re: CJK Analyzer in lucene 1.3 final
Date: Fri, 27 Feb 2004 08:29:10 -0500

On Feb 27, 2004, at 7:12 AM, Ankur Goel wrote:

> Hi,
> In the lucene-1.3-final version's CHANGES.txt it is written: "Fix StandardTokenizer's handling of CJK characters (Chinese, Japanese and Korean ideograms)."
>
> Does this mean that for CJK characters we no longer need a separate analyzer, and the standard analyzer will be sufficient?

You tell us. Does it work for you?

An analyzer is a pretty personal decision based on your dataset, so it is impossible to answer your question directly.

Erik
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
