for east asian language without space for word segment in nature, the StandardTokenizer now is sigram based C1C2C3 ==> C1 C2 C3, so you search C1C2 and C2C1 will return same results

CJKTokenizer is bigram based: C1C2C3 ==> C1C2 C2C3, so you it will result return when you search C2C1,
briefly: CJKTotenizer is better than StandardTokenizer for CJK but I don't know how to implement bigram based token in StandartTokenzier.


Che Dong
http://www.chedong.com/tech/lucene.html

From: Erik Hatcher <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Subject: Re: CJK Analyzer in lucene 1.3 final
Date: Fri, 27 Feb 2004 08:29:10 -0500
MIME-Version: 1.0 (Apple Message framework v612)
Received: from mail.apache.org ([208.185.179.12]) by mc11-f27.hotmail.com
with Microsoft SMTPSVC(5.0.2195.6824); Fri, 27 Feb 2004 05:29:21 -0800
Received: (qmail 58976 invoked by uid 500); 27 Feb 2004 13:29:16 -0000
Received: (qmail 58962 invoked from network); 27 Feb 2004 13:29:15 -0000
Received: from unknown (HELO c000.snv.cp.net) (209.228.32.77) by
daedalus.apache.org with SMTP; 27 Feb 2004 13:29:15 -0000
Received: (cpmta 24544 invoked from network); 27 Feb 2004 05:29:16 -0800
Received: from 128.143.26.2 (HELO ?128.143.26.2?) by smtp.hatcher.net
(209.228.32.77) with SMTP; 27 Feb 2004 05:29:16 -0800
X-Message-Info: JGTYoYF78jEAnq90Su6PQLeCibywrZOE
Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm
Precedence: bulk
List-Unsubscribe: <mailto:[EMAIL PROTECTED]>
List-Subscribe: <mailto:[EMAIL PROTECTED]>
List-Help: <mailto:[EMAIL PROTECTED]>
List-Post: <mailto:[EMAIL PROTECTED]>
List-Id: "Lucene Users List" <lucene-user.jakarta.apache.org>
Delivered-To: mailing list [EMAIL PROTECTED]
X-Sent: 27 Feb 2004 13:29:16 GMT
In-Reply-To: <[EMAIL PROTECTED]>
References: <[EMAIL PROTECTED]>
Message-Id: <[EMAIL PROTECTED]>
X-Mailer: Apple Mail (2.612)
X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N
Return-Path:
[EMAIL PROTECTED]
X-OriginalArrivalTime: 27 Feb 2004 13:29:21.0631 (UTC)
FILETIME=[B57A96F0:01C3FD35]

On Feb 27, 2004, at 7:12 AM, Ankur Goel wrote:
Hi,
In the lucene-1.3-final version's CHANGES.txt it is written that "Fix
StandardTokenizer's handling of CJK characters (Chinese, Japanese and Korean
ideograms)."


Does it mean that for CJK characters we now do not need to use any separate
analyzer, standard analyzer will be sufficient??

You tell us. Does it work for you?


An analyzer is a pretty personal decision based on your dataset, so it is impossible to answer your question directly.

Erik


--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]


_________________________________________________________________
免费下载 MSN Explorer: http://explorer.msn.com/lccn/



--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]



Reply via email to