RE: CJK Analyzer in lucene 1.3 final

Ankur Goel Fri, 27 Feb 2004 08:12:19 -0800


I tried the StandardAnalyzer but could not get it to work. Then I tried the
CJKAnalyzer, which uses the CJKTokenizer, but was again unsuccessful. The file
whose text is to be indexed contains both English and Japanese characters.
Could this be the problem?


Regards
Ankur 

-----Original Message-----
From: ? ? [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 27, 2004 7:58 PM
To: [EMAIL PROTECTED]
Subject: Re: CJK Analyzer in lucene 1.3 final

For East Asian languages, which have no spaces to mark word boundaries, the 
StandardTokenizer is now unigram based: C1C2C3 ==> C1 C2 C3, so searching for 
C1C2 and for C2C1 will return the same results.

CJKTokenizer is bigram based: C1C2C3 ==> C1C2 C2C3, so it will not return a 
result when you search for C2C1.
Briefly: CJKTokenizer is better than StandardTokenizer for CJK, but I don't 
know how to implement bigram-based tokens in StandardTokenizer.
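
To illustrate the difference between single-character (unigram) and bigram
tokens, here is a minimal sketch in plain Python. It is not Lucene's tokenizer
code: the function names are hypothetical, and it assumes query tokens are
matched without regard to their position in the document.

```python
def unigrams(text):
    """Single-character tokens: C1C2C3 -> C1, C2, C3."""
    return list(text)


def bigrams(text):
    """Overlapping two-character tokens: C1C2C3 -> C1C2, C2C3."""
    if len(text) < 2:
        # A lone character (or empty string) cannot form a pair.
        return [text] if text else []
    return [text[i:i + 2] for i in range(len(text) - 1)]


def matches(doc_tokens, query_tokens):
    """Assumed matching rule: every query token occurs somewhere in the document."""
    return all(token in doc_tokens for token in query_tokens)


doc = "東京都"      # plays the role of C1C2C3
forward = "東京"    # C1C2
reversed_ = "京東"  # C2C1

# Unigram tokens: both character orders match the document.
print(matches(unigrams(doc), unigrams(forward)))
print(matches(unigrams(doc), unigrams(reversed_)))

# Bigram tokens: only the order that actually occurs in the text matches.
print(matches(bigrams(doc), bigrams(forward)))
print(matches(bigrams(doc), bigrams(reversed_)))
```

Under that assumption, the reversed query matches with unigram tokens but not
with bigram tokens, which is why bigrams give fewer false hits for CJK text.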

Che Dong
http://www.chedong.com/tech/lucene.html

>From: Erik Hatcher <[EMAIL PROTECTED]>
>Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
>To: "Lucene Users List" <[EMAIL PROTECTED]>
>Subject: Re: CJK Analyzer in lucene 1.3 final
>Date: Fri, 27 Feb 2004 08:29:10 -0500
>
>On Feb 27, 2004, at 7:12 AM, Ankur Goel wrote:
>>Hi,
>>In the lucene-1.3-final version's CHANGES.txt it is written: "Fix
>>StandardTokenizer's handling of CJK characters (Chinese, Japanese and
>>Korean ideograms)."
>>
>>Does this mean that for CJK characters we no longer need a separate
>>analyzer, and that the standard analyzer will be sufficient?
>
>You tell us.  Does it work for you?
>
>An analyzer is a pretty personal decision based on your dataset, so 
>it is impossible to answer your question directly.
>
>       Erik
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>
