I tried with Standard Analyzer but not able to do so. Then I tried CJk Anlayzer given using CJK tokenizer but was again unsuccesful. The File in which text to be indexed is contains noth English and Japanese Characters. Can this be a problem.
Regards Ankur -----Original Message----- From: ? ? [mailto:[EMAIL PROTECTED] Sent: Friday, February 27, 2004 7:58 PM To: [EMAIL PROTECTED] Subject: Re: CJK Analyzer in lucene 1.3 final for east asian language without space for word segment in nature, the StandardTokenizer now is sigram based C1C2C3 ==> C1 C2 C3, so you search C1C2 and C2C1 will return same results CJKTokenizer is bigram based: C1C2C3 ==> C1C2 C2C3, so you it will result return when you search C2C1, briefly: CJKTotenizer is better than StandardTokenizer for CJK but I don't know how to implement bigram based token in StandartTokenzier. Che Dong http://www.chedong.com/tech/lucene.html >From: Erik Hatcher <[EMAIL PROTECTED]> >Reply-To: "Lucene Users List" <[EMAIL PROTECTED]> >To: "Lucene Users List" <[EMAIL PROTECTED]> >Subject: Re: CJK Analyzer in lucene 1.3 final >Date: Fri, 27 Feb 2004 08:29:10 -0500 >MIME-Version: 1.0 (Apple Message framework v612) >Received: from mail.apache.org ([208.185.179.12]) by mc11-f27.hotmail.com with Microsoft SMTPSVC(5.0.2195.6824); Fri, 27 Feb 2004 05:29:21 -0800 >Received: (qmail 58976 invoked by uid 500); 27 Feb 2004 13:29:16 -0000 >Received: (qmail 58962 invoked from network); 27 Feb 2004 13:29:15 -0000 >Received: from unknown (HELO c000.snv.cp.net) (209.228.32.77) by daedalus.apache.org with SMTP; 27 Feb 2004 13:29:15 -0000 >Received: (cpmta 24544 invoked from network); 27 Feb 2004 05:29:16 -0800 >Received: from 128.143.26.2 (HELO ?128.143.26.2?) by smtp.hatcher.net (209.228.32.77) with SMTP; 27 Feb 2004 05:29:16 -0800 >X-Message-Info: JGTYoYF78jEAnq90Su6PQLeCibywrZOE >Mailing-List: contact [EMAIL PROTECTED]; run by ezmlm >Precedence: bulk >List-Unsubscribe: <mailto:[EMAIL PROTECTED]> >List-Subscribe: <mailto:[EMAIL PROTECTED]> >List-Help: <mailto:[EMAIL PROTECTED]> >List-Post: <mailto:[EMAIL PROTECTED]> >List-Id: "Lucene Users List" <lucene-user.jakarta.apache.org> >Delivered-To: mailing list [EMAIL PROTECTED] >X-Sent: 27 Feb 2004 13:29:16 GMT >In-Reply-To: <[EMAIL PROTECTED]> >References: <[EMAIL PROTECTED]> >Message-Id: <[EMAIL PROTECTED]> >X-Mailer: Apple Mail (2.612) >X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N >Return-Path: [EMAIL PROTECTED] >X-OriginalArrivalTime: 27 Feb 2004 13:29:21.0631 (UTC) FILETIME=[B57A96F0:01C3FD35] > >On Feb 27, 2004, at 7:12 AM, Ankur Goel wrote: >> Hi, >>In the lucene-1.3-final version's CHANGES.txt it is written that >>"Fix >>StandardTokenizer's handling of CJK characters (Chinese, Japanese >>and Korean >>ideograms)." >> >>Does it mean that for CJK characters we now do not need to use any >>separate >>analyzer, standard analyzer will be sufficient?? > >You tell us. Does it work for you? > >An analyzer is a pretty personal decision based on your dataset, so >it is impossible to answer your question directly. > > Erik > > >--------------------------------------------------------------------- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > _________________________________________________________________ 免费下载 MSN Explorer: http://explorer.msn.com/lccn/ --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]