I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his code
released under apache open source license (although his code _is_
opensource), I cannot place my work in the Lucene Sandbox.  This is
unfortunate, because I believe the analyzer works quite well in indexing and
searching chinese docs in GB2312 and UTF-8 encoding, and I like more people
to test, use, and confirm this.  So anyone who wants it, can have it. Just
shoot me an email.
BTW, I also have written an arabic analyzer, which is collecting dust for
similar reasons.
Good luck,

Ali Safarnejad


-----Original Message-----
From: Eric Chow [mailto:[EMAIL PROTECTED] 
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!


Search not really correct with UTF-8 !!!


The following is the search result that I used the SearchFiles in the lucene
demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles <idnex>
Query: 經
Searching for: g                                <<<<<<<<<<<<      strange ??
3 total matching documents
0. ../docs/ChineseDemo.html            <<<<<<<<<<<<    this files contains
the 經
   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query: 



>From the above result only the ChineseDemo.html includes the character that I
want to search !




The modified code in SearchFiles.java:


BufferedReader in = new BufferedReader(new InputStreamReader(System.in,
"UTF-8"));

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to