I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his code
released under the Apache open source license (although his code _is_
open source), I cannot place my work in the Lucene Sandbox. This is
unfortunate, because I believe the analyzer works quite well for indexing and
searching Chinese docs in GB2312 and UTF-8 encoding, and I would like more
people to test, use, and confirm this. So anyone who wants it can have it.
Just shoot me an email.
BTW, I have also written an Arabic analyzer, which is collecting dust for
similar reasons.
Good luck,
Ali Safarnejad
-----Original Message-----
From: Eric Chow [mailto:[EMAIL PROTECTED]
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!
Search is not really correct with UTF-8 !!!
The following is the search result I got using SearchFiles from the Lucene
demo.
d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles <idnex>
Query: 經
Searching for: g <<<<<<<<<<<< strange ??
3 total matching documents
0. ../docs/ChineseDemo.html <<<<<<<<<<<< this file contains the 經
-
1. ../docs/luceneplan.html
- Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
- Index (Lucene 1.4.3 API)
Query:
From the above result, only ChineseDemo.html includes the character that I
want to search for!
The modified code in SearchFiles.java:
BufferedReader in = new BufferedReader(
    new InputStreamReader(System.in, "UTF-8"));
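
One possible explanation for the odd "Searching for:" line (a guess, not
something confirmed in this thread) is that the console does not deliver the
typed query as UTF-8 bytes, so wrapping System.in in a UTF-8 reader decodes
them into the wrong characters. The following minimal sketch is not part of
the Lucene demo; the class name ConsoleEncodingDemo, the helper decodeQuery,
and the choice of Big5 as the console code page are all assumptions made for
illustration. It simulates typing 經 on a console with one charset and
reading it back with another.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ConsoleEncodingDemo {

    // Hypothetical helper: simulates typing `typed` on a console that emits
    // bytes in `consoleCharset`, then reads one line back with
    // `readerCharset`, the same way the modified SearchFiles wraps System.in.
    public static String decodeQuery(String typed, Charset consoleCharset,
                                     Charset readerCharset) {
        byte[] consoleBytes = typed.getBytes(consoleCharset);
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new ByteArrayInputStream(consoleBytes), readerCharset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String query = "\u7D93"; // the character 經 from the post

        // Assumption: the console code page is Big5. If this JVM does not
        // ship the Big5 charset, UTF-16LE is used as a stand-in mismatch.
        Charset console = Charset.isSupported("Big5")
                ? Charset.forName("Big5")
                : StandardCharsets.UTF_16LE;

        String matched = decodeQuery(query, StandardCharsets.UTF_8,
                StandardCharsets.UTF_8);
        String mismatched = decodeQuery(query, console,
                StandardCharsets.UTF_8);

        // The query survives only when both sides use the same charset.
        System.out.println("charsets agree:    " + query.equals(matched));
        System.out.println("charsets disagree: " + query.equals(mismatched));
    }
}
```

If this is what is happening, the reader in SearchFiles would need the
charset the console actually uses rather than "UTF-8". On Windows, the chcp
command prints the active console code page.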
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
---------------------------------------------------------------------