Re: Search Chinese in Unicode !!!

2005-01-26 Thread "René Hackl"
> > The modified code in SearchFiles.java:
> > 
> > 
> > BufferedReader in = new BufferedReader(new
> > InputStreamReader(System.in, "UTF-8"));

It might make sense to incorporate a similar change in WordlistLoader.
Instead of 

freader = new FileReader(wordfile);
lnr = new LineNumberReader(freader);

I think it's preferable to do something like

LineNumberReader lnr = new LineNumberReader(new InputStreamReader(
new FileInputStream(wordfile), "UTF-8"));

so that word files in more languages can be loaded, now that WordlistLoader
resides in the analysis package.
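
A minimal sketch of that idea, for illustration only: FileReader decodes
with the platform default encoding, while an InputStreamReader with an
explicit charset does not depend on the platform. The class name
Utf8WordlistSketch and the method loadWords are hypothetical and are not
part of Lucene's WordlistLoader; the one-word-per-line file layout is also
an assumption.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not Lucene's WordlistLoader.
class Utf8WordlistSketch {

    /** Reads one word per line, always decoding the file as UTF-8. */
    static Set<String> loadWords(File wordfile) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (LineNumberReader lnr = new LineNumberReader(
                new InputStreamReader(new FileInputStream(wordfile), StandardCharsets.UTF_8))) {
            String line;
            while ((line = lnr.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word); // skip blank lines, keep everything else
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 word file, then read it back.
        File tmp = File.createTempFile("stopwords", ".txt");
        tmp.deleteOnExit();
        Files.write(tmp.toPath(), "\u7684\n\u7d93\nthe\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(loadWords(tmp).size() + " words loaded");
    }
}
```

The word file itself still has to be saved as UTF-8 for this to help; a
file in GB2312 or Big5 would need that charset name instead.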

Regards,
René


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Chinese in Unicode !!!

2005-01-25 Thread Otis Gospodnetic
I don't have a document with Chinese characters to verify this, but it
looks right, so I'll add your change to SearchFiles.java.

Thanks,
Otis

--- Eric Chow <[EMAIL PROTECTED]> wrote:

> Search is not really correct with UTF-8 !!!
> 
> The following is the search result I got using SearchFiles from the
> Lucene demo.
> 
> d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
> Usage: java SearchFiles 
> Query: 經
> Searching for: g    <<<<<<<<<<<< strange ??
> 3 total matching documents
> 0. ../docs/ChineseDemo.html    <<<<<<<<<<<< this file contains the 經
>    -
> 1. ../docs/luceneplan.html
>    - Jakarta Lucene - Plan for enhancements to Lucene
> 2. ../docs/api/index-all.html
>    - Index (Lucene 1.4.3 API)
> Query: 
> 
> Of the results above, only ChineseDemo.html includes the character
> that I want to search for!
> 
> The modified code in SearchFiles.java:
> 
> BufferedReader in = new BufferedReader(new
> InputStreamReader(System.in, "UTF-8"));





Re: Search Chinese in Unicode !!!

2005-01-22 Thread ansi
Hi Safarnejad,
Would you please send me a copy of your code?
zhousp#gmail.com

Thanks :)


On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS)
<[EMAIL PROTECTED]> wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by
> Erik Peterson. However, as the author of the segmenter does not want his code
> released under the Apache open source license (although his code _is_
> open source), I cannot place my work in the Lucene Sandbox.  This is
> unfortunate, because I believe the analyzer works quite well in indexing and
> searching Chinese docs in GB2312 and UTF-8 encoding, and I would like more
> people to test, use, and confirm this.  So anyone who wants it can have it.
> Just shoot me an email.
> BTW, I have also written an Arabic analyzer, which is collecting dust for
> similar reasons.
> Good luck,
> 
> Ali Safarnejad



Re: Search Chinese in Unicode !!!

2005-01-22 Thread 田春峰
Hi Eric,

If you can read Chinese, please refer to this blog:
http://blog.csdn.net/accesine960
Or search for WebLucene at www.sf.net, a project based on Lucene by a
Chinese developer named chedong; his web site is www.chedong.com

Good luck

Eric Chow <[EMAIL PROTECTED]> wrote:
How to create index with chinese (in utf-8 encoding ) HTML and search
with Lucene ?


Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
I want that Chinese Analyzer !!





Re: Search Chinese in Unicode !!!

2005-01-21 Thread jian chen
Hi,

I have done some work on Chinese text search. The main problem is how to
separate the words, since in Chinese there is no white space between
words.

The typical commercial search engines these days use a dictionary-based
approach. That is, they look through the Chinese text and find the
words that are in the dictionary. For characters that do not
match words in the dictionary, you could use a bi-gram based approach.
Say, for a b c, you could index 2 (pseudo) words: ab, bc.

I think a pure bi-gram based approach is not good for a relatively large
Chinese text collection, as you end up with many pseudo terms that are
not actual words.
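
A rough sketch of the bi-gram fallback only, assuming the input is a run
of characters that matched nothing in the dictionary. The class name
BigramSketch and the method bigrams are made up for illustration; this is
not the CJKAnalyzer implementation, and the dictionary lookup is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of overlapping bi-grams, not Lucene code.
class BigramSketch {

    /** Splits a run of characters into overlapping two-character pseudo-words. */
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        // Work on code points so that supplementary characters stay whole.
        int[] cps = text.codePoints().toArray();
        if (cps.length == 1) {
            terms.add(text); // a lone character is indexed as-is
            return terms;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            terms.add(new String(cps, i, 2)); // characters i and i+1
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("abc")); // prints [ab, bc]
    }
}
```

A run of n characters yields n-1 pseudo terms, which is where the growth
in non-word terms on a large collection comes from.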

Cheers,

Jian

On Fri, 21 Jan 2005 18:55:56 +0100, Safarnejad, Ali (AFIS)
<[EMAIL PROTECTED]> wrote:
> The ChineseAnalyzer tokenizes based on some English stopwords.  The
> CJKAnalyzer is not much more sophisticated for Chinese analysis (2 byte
> tokenizing).  The analyzer I just sent you (using Erik Peterson's
> segmenter:), looks up three dictionaries to segment the Chinese text, based
> on real word matches.



RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
The ChineseAnalyzer tokenizes based on some English stopwords.  The
CJKAnalyzer is not much more sophisticated for Chinese analysis (2 byte
tokenizing).  The analyzer I just sent you (using Erik Peterson's
segmenter:), looks up three dictionaries to segment the Chinese text, based
on real word matches.

Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com.
Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some
people actually said the StandardAnalyzer works better. I wonder what
the pros and cons are.




RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net,
etc.), we should link to it from one of the Lucene pages where we
link to related external tools, apps, and such.

Otis





RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his code
released under the Apache open source license (although his code _is_
open source), I cannot place my work in the Lucene Sandbox.  This is
unfortunate, because I believe the analyzer works quite well in indexing and
searching Chinese docs in GB2312 and UTF-8 encoding, and I would like more
people to test, use, and confirm this.  So anyone who wants it can have it.
Just shoot me an email.
BTW, I have also written an Arabic analyzer, which is collecting dust for
similar reasons.
Good luck,

Ali Safarnejad

Re: Search Chinese in Unicode !!!

2005-01-21 Thread PA
On Jan 21, 2005, at 11:42, Eric Chow wrote:
> Search not really correct with UTF-8 !!!

Lucene works just fine with any flavor of Unicode as long as _your_ 
application knows how to consistently deal with Unicode as well. 
Remember: the world is not just one Big5 pile.
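
One possible illustration of what inconsistent handling looks like. This
is only a sketch under a loud assumption: that the console delivers the
query in Big5, where the character 經 is taken to be the two bytes
0xB8 0x67, while the program decodes System.in as UTF-8. The class name
EncodingMismatchSketch and the method decodeAsUtf8 are hypothetical. If
that assumption holds, it would be consistent with the lone "g" in the
"Searching for:" line of the report, but the thread does not confirm it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; the Big5 byte values are an assumption.
class EncodingMismatchSketch {

    /** Decodes raw input bytes as UTF-8, replacing malformed input with U+FFFD. */
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed Big5 bytes for the query character.
        byte[] big5Bytes = {(byte) 0xB8, 0x67};
        String wrong = decodeAsUtf8(big5Bytes);
        // 0xB8 cannot start a UTF-8 sequence, so it is replaced;
        // 0x67 is plain ASCII 'g' and survives.
        System.out.println("last char: " + wrong.charAt(wrong.length() - 1));

        // The same character round-trips when both sides agree on UTF-8.
        byte[] utf8Bytes = "\u7d93".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeAsUtf8(utf8Bytes).equals("\u7d93")); // prints true
    }
}
```

The point is only that the charset named in the reader has to match what
the input source really sends, at index time and at query time alike.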

As far as Analyzer goes, you may or may not be better off using 
something more tailored to your linguistic needs. That said, even the 
default Analyzer does a fairly decent job at handling non-roman 
languages. YMMV.

Cheers
--
PA
http://alt.textdrive.com/


Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
Search is not really correct with UTF-8 !!!

The following is the search result I got using SearchFiles from the
Lucene demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles 
Query: 經
Searching for: g    <<<<<<<<<<<< strange ??
3 total matching documents
0. ../docs/ChineseDemo.html    <<<<<<<<<<<< this file contains the 經
   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query: 

Of the results above, only ChineseDemo.html includes the character
that I want to search for!

The modified code in SearchFiles.java:

BufferedReader in = new BufferedReader(new
InputStreamReader(System.in, "UTF-8"));




Re: Search Chinese in Unicode !!!

2005-01-21 Thread Erik Hatcher
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:
> How to create index with chinese (in utf-8 encoding ) HTML and search
> with Lucene ?

Indexing and searching Chinese is basically no different from using
English with Lucene.  We covered a bit about it in Lucene in Action:

http://www.lucenebook.com/search?query=chinese

And a screenshot here:

http://www.blogscene.org/erik/LuceneInAction/i18n.html

The main issues in dealing with Chinese, and of course other languages,
are encoding concerns when reading in the text, at both indexing and
query time, and analysis (as you can see from the screenshot).

Lucene itself works with Unicode fine and you're free to index anything.

Erik