Re: Search Chinese in Unicode !!!

2005-01-26 Thread René Hackl
  The modified code in SearchFiles.java:

  BufferedReader in = new BufferedReader(new
  InputStreamReader(System.in, "UTF-8"));

It might make sense to incorporate a similar change in WordlistLoader.
Instead of 

freader = new FileReader(wordfile);
lnr = new LineNumberReader(freader);

I think it's preferable to do something like

LineNumberReader lnr = new LineNumberReader(new InputStreamReader(
new FileInputStream(wordfile), "UTF-8"));

so that word files in more languages load correctly, now that
WordlistLoader resides in the analysis package.
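
A minimal sketch of that idea as a standalone class, assuming the word file holds one word per line and is UTF-8 encoded. The class name Utf8WordlistSketch and its method names are made up for illustration; this is not Lucene's actual WordlistLoader. It uses StandardCharsets.UTF_8 (Java 7 and later) in place of the "UTF-8" string.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
final class Utf8WordlistSketch {

    /** Reads one word per line from a file, decoding its bytes as UTF-8. */
    static Set<String> loadWords(File wordfile) throws IOException {
        try (InputStream in = new FileInputStream(wordfile)) {
            return loadWords(in);
        }
    }

    /** Same as above for any byte stream; blank lines are skipped. */
    static Set<String> loadWords(InputStream in) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        // The explicit charset replaces FileReader's platform-default decoding.
        LineNumberReader lnr = new LineNumberReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        String line;
        while ((line = lnr.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep this source file ASCII-only:
        // \u7684 and \u7d93 are two Chinese characters.
        byte[] data = "\u7684\n\u7d93\nthe\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadWords(new ByteArrayInputStream(data)).size());
    }
}
```

The InputStream overload is only there so the decoding can be tried without a file on disk.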

Regards,
René


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Search Chinese in Unicode !!!

2005-01-25 Thread Otis Gospodnetic
I don't have a document with Chinese characters to verify this, but it
looks right, so I'll add your change to SearchFiles.java.

Thanks,
Otis

--- Eric Chow [EMAIL PROTECTED] wrote:

 Search is not really correct with UTF-8!

 The following is the search result I got using SearchFiles from the
 Lucene demo.

 d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
 org.apache.lucene.demo.SearchFiles c:\temp\myindex
 Usage: java SearchFiles idnex
 Query: 經
 Searching for: g  strange ??
 3 total matching documents
 0. ../docs/ChineseDemo.htmlthis files contains the 經
    -
 1. ../docs/luceneplan.html
    - Jakarta Lucene - Plan for enhancements to Lucene
 2. ../docs/api/index-all.html
    - Index (Lucene 1.4.3 API)
 Query:

 Of the results above, only ChineseDemo.html contains the character
 that I want to search for!

 The modified code in SearchFiles.java:

 BufferedReader in = new BufferedReader(new
 InputStreamReader(System.in, "UTF-8"));






Re: Search Chinese in Unicode !!!

2005-01-22 Thread
Hi Eric,

If you can read Chinese, please refer to this blog:
http://blog.csdn.net/accesine960
Or search for WebLucene on www.sf.net, a project based on Lucene by a
Chinese developer named chedong; his website is www.chedong.com.

Good luck

Eric Chow [EMAIL PROTECTED] wrote:
How to create index with chinese (in utf-8 encoding ) HTML and search
with Lucene ?


Re: Search Chinese in Unicode !!!

2005-01-22 Thread ansi
Hi Safarnejad,
would you please send me a copy of your code?
zhousp#gmail.com

Thanks :)




Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
How do I create an index of Chinese HTML (in UTF-8 encoding) and search
it with Lucene?




Re: Search Chinese in Unicode !!!

2005-01-21 Thread Erik Hatcher
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:

 How to create index with chinese (in utf-8 encoding ) HTML and search
 with Lucene ?

Indexing and searching Chinese is basically no different from using
English with Lucene.  We covered a bit about it in Lucene in Action:

http://www.lucenebook.com/search?query=chinese

And a screenshot here:

http://www.blogscene.org/erik/LuceneInAction/i18n.html

The main issues in dealing with Chinese, and of course other languages,
are encoding concerns when reading in the text, for both indexing and
querying, and analysis (as you can see from the screenshot).

Lucene itself works fine with Unicode and you're free to index anything.

Erik


Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
Search is not really correct with UTF-8!


The following is the search result I got using SearchFiles from the
Lucene demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java
org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles idnex
Query: 經
Searching for: g  strange ??
3 total matching documents
0. ../docs/ChineseDemo.htmlthis files contains the 經
   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query:



Of the results above, only ChineseDemo.html contains the character
that I want to search for!




The modified code in SearchFiles.java:


BufferedReader in = new BufferedReader(new
InputStreamReader(System.in, "UTF-8"));
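
To illustrate why the charset argument matters, here is a small sketch that decodes the same bytes twice. It assumes the query bytes really are UTF-8, which with System.in depends on how the console is configured. The class QueryDecodingSketch and its readQuery method are hypothetical names, and a ByteArrayInputStream stands in for System.in.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the Lucene demo sources.
final class QueryDecodingSketch {

    /** Reads the first line from a byte stream using the given charset. */
    static String readQuery(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        return reader.readLine();
    }

    public static void main(String[] args) throws IOException {
        // "\u7d93" is the queried character; in UTF-8 it is the bytes E7 B6 93.
        byte[] typed = "\u7d93\n".getBytes(StandardCharsets.UTF_8);

        String right = readQuery(new ByteArrayInputStream(typed), StandardCharsets.UTF_8);
        String wrong = readQuery(new ByteArrayInputStream(typed), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 1: the original character
        System.out.println(wrong.length()); // 3: one unrelated character per byte
    }
}
```

A query decoded with the wrong charset no longer contains the character that was indexed, so the search runs on unrelated text.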




Re: Search Chinese in Unicode !!!

2005-01-21 Thread PA
On Jan 21, 2005, at 11:42, Eric Chow wrote:

 Search not really correct with UTF-8 !!!

Lucene works just fine with any flavor of Unicode as long as _your_
application knows how to consistently deal with Unicode as well.
Remember: the world is not just one Big5 pile.

As far as Analyzer goes, you may or may not be better off using 
something more tailored to your linguistic needs. That said, even the 
default Analyzer does a fairly decent job at handling non-roman 
languages. YMMV.

Cheers
--
PA
http://alt.textdrive.com/


RE: Search Chinese in Unicode !!!

2005-01-21 Thread Safarnejad, Ali (AFIS)
I've written a Chinese Analyzer for Lucene that uses a segmenter written by
Erik Peterson. However, as the author of the segmenter does not want his code
released under the Apache open source license (although his code _is_
open source), I cannot place my work in the Lucene Sandbox.  This is
unfortunate, because I believe the analyzer works quite well in indexing and
searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people
to test, use, and confirm this.  So anyone who wants it can have it. Just
shoot me an email.

BTW, I have also written an Arabic analyzer, which is collecting dust for
similar reasons.

Good luck,

Ali Safarnejad

RE: Search Chinese in Unicode !!!

2005-01-21 Thread Otis Gospodnetic
If you are hosting the code somewhere (e.g. your site, SF, java.net,
etc.), we should link to them from one of the Lucene pages where we
link to related external tools, apps, and such.

Otis





Re: Search Chinese in Unicode !!!

2005-01-21 Thread aurora
I would love to give it a try. Please email me at aurora00 at gmail.com.
Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some
people have actually said the StandardAnalyzer works better. I wonder what
the pros and cons are.
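
One way to picture the difference being asked about, sketched in plain Java without any Lucene classes. As far as I understand it, CJKAnalyzer indexes overlapping two-character tokens, while ChineseAnalyzer and StandardAnalyzer emit one token per Chinese character. The class CjkTokenSketch and its methods are made-up names, and the code assumes the input is a run of CJK characters from the Basic Multilingual Plane (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only; the real analyzers also handle Latin text, digits, etc.
final class CjkTokenSketch {

    /** One token per character (unigrams). */
    static List<String> unigrams(String run) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < run.length(); i++) {
            tokens.add(run.substring(i, i + 1));
        }
        return tokens;
    }

    /** Overlapping two-character tokens (bigrams); a lone character is kept as is. */
    static List<String> bigrams(String run) {
        List<String> tokens = new ArrayList<>();
        if (run.length() == 1) {
            tokens.add(run);
            return tokens;
        }
        for (int i = 0; i + 1 < run.length(); i++) {
            tokens.add(run.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Four Chinese characters, written as escapes to keep the source ASCII-only.
        String text = "\u4e2d\u6587\u641c\u7d22";
        System.out.println(unigrams(text).size()); // 4 single-character tokens
        System.out.println(bigrams(text).size());  // 3 overlapping pairs
    }
}
```

Roughly speaking, bigrams make for a larger index but match multi-character words more precisely, while single-character tokens keep the term list small and depend on phrase matching at query time.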




Re: Search Chinese in Unicode !!!

2005-01-21 Thread Eric Chow
I want that Chinese Analyzer!!

