Re: Search Chinese in Unicode !!!
> The modified code in SearchFiles.java:
>
>     BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));

It might make sense to incorporate a similar change in WordlistLoader. Instead of

    freader = new FileReader(wordfile);
    lnr = new LineNumberReader(freader);

I think it's preferable to do something like

    LineNumberReader lnr = new LineNumberReader(
        new InputStreamReader(new FileInputStream(wordfile), "UTF-8"));

so that word files in even more languages can be loaded, now that the class resides in the analysis package.

Regards,
René

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
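
To illustrate René's suggestion, here is a minimal sketch of a word-list loader that always decodes its input as UTF-8 rather than relying on the platform default charset, which is what `FileReader` used. The class name `Utf8WordlistSketch` and the method `loadWords` are hypothetical and are not Lucene's actual `WordlistLoader`; the sketch reads from an `InputStream` so it can be tried without a file on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical stand-in for a word-list loader; NOT Lucene's WordlistLoader. */
class Utf8WordlistSketch {

    /** Reads one word per line, decoding the bytes as UTF-8 on every platform. */
    static Set<String> loadWords(InputStream in) throws IOException {
        Set<String> words = new LinkedHashSet<>();
        try (LineNumberReader lnr = new LineNumberReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = lnr.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {   // skip blank lines
                    words.add(word);
                }
            }
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        // Two sample words, one of them non-ASCII (U+7684), encoded as UTF-8 bytes.
        byte[] data = "the\n\u7684\n".getBytes(StandardCharsets.UTF_8);
        Set<String> words = loadWords(new ByteArrayInputStream(data));
        System.out.println(words.size() + " words loaded");
    }
}
```

For a word file on disk, the same method could be called with `new FileInputStream(wordfile)`, assuming the file really is saved as UTF-8.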
Re: Search Chinese in Unicode !!!
I don't have a document with Chinese characters to verify this, but it looks right, so I'll add your change to SearchFiles.java.

Thanks,
Otis

--- Eric Chow [EMAIL PROTECTED] wrote:

> Search is not really correct with UTF-8!
>
> The following is the search result I got using SearchFiles from the Lucene demo:
>
>     d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
>     Usage: java SearchFiles idnex
>     Query: 經
>     Searching for: g        (strange ??)
>     3 total matching documents
>     0. ../docs/ChineseDemo.html        (this file contains the 經)
>        -
>     1. ../docs/luceneplan.html
>        - Jakarta Lucene - Plan for enhancements to Lucene
>     2. ../docs/api/index-all.html
>        - Index (Lucene 1.4.3 API)
>     Query:
>
> Of the results above, only ChineseDemo.html includes the character that I want to search for!
>
> The modified code in SearchFiles.java:
>
>     BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
Re: Search Chinese in Unicode !!!
Hi Eric,

If you can read Chinese directly, please refer to this blog: http://blog.csdn.net/accesine960

Or search for WebLucene at www.sf.net. It is a project based on Lucene, written by a Chinese developer named chedong; his web site is www.chedong.com.

Good luck.

Eric Chow [EMAIL PROTECTED] wrote:

> How do I create an index of Chinese HTML (in UTF-8 encoding) and search it with Lucene?
Re: Search Chinese in Unicode !!!
Hi Safarnejad,

Would you please send me a copy of your code? zhousp#gmail.com

Thanks :)

On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote:

> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well for indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.
>
> BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.
>
> Good luck,
> Ali Safarnejad
Search Chinese in Unicode !!!
How do I create an index of Chinese HTML (in UTF-8 encoding) and search it with Lucene?
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:

> How do I create an index of Chinese HTML (in UTF-8 encoding) and search it with Lucene?

Indexing and searching Chinese with Lucene is basically no different from using English. We covered a bit about it in Lucene in Action: http://www.lucenebook.com/search?query=chinese

And a screenshot here: http://www.blogscene.org/erik/LuceneInAction/i18n.html

The main issues in dealing with Chinese, and of course other languages, are encoding concerns when reading in the text, at both indexing and query time, and analysis (as you can see from the screenshot). Lucene itself works fine with Unicode, and you're free to index anything.

Erik
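
The encoding concern Erik mentions can be shown without Lucene at all. The sketch below, which uses only the JDK, takes the UTF-8 bytes of one Chinese character (U+7D93, the character searched for elsewhere in this thread) and decodes them twice: once as UTF-8 and once as ISO-8859-1. The class and method names are made up for illustration. If the indexing side and the query side decode the same bytes differently, the resulting strings differ, so the terms cannot match.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: shows how a charset mismatch changes the decoded text. */
class EncodingMismatchSketch {

    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String term = "\u7D93";                                  // one Chinese character
        byte[] onDisk = term.getBytes(StandardCharsets.UTF_8);   // 3 bytes: E7 B6 93

        String right = decodeUtf8(onDisk);
        String wrong = decodeLatin1(onDisk);

        System.out.println(right.equals(term));   // prints "true"
        System.out.println(wrong.equals(term));   // prints "false"
        System.out.println(wrong.length());       // prints "3" (one char per byte)
    }
}
```

The practical consequence is to state the charset explicitly wherever bytes become text, both when reading documents for indexing and when reading the user's query.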
Re: Search Chinese in Unicode !!!
Search is not really correct with UTF-8!

The following is the search result I got using SearchFiles from the Lucene demo:

    d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
    Usage: java SearchFiles idnex
    Query: 經
    Searching for: g        (strange ??)
    3 total matching documents
    0. ../docs/ChineseDemo.html        (this file contains the 經)
       -
    1. ../docs/luceneplan.html
       - Jakarta Lucene - Plan for enhancements to Lucene
    2. ../docs/api/index-all.html
       - Index (Lucene 1.4.3 API)
    Query:

Of the results above, only ChineseDemo.html includes the character that I want to search for!

The modified code in SearchFiles.java:

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
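
The one-line change above replaces the platform default charset with UTF-8 when the query is read from standard input. A self-contained sketch of the same idea follows; `QueryReaderSketch` and `readQuery` are hypothetical names, and a `ByteArrayInputStream` stands in for `System.in` so the behaviour can be checked without a console. Note the assumption: this only helps if the console really delivers UTF-8 bytes. If the terminal sends the query in another encoding (for example a local code page), decoding it as UTF-8 will still produce the wrong characters.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Illustrative only: reads one query line from a byte stream as UTF-8. */
class QueryReaderSketch {

    /** Returns the first line of input decoded as UTF-8, or null at end of input. */
    static String readQuery(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
        String line = reader.readLine();
        return line == null ? null : line.trim();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a user typing one Chinese character (U+7D93) and pressing Enter.
        byte[] typed = "\u7D93\n".getBytes(StandardCharsets.UTF_8);
        String query = readQuery(new ByteArrayInputStream(typed));
        System.out.println("query length in chars: " + query.length());
    }
}
```

In the demo itself, `System.in` would be passed where the sketch passes the byte array stream.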
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 11:42, Eric Chow wrote:

> Search is not really correct with UTF-8!

Lucene works just fine with any flavor of Unicode as long as _your_ application knows how to deal with Unicode consistently as well. Remember: the world is not just one Big5 pile.

As far as the Analyzer goes, you may or may not be better off using something more tailored to your linguistic needs. That said, even the default Analyzer does a fairly decent job of handling non-Roman languages. YMMV.

Cheers

--
PA
http://alt.textdrive.com/
RE: Search Chinese in Unicode !!!
I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well for indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.

BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.

Good luck,

Ali Safarnejad

-----Original Message-----
From: Eric Chow [mailto:[EMAIL PROTECTED]]
Sent: 21 January 2005 11:42
To: Lucene Users List
Subject: Re: Search Chinese in Unicode !!!

Search is not really correct with UTF-8!

The following is the search result I got using SearchFiles from the Lucene demo:

    d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
    Usage: java SearchFiles idnex
    Query: 經
    Searching for: g        (strange ??)
    3 total matching documents
    0. ../docs/ChineseDemo.html        (this file contains the 經)
       -
    1. ../docs/luceneplan.html
       - Jakarta Lucene - Plan for enhancements to Lucene
    2. ../docs/api/index-all.html
       - Index (Lucene 1.4.3 API)
    Query:

Of the results above, only ChineseDemo.html includes the character that I want to search for!

The modified code in SearchFiles.java:

    BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
RE: Search Chinese in Unicode !!!
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to it from one of the Lucene pages where we link to related external tools, apps, and such.

Otis

--- Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote:

> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well for indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.
>
> BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.
>
> Good luck,
> Ali Safarnejad
Re: Search Chinese in Unicode !!!
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people have actually said that the StandardAnalyzer works better. I wonder what the pros and cons are.

> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well for indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.
>
> BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.
>
> Good luck,
> Ali Safarnejad
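
On the question of how the analyzers differ: as commonly described, ChineseAnalyzer emits one token per Chinese character, while CJKAnalyzer emits overlapping two-character tokens (bigrams). The sketch below imitates those two strategies in plain Java so the difference is visible; it is an illustration under that assumption, not the analyzers' actual code, and the class and method names are invented. Roughly, single-character tokens match more loosely, while bigrams yield more specific terms and a larger term dictionary.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: single-character vs. overlapping two-character tokens. */
class CjkTokenSketch {

    /** One token per Unicode code point. */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            tokens.add(new String(Character.toChars(cp)));
            i += Character.charCount(cp);
        }
        return tokens;
    }

    /** Overlapping pairs of adjacent characters; a lone character is kept as is. */
    static List<String> bigrams(String text) {
        List<String> chars = unigrams(text);
        List<String> tokens = new ArrayList<>();
        if (chars.size() == 1) {
            tokens.add(chars.get(0));
            return tokens;
        }
        for (int i = 0; i + 1 < chars.size(); i++) {
            tokens.add(chars.get(i) + chars.get(i + 1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u6587\u641C\u7D22";   // four Chinese characters
        System.out.println(unigrams(text).size());  // prints "4"
        System.out.println(bigrams(text).size());   // prints "3"
    }
}
```

A query has to be tokenized the same way as the indexed text, so whichever strategy is chosen should be used on both sides.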
Re: Search Chinese in Unicode !!!
I want that Chinese Analyzer!!

On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) [EMAIL PROTECTED] wrote:

> I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well for indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I'd like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.
>
> BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.
>
> Good luck,
> Ali Safarnejad