Re: Search Chinese in Unicode !!!
> > The modified code in SearchFiles.java:
> >
> > BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));

It might make sense to incorporate a similar change in WordlistLoader. Instead of

    freader = new FileReader(wordfile);
    lnr = new LineNumberReader(freader);

I think it's preferable to do something like

    LineNumberReader lnr = new LineNumberReader(new InputStreamReader(
        new FileInputStream(wordfile), "UTF-8"));

to load even more languages' files, now that it resides in the analysis package.

Regards,
René

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
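The suggestion above can be tried outside Lucene. What follows is a minimal sketch under stated assumptions, not the actual WordlistLoader code: the class name Utf8WordlistSketch and the method loadWords are hypothetical, and the one-word-per-line file layout is assumed. It only shows a LineNumberReader wrapped around an InputStreamReader with an explicit UTF-8 charset.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a word-list loader; this is not Lucene's WordlistLoader.
class Utf8WordlistSketch {

    // Reads one word per line, always decoding the bytes as UTF-8 rather than
    // with the default charset that FileReader would use.
    static Set<String> loadWords(InputStream in) {
        Set<String> words = new LinkedHashSet<>();
        try (LineNumberReader lnr = new LineNumberReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = lnr.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word); // blank lines are skipped
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java Utf8WordlistSketch <wordfile>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(loadWords(in));
        }
    }
}
```

Taking an InputStream rather than a File keeps the decoding step separate from where the bytes come from, so the same method works for files, classpath resources, or test data.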
Re: Search Chinese in Unicode !!!
I don't have a document with Chinese characters to verify this, but it looks right, so I'll add your change to SearchFiles.java.

Thanks,
Otis

--- Eric Chow <[EMAIL PROTECTED]> wrote:
> The modified code in SearchFiles.java:
>
> BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
Re: Search Chinese in Unicode !!!
Hi Safarnejad,

Would you please send me a copy of your code? zhousp#gmail.com

Thanks :)

On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) <[EMAIL PROTECTED]> wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by
> Erik Peterson.
> [...]
> So anyone who wants it, can have it. Just shoot me an email.

--
This mail is for the mailing list only. Please send any private mail to [EMAIL PROTECTED]
Re: Search Chinese in Unicode !!!
Hi Eric,

If you can read Chinese directly, please refer to this blog: http://blog.csdn.net/accesine960

Or search for weblucene at www.sf.net. It is a project based on Lucene, written by a Chinese developer named chedong; his web site is www.chedong.com.

Good luck.

Eric Chow <[EMAIL PROTECTED]> wrote:
> How to create index with chinese (in utf-8 encoding ) HTML and search with Lucene ?
Re: Search Chinese in Unicode !!!
I want that Chinese Analyzer!!

On Fri, 21 Jan 2005 17:36:17 +0100, Safarnejad, Ali (AFIS) <[EMAIL PROTECTED]> wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by
> Erik Peterson.
> [...]
> So anyone who wants it, can have it. Just shoot me an email.
Re: Search Chinese in Unicode !!!
Hi,

I have done some work on Chinese text search. The main problem is how to separate the words, because in Chinese there is no white space between words.

The typical commercial search engines these days use a dictionary-based approach. That is, they look through the Chinese text and find the words that are in the dictionary. For those characters that do not match any word in the dictionary, you could use a bi-gram based approach. Say the text is a b c; you could index it as 2 (pseudo) words, ab and bc.

I think a pure bi-gram based approach is not good for a relatively large Chinese text collection, as you end up with many pseudo terms that are not actual words.

Cheers,
Jian

On Fri, 21 Jan 2005 18:55:56 +0100, Safarnejad, Ali (AFIS) <[EMAIL PROTECTED]> wrote:
> The ChineseAnalyzer tokenizes based on some english stopwords. The
> CJKAnalyzer is not much more sophisticated for Chinese Analysis (2 byte
> tokenizing). The analyzer I just sent you (using Erik Peterson's
> segmenter:), looks up three dictionaries to segment the chinese text, based
> on real word matches.
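The bi-gram scheme described in this message (a b c indexed as ab, bc) can be sketched in a few lines. This is only an illustration: the class name BigramSketch and the method bigrams are hypothetical, it is not the tokenizer of any Lucene analyzer, and it assumes every character fits in a single Java char (no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only; the class and method names are hypothetical.
class BigramSketch {

    // Emits every pair of adjacent characters as one pseudo-word.
    // A single-character input is returned as-is so that it is not lost.
    static List<String> bigrams(String text) {
        List<String> terms = new ArrayList<>();
        if (text.length() == 1) {
            terms.add(text);
            return terms;
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            terms.add(text.substring(i, i + 2));
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(bigrams("abc")); // prints [ab, bc]
        // Three Chinese characters yield two overlapping pseudo-words.
        System.out.println(bigrams("\u4e2d\u56fd\u4eba").size()); // prints 2
    }
}
```

A text of n characters produces n - 1 overlapping terms, which is where the many pseudo terms that are not actual words come from.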
RE: Search Chinese in Unicode !!!
The ChineseAnalyzer tokenizes based on some English stopwords. The CJKAnalyzer is not much more sophisticated for Chinese analysis (2 byte tokenizing). The analyzer I just sent you (using Erik Peterson's segmenter :) looks up three dictionaries to segment the Chinese text, based on real word matches.

-----Original Message-----
From: news [mailto:[EMAIL PROTECTED]] On Behalf Of aurora
Sent: 21 January 2005 18:29
To: lucene-user@jakarta.apache.org
Subject: Re: Search Chinese in Unicode !!!

I would love to give it a try. Please email me at aurora00 at gmail.com.
Thanks!

Also what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually said the StandardAnalyzer works better. I wonder what's the pros and cons.
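For readers unfamiliar with dictionary-based segmentation, here is a generic sketch of forward maximum matching. It is an assumption-laden illustration and NOT the segmenter or the analyzer discussed in this message: the class DictionarySegmenterSketch, its single in-memory dictionary, and the fall-back to single characters are all hypothetical choices. In the usage example the dictionary holds three words (Beijing, university, Peking University) written as Unicode escapes so the source stays plain ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Generic forward maximum matching; NOT the segmenter discussed in the thread.
class DictionarySegmenterSketch {

    private final Set<String> dictionary;
    private final int maxWordLength;

    DictionarySegmenterSketch(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // At each position, take the longest dictionary word that starts there;
    // if none matches, emit the single character and move on.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u5317\u4eac",               // "Beijing"
                "\u5927\u5b66",               // "university"
                "\u5317\u4eac\u5927\u5b66")); // "Peking University"
        DictionarySegmenterSketch segmenter = new DictionarySegmenterSketch(dict);
        // Input: "I am at Peking University" (six characters).
        List<String> tokens = segmenter.segment("\u6211\u5728\u5317\u4eac\u5927\u5b66");
        System.out.println(tokens.size()); // prints 3
    }
}
```

The longest match wins, so the four-character word is kept whole instead of being split into the two shorter dictionary words it contains.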
Re: Search Chinese in Unicode !!!
I would love to give it a try. Please email me at aurora00 at gmail.com. Thanks!

Also, what is the opinion on the CJKAnalyzer and ChineseAnalyzer? Some people actually said the StandardAnalyzer works better. I wonder what the pros and cons are.

> I've written a Chinese Analyzer for Lucene that uses a segmenter written by
> Erik Peterson.
> [...]
> So anyone who wants it, can have it. Just shoot me an email.
RE: Search Chinese in Unicode !!!
If you are hosting the code somewhere (e.g. your site, SF, java.net, etc.), we should link to it from one of the Lucene pages where we link to related external tools, apps, and such.

Otis

--- "Safarnejad, Ali (AFIS)" <[EMAIL PROTECTED]> wrote:
> I've written a Chinese Analyzer for Lucene that uses a segmenter written by
> Erik Peterson. However, as the author of the segmenter does not want his code
> released under apache open source license (although his code _is_
> opensource), I cannot place my work in the Lucene Sandbox.
RE: Search Chinese in Unicode !!!
I've written a Chinese Analyzer for Lucene that uses a segmenter written by Erik Peterson. However, as the author of the segmenter does not want his code released under the Apache open source license (although his code _is_ open source), I cannot place my work in the Lucene Sandbox. This is unfortunate, because I believe the analyzer works quite well in indexing and searching Chinese docs in GB2312 and UTF-8 encoding, and I would like more people to test, use, and confirm this. So anyone who wants it can have it. Just shoot me an email.

BTW, I have also written an Arabic analyzer, which is collecting dust for similar reasons.

Good luck,

Ali Safarnejad
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 11:42, Eric Chow wrote:

> Search not really correct with UTF-8 !!!

Lucene works just fine with any flavor of Unicode as long as _your_ application knows how to consistently deal with Unicode as well. Remember: the world is not just one Big5 pile.

As far as Analyzer goes, you may or may not be better off using something more tailored to your linguistic needs. That said, even the default Analyzer does a fairly decent job at handling non-roman languages. YMMV.

Cheers

--
PA
http://alt.textdrive.com/
Re: Search Chinese in Unicode !!!
Search not really correct with UTF-8 !!!

The following is the search result that I got using SearchFiles in the Lucene demo.

d:\Downloads\Softwares\Apache\Lucene\lucene-1.4.3\src>java org.apache.lucene.demo.SearchFiles c:\temp\myindex
Usage: java SearchFiles
Query: ç
Searching for: g                 <<<<<<<<<<<< strange ??
3 total matching documents
0. ../docs/ChineseDemo.html      <<<<<<<<<<<< this file contains the ç
   -
1. ../docs/luceneplan.html
   - Jakarta Lucene - Plan for enhancements to Lucene
2. ../docs/api/index-all.html
   - Index (Lucene 1.4.3 API)
Query:

From the above result, only ChineseDemo.html includes the character that I want to search for!

The modified code in SearchFiles.java:

BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
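Why the charset passed to InputStreamReader matters can be shown without a console. The sketch below is an illustration under stated assumptions, not part of SearchFiles: the class ConsoleDecodingSketch and its readLine helper are hypothetical, and a byte array stands in for System.in. It decodes the same UTF-8 bytes once as UTF-8 and once as ISO-8859-1.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a byte array stands in for the bytes arriving on System.in.
class ConsoleDecodingSketch {

    // Reads one line from the raw bytes, decoding them with the given charset,
    // in the same way the modified SearchFiles line decodes System.in.
    static String readLine(byte[] raw, Charset charset) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charset))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One Chinese character (U+4E2D) followed by a newline, encoded as UTF-8.
        byte[] raw = "\u4e2d\n".getBytes(StandardCharsets.UTF_8);

        String right = readLine(raw, StandardCharsets.UTF_8);
        String wrong = readLine(raw, StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // prints 1
        System.out.println(wrong.length()); // prints 3
    }
}
```

With the wrong charset the three UTF-8 bytes turn into three unrelated characters, so the query no longer matches what was indexed. Note that specifying "UTF-8" for System.in only helps if the console really delivers UTF-8 bytes, which depends on its code page.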
Re: Search Chinese in Unicode !!!
On Jan 21, 2005, at 4:49 AM, Eric Chow wrote:

> How to create index with chinese (in utf-8 encoding ) HTML and search with Lucene ?

Indexing and searching Chinese is basically no different than using English with Lucene. We covered a bit about it in Lucene in Action: http://www.lucenebook.com/search?query=chinese

And a screenshot here: http://www.blogscene.org/erik/LuceneInAction/i18n.html

The main issues in dealing with Chinese, and of course other languages, are encoding concerns when reading in the text, during both indexing and querying, and analysis (as you can see from the screenshot). Lucene itself works with Unicode fine and you're free to index anything.

Erik