My problem is that Nutch returns nothing when I search with any Chinese keywords,
even though I can find these Chinese keywords in the index files (using Luke).

From: "Jason Tang" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
Date: Thu, 17 Mar 2005 11:08:15 +0800

Hi cao

I think the character "的" is a stopword in Chinese.
I think NutchAnalysis.jj should load a different stopwords file when the
language is different.

/Jack
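
Jack's suggestion of a per-language stopword list could be sketched roughly as below. This is only an illustration: the class `LanguageStopwords`, its methods, and the tiny word lists are hypothetical and not part of Nutch, and a real setup would presumably read each list from a per-language stopwords file. The character "的" is written as the escape `\u7684` so the source compiles regardless of file encoding.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper (not a Nutch class): selects a stopword set by language code.
class LanguageStopwords {
    private static final Map<String, Set<String>> STOPWORDS = new HashMap<>();

    static {
        // Illustrative lists only; real lists would be loaded from files.
        STOPWORDS.put("en", new HashSet<>(Arrays.asList("a", "the", "of")));
        // U+7684 is the Chinese particle "de" discussed in this thread.
        STOPWORDS.put("zh", new HashSet<>(Arrays.asList("\u7684")));
    }

    // Returns the stopword set for a language, or an empty set if none is registered.
    static Set<String> forLanguage(String lang) {
        Set<String> words = STOPWORDS.get(lang);
        return words == null ? Collections.<String>emptySet() : words;
    }

    // True if the token is a stopword in the given language.
    static boolean isStopword(String lang, String token) {
        return forLanguage(lang).contains(token);
    }

    public static void main(String[] args) {
        System.out.println(isStopword("zh", "\u7684"));
        System.out.println(isStopword("en", "\u7684"));
    }
}
```

With this shape, a token is dropped only if it is a stopword in the language of the text being analyzed, so an English list would not affect Chinese terms and vice versa.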
======= At 2005-03-17, 10:27:40 you wrote: =======
>No answer for this?
>Any tips are appreciated.
>
>>From: "cao yuzhong" <[EMAIL PROTECTED]>
>>Reply-To: [EMAIL PROTECTED]
>>To: [EMAIL PROTECTED]
>>CC: [EMAIL PROTECTED]
>>Subject: A problem about Chinese word segment
>>Date: Tue, 15 Mar 2005 05:16:30 +0000
>>
>>hi,all
>>
>>Now, Nutch-0.6 simply treats each Chinese character as a single token.
>>I have attempted to make it treat a group of related Chinese
>>characters (a Chinese word) as one token,
>>so I need to modify the Analyzer.
>>
>>First, I modified the file NutchAnalysis.jj in
>>src/java/net/nutch/analysis.
>>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>>Nutch can
>>treat one or more Chinese characters as a token. Then I used JavaCC
>>to generate the code.
>>
>>Second, I have to segment Chinese text into Chinese words (insert a
>>space between two Chinese words) before indexing so that Nutch can
>>recognize them. I have written a class
>>to do this and I have modified the function refill() in
>>FastCharStream.java:
>>
>>below the line:
>>int charsRead = input.read(buffer, newPosition,
>>buffer.length-newPosition);
>>
>>I added:
>>//----
>>if(charsRead!=-1){
>>
>>String str=new String(buffer,newPosition,charsRead);
>>
>>//do Chinese word segmentation, for example
>>//if str="中文搜索引擎的分词问题"
>>//then str2 will be "中文 搜索引擎 的 分词 问题"
>>String str2 = Spliter.segSentence(str);
>>
>>while(str2.length()>buffer.length-newPosition){ //expand the buffer
>> char[] newBuffer = new char[buffer.length*2];
>> System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>> buffer = newBuffer;
>>}
>>
>>for(int i=0;i<str2.length();i++){
>> buffer[newPosition+i]=str2.charAt(i);
>>}
>>charsRead=str2.length();
>> }
>>//----
>>
>>Third, compiling... , running CrawlTool....
>>Then I used lukeall-0.5 to view the index directory.
>>It's OK: not single Chinese characters but Chinese words have been
>>organized as terms.
>>
>>But when I deploy Nutch in Tomcat5.5 and do the searching test,
>>it can't find anything. What's wrong?
>>
>>I need your hints or you may recommend me some articles about this.
>>
>>Best regards.
>>
>>Cao Yuzhong
>>2005-03-15
>>
>
>-------------------------------------------------------
>SF email is sponsored by - The IT Product Guide
>Read honest & candid reviews on hundreds of IT Products from real users.
>Discover which products truly live up to the hype. Start reading now.
>http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>_______________________________________________
>Nutch-developers mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/nutch-developers
= = = = = = = = = = = = = = = = = = = =
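
For reference, the refill() change described in the quoted message (segment the freshly read chars, grow the buffer if the segmented text no longer fits, then copy it back) can be sketched as a standalone class. This is a minimal sketch under stated assumptions: `SegmentingBuffer` and `segmentInPlace` are hypothetical names, not Nutch's FastCharStream API, and the segmenter is passed in as a function because the real word-segmentation class is not shown in the thread.

```java
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the buffer handling inside refill(); not Nutch code.
class SegmentingBuffer {
    char[] buffer;   // backing buffer, grown on demand
    int charsRead;   // number of valid chars starting at newPosition after segmenting

    SegmentingBuffer(char[] buffer) {
        this.buffer = buffer;
    }

    // Replaces buffer[newPosition, newPosition + read) with its segmented form.
    void segmentInPlace(int newPosition, int read, UnaryOperator<String> segmenter) {
        String original = new String(buffer, newPosition, read);
        String segmented = segmenter.apply(original);

        // Grow by doubling until the segmented text fits after newPosition.
        int needed = newPosition + segmented.length();
        if (needed > buffer.length) {
            int newLength = Math.max(buffer.length, 1);
            while (newLength < needed) {
                newLength *= 2;
            }
            char[] newBuffer = new char[newLength];
            // Only the chars before newPosition must be preserved.
            System.arraycopy(buffer, 0, newBuffer, 0, newPosition);
            buffer = newBuffer;
        }

        // Overwrite the freshly read region with the segmented text.
        segmented.getChars(0, segmented.length(), buffer, newPosition);
        charsRead = segmented.length();
    }

    public static void main(String[] args) {
        SegmentingBuffer sb = new SegmentingBuffer("xxabcd".toCharArray());
        // Toy segmenter (assumption): puts a space between every char.
        sb.segmentInPlace(2, 4, s -> String.join(" ", s.split("")));
        System.out.println(new String(sb.buffer, 0, 2 + sb.charsRead));
    }
}
```

One caveat with segmenting inside refill(): the reader delivers text in buffer-sized chunks, so a word that straddles two reads may be split at the chunk boundary. Segmenting the whole text before it reaches the character stream would avoid that, at the cost of holding the full text in memory.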
