Re: [Nutch-dev] RE: A problem about Chinese word segment

Jason Tang Wed, 16 Mar 2005 19:12:10 -0800

Hi cao

I think the character "的" is a stopword in Chinese.
NutchAnalysis.jj should probably load a different stopword file depending on 
the language.
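
To make the idea concrete, here is a minimal sketch of per-language stopword loading. It is not Nutch code: the class StopwordRegistry, its methods, and the one-word-per-line file format are all assumptions made for illustration, and wiring such a lookup into NutchAnalysis.jj is left open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): one stopword set per language code. */
class StopwordRegistry {
    private final Map<String, Set<String>> byLanguage = new HashMap<>();

    /**
     * Reads one stopword per line from the given source.
     * Blank lines and lines starting with '#' are skipped (assumed file format).
     */
    void register(String language, Reader source) throws IOException {
        Set<String> words = new HashSet<>();
        BufferedReader in = new BufferedReader(source);
        String line;
        while ((line = in.readLine()) != null) {
            String word = line.trim();
            if (!word.isEmpty() && !word.startsWith("#")) {
                words.add(word);
            }
        }
        byLanguage.put(language, words);
    }

    /** Returns true only if the token is listed for that language. */
    boolean isStopword(String language, String token) {
        return byLanguage.getOrDefault(language, Collections.emptySet()).contains(token);
    }
}
```

With something like this, "的" would be dropped only when the text is known to be Chinese, and the same lookup would have to be applied both at indexing time and at query time.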


/Jack 


  
======= At 2005-03-17, 10:27:40 you wrote: =======

>No answer to this?
>Any tips are appreciated.
>
>>From: "cao yuzhong" <[EMAIL PROTECTED]>
>>Reply-To: [EMAIL PROTECTED]
>>To: [EMAIL PROTECTED]
>>CC: [EMAIL PROTECTED]
>>Subject: A problem about Chinese word segment
>>Date: Tue, 15 Mar 2005 05:16:30 +0000
>>
>>Hi all,
>>
>>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
>>I have tried to make it treat a group of related Chinese 
>>characters (a Chinese word) as one token.
>>So I need to modify the Analyzer.
>>
>>First, I modified the file NutchAnalysis.jj in 
>>src/java/net/nutch/analysis.
>>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that 
>>Nutch can
>>treat one or more Chinese characters as a token. Then I used JavaCC 
>>to generate the code.
>>
>>Second, I have to segment Chinese text into Chinese words (insert a 
>>space between two Chinese words) before indexing so that Nutch can 
>>recognize them. I have written a class
>>to do this, and I have modified the function refill() in 
>>FastCharStream.java:
>>
>>Below the line:
>>int charsRead = input.read(buffer, newPosition, 
>>buffer.length-newPosition);
>>
>>I added:
>>//----
>>if (charsRead != -1) {
>>    String str = new String(buffer, newPosition, charsRead);
>>
>>    // Do Chinese word segmentation. For example,
>>    // if str = "中文搜索引擎的分词问题"
>>    // then str2 will be "中文 搜索引擎 的 分词 问题"
>>    String str2 = Spliter.segSentence(str);
>>
>>    // Expand the buffer until the segmented text fits
>>    while (str2.length() > buffer.length - newPosition) {
>>        char[] newBuffer = new char[buffer.length * 2];
>>        System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>>        buffer = newBuffer;
>>    }
>>
>>    // Overwrite the raw characters with the segmented text
>>    str2.getChars(0, str2.length(), buffer, newPosition);
>>    charsRead = str2.length();
>>}
>>//----
>>
>>Third, I compiled the code and ran CrawlTool.
>>Then I used lukeall-0.5 to view the index directory.
>>It looks OK: Chinese words, not single Chinese characters, have been 
>>stored as terms.
>>
>>But when I deploy Nutch in Tomcat 5.5 and run a test search,
>>it can't find anything. What's wrong?
>>
>>I need your hints, or perhaps you could recommend some articles about this.
>>
>>Best regards.
>>
>>Cao Yuzhong
>>2005-03-15
>>
>>
>
>
>
>
