Re: [Nutch-dev] RE: A problem about Chinese word segment

cao yuzhong Wed, 16 Mar 2005 21:50:30 -0800

I have already added Chinese stopwords to String[] STOP_WORDS in NutchAnalysis.jj. My problem is that Nutch returns nothing when I search with any Chinese keyword, even though I can find these keywords in the index files (using Luke).

From: "Jason Tang" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
Date: Thu, 17 Mar 2005 11:08:15 +0800
Hi cao,
I think the character "的" is a stopword in Chinese. I think NutchAnalysis.jj should load a different stopwords file depending on the language.


/Jack
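
To illustrate the suggestion above, here is a minimal sketch of choosing a stop-word set by language code. It is not Nutch code: the class name LanguageStopWords, the language codes, and the tiny word lists are all assumptions made for the example, and a real setup would presumably read each list from its own file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): one stop-word set per language code. */
final class LanguageStopWords {
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();

    static {
        // Illustrative lists only; real lists would be loaded from per-language files.
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("a", "the", "of", "and")));
        STOP_WORDS.put("zh", new HashSet<>(Arrays.asList("的", "了", "是")));
    }

    /** Returns true if term is a stop word for lang; unknown languages have no stop words. */
    static boolean isStopWord(String lang, String term) {
        Set<String> words = STOP_WORDS.get(lang);
        return words != null && words.contains(term);
    }
}
```

With this layout, "的" is filtered only when the text is known to be Chinese, and an unknown language code filters nothing.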

======= At 2005-03-17, 10:27:40 you wrote: =======

>No answer for this?
>Any tips are appreciated.
>
>>From: "cao yuzhong" <[EMAIL PROTECTED]>
>>Reply-To: [EMAIL PROTECTED]
>>To: [EMAIL PROTECTED]
>>CC: [EMAIL PROTECTED]
>>Subject: A problem about Chinese word segment
>>Date: Tue, 15 Mar 2005 05:16:30 +0000
>>
>>Hi all,
>>
>>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
>>I have tried to make it treat a group of related Chinese
>>characters (a Chinese word) as one token,
>>so I need to modify the Analyzer.
>>
>>First, I modified the file NutchAnalysis.jj in
>>src/java/net/nutch/analysis.
>>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>>Nutch can treat a run of one or more Chinese characters as a single
>>token. Then I used JavaCC to regenerate the code.
>>
>>Second, I have to segment Chinese text into Chinese words (inserting a
>>space between adjacent words) before indexing so that Nutch can
>>recognize them. I have written a class
>>to do this, and I have modified the function refill() in
>>FastCharStream.java:
>>
>>Below the line:
>>int charsRead = input.read(buffer, newPosition,
>>                           buffer.length - newPosition);
>>
>>I added:
>>//----
>>if (charsRead != -1) {
>>  String str = new String(buffer, newPosition, charsRead);
>>
>>  // Do Chinese word segmentation. For example,
>>  // if str is "中文搜索引擎的分词问题"
>>  // then str2 will be "中文 搜索引擎 的 分词 问题"
>>  String str2 = Spliter.segSentence(str);
>>
>>  // Expand the buffer until the segmented text fits
>>  while (str2.length() > buffer.length - newPosition) {
>>    char[] newBuffer = new char[buffer.length * 2];
>>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>>    buffer = newBuffer;
>>  }
>>
>>  // Overwrite the chunk just read with its segmented form
>>  for (int i = 0; i < str2.length(); i++) {
>>    buffer[newPosition + i] = str2.charAt(i);
>>  }
>>  charsRead = str2.length();
>>}
>>//----
>>
>>Third, I compiled the code and ran CrawlTool.
>>Then I used lukeall-0.5 to view the index directory.
>>It looks OK: Chinese words, not single Chinese characters, have been
>>indexed as terms.
>>
>>But when I deploy Nutch in Tomcat 5.5 and run a search test,
>>it can't find anything. What's wrong?
>>
>>I would appreciate any hints, or recommendations of articles about this.
>>
>>Best regards.
>>
>>Cao Yuzhong
>>2005-03-15
>>
>>
>
>
>
>
>-------------------------------------------------------
>SF email is sponsored by - The IT Product Guide
>Read honest & candid reviews on hundreds of IT Products from real users.
>Discover which products truly live up to the hype. Start reading now.
>http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>_______________________________________________
>Nutch-developers mailing list
>[email protected]
>https://lists.sourceforge.net/lists/listinfo/nutch-developers

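The quoted message does not show how Spliter.segSentence works, so the following is only a rough sketch of one common approach, dictionary-based forward maximum matching, which inserts a space between the words it finds. The class name SimpleSegmenter and the small dictionary in the usage note are assumptions for illustration, not the code used in the thread.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the Spliter class mentioned in the thread. */
final class SimpleSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    SimpleSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    /** Returns the sentence with a single space between the segmented words. */
    String segSentence(String sentence) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < sentence.length()) {
            int end = Math.min(sentence.length(), pos + maxWordLength);
            // Try the longest candidate first and shrink until the dictionary
            // matches; a single character is always accepted as a fallback.
            while (end > pos + 1 && !dictionary.contains(sentence.substring(pos, end))) {
                end--;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(sentence, pos, end);
            pos = end;
        }
        return out.toString();
    }
}
```

With a dictionary containing "中文", "搜索引擎", "分词" and "问题", calling segSentence("中文搜索引擎的分词问题") gives "中文 搜索引擎 的 分词 问题", which matches the example in the quoted code comment. One thing to keep in mind when calling a segmenter from refill(): each call only sees the chunk that was just read, so a word that straddles two reads may be split.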
= = = = = = = = = = = = = = = = = = = =

