No answer to this yet?
Any tips are appreciated.

From: "cao yuzhong" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
Subject: A problem about Chinese word segment
Date: Tue, 15 Mar 2005 05:16:30 +0000

Hi all,

Currently, Nutch-0.6 simply treats each Chinese character as a single token.
I have been trying to make it treat a group of related Chinese characters (a Chinese word) as one token.
To do that, I need to modify the Analyzer.


First, I modified the file NutchAnalysis.jj in src/java/net/nutch/analysis.
I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that Nutch can
treat one or more Chinese characters as a token. Then I used JavaCC to regenerate the code.
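
To illustrate the effect of that grammar change, here is a minimal sketch that imitates the two token rules with java.util.regex. It is not the JavaCC grammar itself: the class CjkTokenDemo and its patterns are hypothetical, and \p{IsHan} is only an approximation of whatever character ranges <CJK> covers in NutchAnalysis.jj.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CjkTokenDemo {
    // Hypothetical stand-ins for the two grammar rules. The real tokens are
    // defined in NutchAnalysis.jj; \p{IsHan} only approximates <CJK>.
    static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");  // like <SIGRAM: <CJK> >
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");    // like <SIGRAM: (<CJK>)+ >

    // Collect every match of the pattern in the text, in order.
    static List<String> tokens(Pattern pattern, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Text that has already been segmented with spaces between words.
        String segmented = "中文 搜索引擎 的 分词 问题";
        System.out.println(tokens(SINGLE, segmented)); // one token per character
        System.out.println(tokens(RUN, segmented));    // one token per space-delimited word
    }
}
```

With the one-or-more rule, an unbroken run of Chinese characters becomes a single token, so the word boundaries have to be supplied as spaces before the text reaches the tokenizer.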


Second, I have to segment Chinese text into Chinese words (inserting a space between adjacent words) before indexing so that Nutch can recognize them. I have written a class
to do this, and I have modified the function refill() in FastCharStream.java:


Below the line:
int charsRead = input.read(buffer, newPosition, buffer.length - newPosition);


I added:
//----
if (charsRead != -1) {

    String str = new String(buffer, newPosition, charsRead);

    // Do Chinese word segmentation. For example,
    // if str = "中文搜索引擎的分词问题"
    // then str2 will be "中文 搜索引擎 的 分词 问题"
    String str2 = Spliter.segSentence(str);

    while (str2.length() > buffer.length - newPosition) {  // expand the buffer
        char[] newBuffer = new char[buffer.length * 2];
        System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
        buffer = newBuffer;
    }

    for (int i = 0; i < str2.length(); i++) {
        buffer[newPosition + i] = str2.charAt(i);
    }
    charsRead = str2.length();
}
//----
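
The same step can be written as a standalone method, which makes it easier to test outside Nutch. The sketch below is only an illustration under stated assumptions: SegmentingRefill, Result, segmentInto and spaceOut are hypothetical names, and spaceOut is a toy segmenter that stands in for Spliter.segSentence, whose real behavior is not shown here.

```java
import java.util.function.UnaryOperator;

public class SegmentingRefill {
    /** The (possibly reallocated) buffer and the number of valid chars written at newPosition. */
    static final class Result {
        final char[] buffer;
        final int charsRead;

        Result(char[] buffer, int charsRead) {
            this.buffer = buffer;
            this.charsRead = charsRead;
        }
    }

    /**
     * Re-segments the chars just read into buffer[newPosition .. newPosition + charsRead)
     * and writes the segmented text back, growing the buffer if it no longer fits.
     */
    static Result segmentInto(char[] buffer, int newPosition, int charsRead,
                              UnaryOperator<String> segmenter) {
        String raw = new String(buffer, newPosition, charsRead);
        String segmented = segmenter.apply(raw);

        int needed = newPosition + segmented.length();
        if (needed > buffer.length) {
            int newLength = Math.max(buffer.length, 1);
            while (newLength < needed) {
                newLength *= 2; // double until the segmented text fits
            }
            char[] grown = new char[newLength];
            // Only the chars before newPosition must survive; the rest is overwritten.
            System.arraycopy(buffer, 0, grown, 0, newPosition);
            buffer = grown;
        }
        segmented.getChars(0, segmented.length(), buffer, newPosition);
        return new Result(buffer, segmented.length());
    }

    /** Toy segmenter (assumption): puts a space between every pair of chars. */
    static String spaceOut(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(s.charAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char[] buffer = {'x', 'a', 'b', 'c'}; // 'x' is older data, "abc" was just read
        Result r = segmentInto(buffer, 1, 3, SegmentingRefill::spaceOut);
        System.out.println(new String(r.buffer, 0, 1 + r.charsRead));
    }
}
```

One caveat with segmenting each chunk as it is read: a word that straddles two reads is seen in two pieces, so the segmenter may split it at the chunk boundary.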

Third, I compiled the code and ran CrawlTool.
Then I used lukeall-0.5 to inspect the index directory.
It looks fine: the terms in the index are Chinese words rather than single Chinese characters.


But when I deploy Nutch in Tomcat 5.5 and run a test search,
it can't find anything. What's wrong?

I would appreciate any hints, or recommendations of articles on this topic.

Best regards.

Cao Yuzhong
2005-03-15






_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
