No answer for this? Any tips would be appreciated.
From: "cao yuzhong" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]
Subject: A problem about Chinese word segment
Date: Tue, 15 Mar 2005 05:16:30 +0000
Hi all,
Currently, Nutch-0.6 simply treats each Chinese character as a single token.
I have tried to make it treat groups of related Chinese characters (Chinese words) as tokens.
To do that, I need to modify the Analyzer.
First, I modified the file NutchAnalysis.jj in src/java/net/nutch/analysis.
I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that Nutch can
treat one or more Chinese characters as a token. Then I used JavaCC to generate the code.
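To illustrate why this grammar change alone is not enough, here is a minimal sketch in plain Java. It does not use the JavaCC-generated tokenizer: the regular expression \p{IsHan}+ is only an assumed stand-in for the (<CJK>)+ rule, and the class name CjkRunDemo is hypothetical. It shows that an unsegmented run of Chinese characters comes out as one long token, while the same text with spaces inserted comes out as separate words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class CjkRunDemo {
    // Assumed stand-in for the modified SIGRAM rule:
    // one or more consecutive Han characters form a single token.
    private static final Pattern CJK_RUN = Pattern.compile("\\p{IsHan}+");

    // Returns every maximal run of Han characters found in the text.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = CJK_RUN.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Unsegmented text: the whole run becomes one token.
        System.out.println(tokens("中文搜索引擎的分词问题"));
        // Text with spaces between words: each word becomes its own token.
        System.out.println(tokens("中文 搜索引擎 的 分词 问题"));
    }
}
```

The first call yields a single token containing the whole sentence, and the second yields five tokens, which is why the text has to be segmented before it reaches the tokenizer.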
Second, I have to segment Chinese text into Chinese words (inserting a space between adjacent words) before indexing so that Nutch can recognize them. I have written a class
to do this, and I have modified the function refill() in FastCharStream.java:
Below the line:

    int charsRead = input.read(buffer, newPosition, buffer.length - newPosition);

I added:

    //----
    if (charsRead != -1) {
      String str = new String(buffer, newPosition, charsRead);
      // Do Chinese word segmentation. For example,
      // if str = "中文搜索引擎的分词问题"
      // (roughly "the word segmentation problem of Chinese search engines")
      // then str2 will be "中文 搜索引擎 的 分词 问题"
      String str2 = Spliter.segSentence(str);
      while (str2.length() > buffer.length - newPosition) {
        // expand the buffer
        char[] newBuffer = new char[buffer.length * 2];
        System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
        buffer = newBuffer;
      }
      // copy the segmented text back into the buffer
      for (int i = 0; i < str2.length(); i++) {
        buffer[newPosition + i] = str2.charAt(i);
      }
      charsRead = str2.length();
    }
    //----
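The Spliter class itself is not shown in this message. As a rough illustration of what a segmentation step of this kind might look like, here is a minimal sketch using forward maximum matching against a word list. The class name MaxMatchSegmenter, the dictionary contents, and the choice of algorithm are all assumptions made for illustration; this is not the actual Spliter implementation, and it assumes the input contains only Chinese text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a segmenter such as Spliter; not part of Nutch.
class MaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    MaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Inserts a single space between adjacent words.
    // At each position the longest dictionary word is taken;
    // if none matches, a single character is emitted as a word.
    String segSentence(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxWordLength);
            while (end > i + 1 && !dictionary.contains(text.substring(i, end))) {
                end--;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(text, i, end);
            i = end;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed toy dictionary, chosen to match the example above.
        Set<String> words = new HashSet<>(Arrays.asList("中文", "搜索引擎", "分词", "问题"));
        MaxMatchSegmenter segmenter = new MaxMatchSegmenter(words);
        System.out.println(segmenter.segSentence("中文搜索引擎的分词问题"));
    }
}
```

One design point worth noting: a single call to read() may return a chunk that ends in the middle of a word, so segmenting chunk by chunk inside refill() could split a word at a buffer boundary. Segmenting the complete text before it is handed to the tokenizer would avoid that.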
Third, I compiled the code and ran CrawlTool.
Then I used lukeall-0.5 to view the index directory.
It looks OK: the terms in the index are Chinese words, not single Chinese characters.
But when I deploy Nutch in Tomcat 5.5 and run a test search, it can't find anything. What's wrong?
I would appreciate any hints, or recommendations of articles about this.
Best regards.
Cao Yuzhong 2005-03-15
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
