When I use "北京" as the keyword, the query string is: http://127.0.0.1:8080/search.jsp?query=%E5%8C%97%E4%BA%AC
It returns 0 results.
But when I use the NutchBean class to search for "北京", it returns 23 hits.
There may still be something wrong, though, because there are many blank lines in the output.
The output looks like this:
Total hits: 23
050317 163326 10 found resource common-terms.utf8 at file:/D:/nutch-0.6/conf/common-terms.utf8
0 20050317162844/6
1 20050317162844/21
2 20050317162844/3
3 20050317162844/22
4 20050317162844/c
5 20050317162844/10
6 20050317162844/11
7 20050317162844/19
8 20050317162844/25
9 20050317162844/d
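
One thing that may be worth checking here is how the `query` parameter is decoded before it reaches the search code. This is an assumption about the cause, not something confirmed in this thread. The sketch below decodes the same percent-encoded value once as UTF-8 and once as ISO-8859-1 with `java.net.URLDecoder`; the class name `QueryDecodingCheck` is made up for illustration and is not part of Nutch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Illustrative helper only; not part of Nutch.
final class QueryDecodingCheck {

    // Decodes a percent-encoded query value using the named charset.
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // Value taken from the query string in the message above.
        String encoded = "%E5%8C%97%E4%BA%AC";

        String asUtf8 = decode(encoded, "UTF-8");
        String asLatin1 = decode(encoded, "ISO-8859-1");

        // Decoded as UTF-8, the six bytes form two Chinese characters.
        System.out.println("UTF-8:      " + asUtf8.length() + " chars");
        // Decoded as ISO-8859-1, each byte becomes its own character.
        System.out.println("ISO-8859-1: " + asLatin1.length() + " chars");
    }
}
```

If the ISO-8859-1 form is what the search code receives, its terms cannot match the Chinese terms in the index, which would be consistent with NutchBean finding hits while search.jsp finds none. For GET requests the decoding is normally controlled by the servlet container (in Tomcat, the connector's `URIEncoding` attribute), so that setting may be worth a look.
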
From: "Jason Tang" <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Subject: Re: Re: [Nutch-dev] RE: A problem about Chinese word segment
Date: Thu, 17 Mar 2005 15:13:37 +0800

Weird! Nutch supports searching for Chinese characters.
Can you print your query string in search.jsp? NOTE: the page should be encoded in UTF-8.

/Jack

======= At 2005-03-17, 13:49:00 you wrote: =======

>I have added Chinese stopwords to String[] STOP_WORDS in NutchAnalysis.jj.
>My problem is that Nutch returns nothing when I use any Chinese keywords,
>even though I can find these Chinese keywords in the index files (using
>Luke).
>
>
>>From: "Jason Tang" <[EMAIL PROTECTED]>
>>Reply-To: [EMAIL PROTECTED]
>>To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>Subject: Re: [Nutch-dev] RE: A problem about Chinese word segment
>>Date: Thu, 17 Mar 2005 11:08:15 +0800
>>
>>Hi cao
>>
>>I think the character "的" is a stopword in Chinese.
>>I think NutchAnalysis.jj should load a different stopwords file
>>depending on the language.
>>
>>/Jack
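
The idea of choosing stop words by language could be sketched roughly as follows. This is only an illustration of the suggestion, not how Nutch does it: `StopWordsByLanguage`, the language codes, and the tiny sample word lists are all hypothetical. The Chinese entries are written as Unicode escapes for "的", "了", and "是".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: picks a stop-word set by language code.
final class StopWordsByLanguage {
    private static final Map<String, Set<String>> STOP_WORDS = new HashMap<>();

    static {
        // Tiny sample lists for illustration only.
        STOP_WORDS.put("en", new HashSet<>(Arrays.asList("a", "and", "of", "the")));
        // U+7684 (de), U+4E86 (le), U+662F (shi)
        STOP_WORDS.put("zh", new HashSet<>(Arrays.asList("\u7684", "\u4e86", "\u662f")));
    }

    // Returns true if the token is a stop word in the given language.
    // A language with no registered list is treated as having no stop words.
    static boolean isStopWord(String language, String token) {
        Set<String> words = STOP_WORDS.get(language);
        if (words == null) {
            words = Collections.emptySet();
        }
        return words.contains(token);
    }
}
```

In a real setup the lists would presumably be loaded from per-language files rather than hard-coded, and the language of each document or query would have to be known or detected first.
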
>>
>>
>>
>>======= At 2005-03-17, 10:27:40 you wrote: =======
>>
>> >No answer to this?
>> >Any tips are appreciated.
>> >
>> >>From: "cao yuzhong" <[EMAIL PROTECTED]>
>> >>Reply-To: [EMAIL PROTECTED]
>> >>To: [EMAIL PROTECTED]
>> >>CC: [EMAIL PROTECTED]
>> >>Subject: A problem about Chinese word segment
>> >>Date: Tue, 15 Mar 2005 05:16:30 +0000
>> >>
>> >>Hi all,
>> >>
>> >>Currently, Nutch-0.6 simply treats each Chinese character as a single token.
>> >>I have been trying to make it treat a run of related Chinese
>> >>characters (a Chinese word) as one token,
>> >>so I need to modify the Analyzer.
>> >>
>> >>First, I modified the file NutchAnalysis.jj in
>> >>src/java/net/nutch/analysis.
>> >>I changed " <SIGRAM: <CJK> > " to " <SIGRAM: (<CJK>)+ > " so that
>> >>Nutch can
>> >>treat one or more Chinese characters as a token. Then I used JavaCC
>> >>to regenerate the code.
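
To make the effect of that grammar change concrete, here is a small sketch that imitates both token rules with `java.util.regex` instead of JavaCC. It is only an analogy: the `\p{IsHan}` script class stands in for the grammar's `<CJK>` character class (an assumption; the class defined in NutchAnalysis.jj may cover different ranges), and `CjkTokenDemo` is a made-up name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; mimics the two SIGRAM rules with regular expressions.
final class CjkTokenDemo {
    // Analogue of <SIGRAM: <CJK> >: every Han character is its own token.
    static final Pattern SINGLE = Pattern.compile("\\p{IsHan}");
    // Analogue of <SIGRAM: (<CJK>)+ >: an unbroken run of Han characters is one token.
    static final Pattern RUN = Pattern.compile("\\p{IsHan}+");

    // Collects every match of the rule in the text, in order.
    static List<String> tokens(Pattern rule, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = rule.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Two two-character words separated by a space.
        String segmented = "\u4e2d\u6587 \u641c\u7d22";
        System.out.println(tokens(SINGLE, segmented).size() + " tokens with the single-character rule");
        System.out.println(tokens(RUN, segmented).size() + " tokens with the run rule");
    }
}
```

Under the run rule, text without spaces collapses into one long token, which is why the segmentation step described next matters: terms only line up if the indexed text and the searched text are split the same way.
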
>> >>
>> >>Second, I have to segment Chinese text into Chinese words (inserting
>> >>a space between adjacent words) before indexing so that Nutch can
>> >>recognize them. I have written a class
>> >>to do this, and I have modified the function refill() in
>> >>FastCharStream.java:
>> >>
>> >>Below the line:
>> >>int charsRead = input.read(buffer, newPosition,
>> >>                           buffer.length - newPosition);
>> >>
>> >>I added:
>> >>//----
>> >>if (charsRead != -1) {
>> >>  // The text that was just read into the buffer
>> >>  String str1 = new String(buffer, newPosition, charsRead);
>> >>
>> >>  // Do Chinese word segmentation, for example:
>> >>  // if str1 = "中文搜索引擎的分词问题"
>> >>  // (roughly "the word segmentation problem of Chinese search engines")
>> >>  // then str2 will be "中文 搜索引擎 的 分词 问题"
>> >>  String str2 = Spliter.segSentence(str1);
>> >>
>> >>  // Expand the buffer until the segmented text fits
>> >>  while (str2.length() > buffer.length - newPosition) {
>> >>    char[] newBuffer = new char[buffer.length * 2];
>> >>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>> >>    buffer = newBuffer;
>> >>  }
>> >>
>> >>  // Overwrite the raw text with the segmented text
>> >>  for (int i = 0; i < str2.length(); i++) {
>> >>    buffer[newPosition + i] = str2.charAt(i);
>> >>  }
>> >>  charsRead = str2.length();
>> >>}
>> >>//----
>> >>
>> >>Third, I compiled everything and ran CrawlTool.
>> >>Then I used lukeall-0.5 to view the index directory.
>> >>It looks OK: Chinese words, not single Chinese characters, have been
>> >>indexed as terms.
>> >>
>> >>But when I deploy Nutch in Tomcat 5.5 and run a search test,
>> >>it can't find anything. What's wrong?
>> >>
>> >>I need your hints, or perhaps you can recommend some articles about this.
>> >>
>> >>Best regards.
>> >>
>> >>Cao Yuzhong
>> >>2005-03-15
>> >>
>> >>
>> >
>> >
>> >
>> >
>> >-------------------------------------------------------
>> >SF email is sponsored by - The IT Product Guide
>> >Read honest & candid reviews on hundreds of IT Products from real users.
>> >Discover which products truly live up to the hype. Start reading now.
>> >http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
>> >_______________________________________________
>> >Nutch-developers mailing list
>> >[email protected]
>> >https://lists.sourceforge.net/lists/listinfo/nutch-developers
