Hi all.

I have a problem configuring nutch-default.xml. As I am in China, most of the FTP
sites I want to crawl use a Chinese encoding, but when Nutch crawls these FTP
sites it cannot detect the correct charset, and the parse results are
garbled and useless. So I changed the value in
 <property>
 <name>parser.character.encoding.default</name>
 <value>windows-1252</value>
 </property>
to <value>gb2312</value> and got a very interesting result: Nutch can now crawl
the files and directories in the root directory of Chinese FTP sites without any
garbled characters, but it can NOT crawl any files in subdirectories. For those
it just returns "404 not found".
I know there must be something wrong in my config files, but how and where can I
configure Nutch to crawl a Chinese FTP site?
I've been working on this problem for half a month and have found no way to
solve it. Could anyone help me?
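
To show what I mean by garbled results, here is a minimal Java sketch. It is
not Nutch code: the class name CharsetDemo, the decode helper and the sample
bytes are my own hypothetical example, and it assumes a JDK that ships the
GB2312 charset. It only illustrates how the same bytes read differently under
the two encodings. Whether the same mismatch is what breaks the subdirectory
paths is an assumption I have not verified.

```java
import java.nio.charset.Charset;

public class CharsetDemo {
    // Hypothetical helper: decode raw bytes using the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The two characters U+4E2D U+6587 ("Chinese") encoded as GB2312.
        byte[] raw = {(byte) 0xD6, (byte) 0xD0, (byte) 0xCE, (byte) 0xC4};

        // Decoded with the matching charset, the original text comes back.
        System.out.println(decode(raw, "GB2312"));

        // Decoded with the default from nutch-default.xml, each byte becomes
        // an unrelated Latin character, which is the garbling I see.
        System.out.println(decode(raw, "windows-1252"));
    }
}
```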

Thanks

 
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
