Hi all,
I have run into a strange problem configuring Nutch to crawl Chinese FTP
sites. When Nutch crawls an FTP site that has directories named in Chinese, it
gets garbled characters like '?????' and cannot parse or index them correctly.
Since this may be caused by an undefined charset, I checked the config file
/nutch/conf/nutch-default.xml and found a property that may be related to it:
<property>
<name>parser.character.encoding.default</name>
<value>windows-1252</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
I changed <value>windows-1252</value> to <value>gb2312</value> and got a very
interesting result: Nutch can now crawl the files and directories in the root
directory of Chinese FTP sites without any garbled characters, but it can NOT
crawl any files in SUBdirectories; it just returns "404 not found".
I know there must be something wrong in my config files, but how and where can
I configure Nutch to crawl a Chinese FTP site? Should I recompile the
protocol-ftp plugin?
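For reference, this is roughly how the property looks in my
/nutch/conf/nutch-default.xml after the change. Only the value is different;
the name and description are as shipped. I am assuming gb2312 is the right
charset name for these servers, which I have not verified:

```xml
<property>
<name>parser.character.encoding.default</name>
<!-- changed from windows-1252; gb2312 is my assumption for these FTP servers -->
<value>gb2312</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>
```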
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general