Related issue?
http://www.mail-archive.com/[email protected]/msg06135.html

[EMAIL PROTECTED] wrote:
> Hi all.
>
> I have a problem configuring nutch-default.xml. I am in China, and most of the
> FTP sites I want to crawl are encoded in Chinese. When Nutch crawls these
> FTP sites it cannot detect the correct charset, so the parse results are
> incomprehensible and useless. So I changed <property>
>  <name>parser.character.encoding.default</name>
>  <value>windows-1252</value>
>  </property>
> to <value>gb2312</value> and got a very interesting result: Nutch can now
> crawl the files and directories in the root directory of Chinese FTP sites
> without any garbled characters, but it can NOT crawl any files in
> SUBdirectories; it just returns: 404 not found.
> I know there must be something wrong in the config files, but how and where can I
> configure Nutch to crawl a Chinese FTP site?
> I've been working on this problem for half a month and have found no way to solve
> it. Could anyone help me?
>
> thanks
>
>  
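
For reference, local overrides normally go in conf/nutch-site.xml rather than in nutch-default.xml itself. A minimal sketch of the override described in the quoted message, assuming the property name given there and assuming gb2312 really is the charset the target sites use (GBK, a superset of GB2312, may be a better fit if the listings contain characters outside GB2312). The fragment belongs inside the root element of nutch-site.xml. It only sets the fallback encoding and is not known to fix the 404s on subdirectories:

```xml
<!-- Fragment for conf/nutch-site.xml; values here override nutch-default.xml. -->
<property>
  <name>parser.character.encoding.default</name>
  <!-- Assumption: gb2312 matches the encoding of the crawled FTP sites. -->
  <value>gb2312</value>
</property>
```

Keeping the change in nutch-site.xml leaves the shipped defaults untouched, which makes it easier to revert while testing.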


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
