Greetings

I've been using Nutch to crawl and index a rather large and complex
website. I discovered that some of the linked PDF files on the site
didn't come up when searching for keywords that should have matched them.

I did some digging and found that the cause is the URLs of the PDF files.
Some of them contain whitespace and even characters like "ó", "ý", "æ", "þ"
or "ö", none of which are properly percent-encoded, which somehow causes
Nutch, with either the http or httpclient protocol plugin, to fail to
fetch the documents.
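
For what it's worth, here is the kind of encoding I would expect to see
applied. A minimal sketch in plain Java (the URL is made up): the
multi-argument java.net.URI constructor quotes characters that are
illegal in a URI, which the single-argument form would simply reject.

    import java.net.URI;
    import java.net.URL;

    public class EncodeDemo {
        public static void main(String[] args) throws Exception {
            // A made-up URL of the problematic kind: a space plus "ý".
            URL url = new URL("http://example.com/docs/skýrsla 2007.pdf");
            // The multi-argument URI constructor quotes illegal characters;
            // toASCIIString() then percent-encodes the non-ASCII ones as
            // UTF-8 octets.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(),
                              url.getHost(), url.getPort(),
                              url.getPath(), url.getQuery(), url.getRef());
            System.out.println(uri.toASCIIString());
            // Prints: http://example.com/docs/sk%C3%BDrsla%202007.pdf
        }
    }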

So my question is: is there a solution to this on Nutch's end, or do I
need to take measures myself, either by "fixing" this in Nutch or by
getting the webmaster to properly encode every URL linked on the site?
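
In case I do end up patching Nutch myself, here is a rough sketch of how
I imagine a URL normalizer plugin could apply that encoding before
fetching. I'm assuming the URLNormalizer interface from Nutch 0.8/0.9
here (please correct me if it looks different in the current tree), the
class name is my own invention, and it would still need the usual
plugin.xml registration:

    import java.net.MalformedURLException;
    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLNormalizer;

    // Hypothetical plugin: percent-encodes whitespace and non-ASCII
    // characters in URLs before Nutch tries to fetch them.
    public class EncodingURLNormalizer implements URLNormalizer {

        private Configuration conf;

        public void setConf(Configuration conf) { this.conf = conf; }
        public Configuration getConf() { return conf; }

        public String normalize(String urlString, String scope)
                throws MalformedURLException {
            try {
                URL url = new URL(urlString);
                URI uri = new URI(url.getProtocol(), url.getUserInfo(),
                                  url.getHost(), url.getPort(),
                                  url.getPath(), url.getQuery(), url.getRef());
                // Encodes the space, "ó", "ý", etc. as %XX escapes.
                return uri.toASCIIString();
            } catch (URISyntaxException e) {
                // Leave URLs we cannot parse untouched.
                return urlString;
            }
        }
    }

One caveat: this would double-encode URLs that already contain "%"
escapes, so a real version would need to detect those first.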

Best regards,
Árni Hermann Reynisson

