[ 
https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847071#action_12847071
 ] 

Andrzej Bialecki  commented on NUTCH-800:
-----------------------------------------

I'm puzzled by your problem description. Is Nutch actually affected by 
potentially malicious URL data? URL form encoding is just a transport 
encoding; it doesn't make a URL inherently safe (or unsafe).
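
To illustrate, here is a minimal sketch of what URLEncoder.encode does to a 
complete URL string. The class name EncodingDemo and the example.com URL are 
made up for illustration; this is not code from Generator.java. Because form 
encoding also escapes the ':' and '/' delimiters, the encoded string should no 
longer be accepted by java.net.URL at all:

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical demo class, not part of Nutch.
public class EncodingDemo {

    // Form-encodes the entire string, the way the proposed patch would.
    static String encodeWhole(String url) throws UnsupportedEncodingException {
        return URLEncoder.encode(url, "UTF-8");
    }

    // Returns true if java.net.URL accepts the string, false otherwise.
    static boolean parses(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String original = "http://example.com/a b?q=1"; // made-up URL
        String encoded = encodeWhole(original);
        System.out.println(encoded);          // scheme and path delimiters are escaped too
        System.out.println(parses(original)); // URL itself does not reject the raw space
        System.out.println(parses(encoded));  // no "scheme:" prefix left to recognize
    }
}
```

If the goal is only to escape illegal characters in individual components, 
that would have to happen per component (path, query) rather than on the whole 
string.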

> Generator builds a URL list that is not encoded
> -----------------------------------------------
>
>                 Key: NUTCH-800
>                 URL: https://issues.apache.org/jira/browse/NUTCH-800
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 
> 1.0.0, 1.1
>            Reporter: Jesse Campbell
>
> The URL string that is grabbed by the generator when creating the fetch list 
> does not get encoded, could potentially allow unsafe execution, and breaks 
> reading improperly encoded URLs from the scraped pages.
> Since we a) cannot guarantee that any site we scrape is not malicious, and b) 
> likely do not have control over all content providers, we are currently 
> forced to use a regex normalizer to perform the same function as a built-in 
> Java class (it would be unsafe to leave alone).
> A quick solution would be to update Generator.java to utilize the 
> java.net.URLEncoder static class:
> line 187: 
> old: String urlString = url.toString();
> new: String urlString = URLEncoder.encode(url.toString(),"UTF-8");
> line 192:
> old: u = new URL(url.toString());
> new: u = new URL(urlString);
> The use of URLEncoder.encode could also be at the updatedb stage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
