[ 
https://issues.apache.org/jira/browse/NUTCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12567288#action_12567288
 ] 

Andrzej Bialecki  commented on NUTCH-606:
-----------------------------------------

I'm sorry, I should have been clearer. My point was that it's not necessary 
to check for null host names, because AFAIK URL.getHost() never returns null. 
On the other hand, there are legitimate situations in which it returns an 
empty string, so the empty-host check that you added in patch v. 3 is in fact 
harmful: e.g. it would filter out all "file:///" urls.
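
A minimal sketch of the behaviour described above, using only java.net.URL. 
The class HostCheckDemo and its helpers hostOf and passesEmptyHostCheck are 
hypothetical names made up for illustration, and the example urls are 
arbitrary; this is not the code from the patch, only an approximation of 
what an empty-host filter would do.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class HostCheckDemo {

    // Hypothetical helper: returns URL.getHost() for the given url string.
    static String hostOf(String spec) {
        try {
            return new URL(spec).getHost();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Malformed url: " + spec, e);
        }
    }

    // Hypothetical filter that rejects urls whose host is an empty string.
    // This only approximates the kind of check discussed above.
    static boolean passesEmptyHostCheck(String spec) {
        return !hostOf(spec).isEmpty();
    }

    public static void main(String[] args) {
        String[] specs = {
            "http://example.com/index.html",
            "file:///tmp/foo.html"
        };
        for (String spec : specs) {
            System.out.println(spec
                + " -> host='" + hostOf(spec) + "'"
                + ", kept=" + passesEmptyHostCheck(spec));
        }
    }
}
```

With a filter like this, the "file:///" url is dropped even though it is a 
valid url, while a null check on the host would never trigger for either one.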

> Refactoring of Generator, run all urls through checks
> -----------------------------------------------------
>
>                 Key: NUTCH-606
>                 URL: https://issues.apache.org/jira/browse/NUTCH-606
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>         Environment: all
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-606-1-20080208.patch, NUTCH-606-2-20080208.patch, 
> NUTCH-606-3-20080208.patch
>
>
> Refactor the generator to make sure all urls are run through checks such as 
> host and protocol checks, and ip checks if necessary.  Currently the 
> generator only does this for urls if generate.max.per.host > 0, which by 
> default is -1.  So by default all urls will get collected without checks.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
