[ 
https://issues.apache.org/jira/browse/NUTCH-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847153#action_12847153
 ] 

Andrew McCall commented on NUTCH-693:
-------------------------------------

[http://en.wikipedia.org/wiki/Nofollow]

To be honest, I don't think there is any real consensus on this standard. Most 
search engines don't index nofollow links per se, but they do follow them for 
crawling. Even Google, which first proposed nofollow, sometimes does follow 
such links, according to some tests linked in the Wikipedia article. The results 
show that if the link is already in the index (e.g. it has been followed 
elsewhere), then it does get followed and indexed. 

The nofollow is really just a keyword to point out that the link isn't being 
endorsed by the author. It's more a content guideline than a strict order for 
robots to obey. So I disagree that you're breaking standards or creating a 
robot that's not well behaved by ignoring it. 

I would have liked to do a bit more with this, so that I could have respected 
nofollows but injected the URL as a brand new seed URL, but other commitments 
took over and I never got around to it. Since the ideal nofollow behaviour is 
somewhere between ignoring them and not ignoring them, I figured the option to 
ignore them was a good start and submitted the patch, but I'm not precious 
about it.
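
To make the behaviour of the option concrete, here is a minimal sketch of the 
idea. It is not the code from the attached patch and not Nutch's actual parser: 
the class name NofollowSketch, the method extractOutlinks and the regex-based 
tag matching are all hypothetical, chosen only for illustration. The boolean 
argument stands in for the value of parser.html.outlinks.ignore_nofollow; how 
that value would be read from the configuration is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not taken from nutch.nofollow.patch.
final class NofollowSketch {
    // Crude patterns for <a ...> start tags and double-quoted attributes.
    // A real parser would walk a DOM instead of using regular expressions.
    private static final Pattern ANCHOR =
        Pattern.compile("<a\\s+([^>]*)>", Pattern.CASE_INSENSITIVE);
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private static final Pattern REL =
        Pattern.compile("rel\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the href of every anchor. Links marked rel="nofollow" are
    // dropped unless ignoreNofollow is true (assumed to mirror the
    // parser.html.outlinks.ignore_nofollow setting).
    static List<String> extractOutlinks(String html, boolean ignoreNofollow) {
        List<String> outlinks = new ArrayList<>();
        Matcher anchor = ANCHOR.matcher(html);
        while (anchor.find()) {
            String attrs = anchor.group(1);
            Matcher href = HREF.matcher(attrs);
            if (!href.find()) {
                continue; // anchor without a quoted href: nothing to follow
            }
            boolean nofollow = false;
            Matcher rel = REL.matcher(attrs);
            if (rel.find()) {
                // rel holds space-separated tokens, e.g. "external nofollow"
                String[] tokens = rel.group(1).toLowerCase(Locale.ROOT).split("\\s+");
                for (String token : tokens) {
                    if (token.equals("nofollow")) {
                        nofollow = true;
                    }
                }
            }
            if (nofollow && !ignoreNofollow) {
                continue; // default behaviour: respect the nofollow hint
            }
            outlinks.add(href.group(1));
        }
        return outlinks;
    }
}
```

With the flag off, a nofollow link is simply skipped; with it on, the link is 
returned like any other outlink. The in-between behaviour described above 
(respect the nofollow but inject the URL as a fresh seed) would need a third 
path that this sketch does not attempt.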

> Add configurable option for treating nofollow behaviour.
> --------------------------------------------------------
>
>                 Key: NUTCH-693
>                 URL: https://issues.apache.org/jira/browse/NUTCH-693
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Andrew McCall
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: nutch.nofollow.patch
>
>
> For my purposes I'd like to follow links even if they're marked nofollow. 
> Ideally I'd like to follow them, but not pass the link juice between them. 
> I've attached a patch that adds a configuration element 
> parser.html.outlinks.ignore_nofollow which allows the parser to ignore the 
> nofollow elements on a page. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
