[ 
https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671189#comment-16671189
 ] 

Dave Meikle commented on TIKA-2760:
-----------------------------------

Hi [~markus17],

Looking at the Nutch code, I can see that TikaParser has logic to honour the 
setting in the robots metadata.  As this page sets _nofollow_, the parser 
doesn't add the links found by Tika's LinkContentHandler to the outlinks.
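
To illustrate the behaviour described above, here is a minimal sketch. It is 
not the actual Nutch TikaParser code: the class name RobotsMetaSketch and the 
methods allowsFollow and outlinks are hypothetical, and treating "none" as 
implying nofollow is my assumption based on the usual robots meta convention. 
It only shows the general idea of dropping handler-collected links when the 
robots meta value contains nofollow.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Illustrative sketch only: NOT the real Nutch TikaParser implementation.
class RobotsMetaSketch {

    // Returns false when the robots meta content contains "nofollow"
    // (or "none", assumed here to imply nofollow); true otherwise.
    static boolean allowsFollow(String robotsContent) {
        if (robotsContent == null) {
            return true; // no robots meta tag: links may be followed
        }
        List<String> directives = Arrays
                .stream(robotsContent.toLowerCase(Locale.ROOT).split(","))
                .map(String::trim)
                .collect(Collectors.toList());
        return !directives.contains("nofollow") && !directives.contains("none");
    }

    // Keeps the links collected by the handler only when following is allowed.
    static List<String> outlinks(String robotsContent, List<String> linksFromHandler) {
        return allowsFollow(robotsContent)
                ? linksFromHandler
                : Collections.<String>emptyList();
    }

    public static void main(String[] args) {
        List<String> links = Arrays.asList("http://example.com/a", "http://example.com/b");
        System.out.println(outlinks("index, nofollow", links).size()); // prints 0
        System.out.println(outlinks("index", links).size());           // prints 2
    }
}
```

With "index, nofollow" the collected links are discarded, so no outlinks are 
reported even though the handler did find them.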

If you remove the nofollow from the HTML file's robots metadata, i.e. change

{{<meta name="robots" content="index, nofollow" />}}

to

{{<meta name="robots" content="index" />}}

the links should all flow through into Nutch as normal.

Cheers,
Dave

 

> LinkContentHandler does not report hyperlinks
> ---------------------------------------------
>
>                 Key: TIKA-2760
>                 URL: https://issues.apache.org/jira/browse/TIKA-2760
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: TIKA-2760 - Test for Outlinks.diff, TIKA-2760.patch, 
> ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not 
> report any hyperlinks for http://www.ronaldmcdonaldhouse.co.uk/, a copy of 
> which I'll also attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals 
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many 
> elements, including hyperlinks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
