[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667966#comment-16667966 ]
Dave Meikle commented on TIKA-2760:
-----------------------------------

[~markus17] - is it typically the HTML parser being used in Nutch? Using your test with the HTML parser registered gives me 94 links.

> LinkContentHandler does not report hyperlinks
> ---------------------------------------------
>
>                 Key: TIKA-2760
>                 URL: https://issues.apache.org/jira/browse/TIKA-2760
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: TIKA-2760.patch, ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and it does not report
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/ which I'll also
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals
> that only very few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many
> elements, including hyperlinks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
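
For readers who want to repeat the debugging step from the issue description (recording which element names reach startElement, and which of them carry an href), here is a minimal sketch. It is not Tika's LinkContentHandler and not Nutch's parser: the class names ElementTrace and TracingLinkHandler are hypothetical, the input is a small well-formed XHTML string, and the JDK's built-in SAX parser stands in for whichever Tika parser would normally drive the handler. It only shows the shape of the check and does not reproduce the reported behaviour.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ElementTrace {

    /** Hypothetical handler: records every element name and every href found on an a element. */
    static class TracingLinkHandler extends DefaultHandler {
        final List<String> elements = new ArrayList<>();
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // A parser that is not namespace-aware may leave localName empty, so fall back to qName.
            String name = (localName == null || localName.isEmpty()) ? qName : localName;
            elements.add(name);
            if ("a".equalsIgnoreCase(name)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses a well-formed XHTML string with the JDK SAX parser and returns the filled handler. */
    static TracingLinkHandler trace(String xhtml) throws Exception {
        TracingLinkHandler handler = new TracingLinkHandler();
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input; real pages are rarely well-formed XML.
        String xhtml = "<html><body><p><a href=\"http://example.org/a\">A</a></p>"
                + "<div><a href=\"/b\">B</a></div></body></html>";
        TracingLinkHandler h = trace(xhtml);
        System.out.println("elements: " + h.elements);
        System.out.println("links: " + h.links);
    }
}
```

If the same handler were passed to a Tika parser instead, comparing the recorded element list between parser configurations would show whether the missing links come from the handler or from the events the parser emits.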