[ https://issues.apache.org/jira/browse/TIKA-2760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668605#comment-16668605 ]
Markus Jelsma commented on TIKA-2760:
-------------------------------------

Hello [~davemeikle],

I cannot get any links using any HTML parser in Nutch: neither parse-tika nor parse-html produces any outlinks. Did you test using Nutch or the attached unit test? In both cases, I get zero outlinks.

Thanks,
Markus

> LinkContentHandler does not report hyperlinks
> ---------------------------------------------
>
>                 Key: TIKA-2760
>                 URL: https://issues.apache.org/jira/browse/TIKA-2760
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: TIKA-2760.patch, ronaldmcdonald-nolinks.html
>
>
> Nutch uses LinkContentHandler for collecting hyperlinks, and does not report
> any hyperlink for http://www.ronaldmcdonaldhouse.co.uk/, which I'll also
> attach to this ticket.
> Debugging LinkContentHandler to print element names in startElement reveals
> that only a few HTML elements get reported, which I think is incorrect.
> Our own parser in Nutch uses a custom ContentHandler and does report many
> elements, including hyperlinks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
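
For readers following the thread, the debugging step described in the issue (recording the element names seen in startElement and checking whether any `a` elements arrive) can be sketched roughly as below. This is only an illustration under stated assumptions: it uses the JDK's built-in SAX parser on a small well-formed XHTML string, not Tika's HtmlParser or LinkContentHandler, and the class name `ElementTrace` and its `trace` method are hypothetical names made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration; not part of Tika or Nutch.
class ElementTrace extends DefaultHandler {
    // Every element name reported to startElement, in document order.
    final List<String> elements = new ArrayList<>();
    // The href values of all <a> elements that were reported.
    final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Without namespace processing localName is empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        elements.add(name);
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    // Parses a well-formed XHTML string with the JDK SAX parser (an assumption:
    // real-world HTML would need an HTML-aware parser in front of the handler).
    static ElementTrace trace(String xhtml) {
        ElementTrace handler = new ElementTrace();
        try {
            SAXParserFactory.newInstance()
                    .newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        ElementTrace t = trace(
                "<html><body><p><a href=\"http://example.org/\">x</a></p></body></html>");
        System.out.println("elements: " + t.elements);
        System.out.println("links: " + t.links);
    }
}
```

If a handler like this is attached to the parser under test and the `elements` list contains only a handful of names with no `a` entries, the elements are being dropped before they reach the ContentHandler, which matches the symptom reported in the issue.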