[ https://issues.apache.org/jira/browse/NUTCH-1250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-1250:
-----------------------------------
    Fix Version/s: 1.8

> parse-html does not parse links with empty anchor
> -------------------------------------------------
>
>                 Key: NUTCH-1250
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1250
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.4
>            Reporter: Andreas Janning
>             Fix For: 2.3, 1.8
>
>         Attachments: DOMContentUtils_v1.patch, DOMContentUtils_v2.patch,
> TestDomContentUitls_v1.patch
>
>
> The parse-html plugin does not generate an outlink if the link has no anchor
> text. For example, the following HTML code does not create an outlink:
> {code:html}
> <a href="example.com"></a>
> {code}
> The JUnit test TestDOMContentUtils tries to cover this case, but it does not
> actually exercise it, because the <a> tag in the test contains a comment.
> {code:title=TestDOMContentUtils.java|borderStyle=solid}
> new String("<html><head><title> title </title>"
>     + "</head><body>"
>     + "<a href=\"g\"><!--no anchor--></a>"
>     + "<a href=\"g1\"> <!--whitespace--> </a>"
>     + "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>"
>     + "</body></html>"),
> {code}
> When you remove the comment, the test fails.
> {code:title=TestDOMContentUtils.java Test fails|borderStyle=solid}
> new String("<html><head><title> title </title>"
>     + "</head><body>"
>     + "<a href=\"g\"></a>" // no anchor
>     + "<a href=\"g1\"> <!--whitespace--> </a>"
>     + "<a href=\"g2\"> <img src=test.gif alt='bla bla'> </a>"
>     + "</body></html>"),
> {code}
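
For illustration only, here is a minimal, self-contained Java sketch of the behaviour the report asks for: an <a> element with an href yields a link even when its anchor text is empty. This is not the Nutch DOMContentUtils code and not the attached patch. The class EmptyAnchorLinks, the Link holder, and the extract method are hypothetical names, and the sketch assumes well-formed XHTML input parsed with the JDK's DOM parser rather than a lenient HTML parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration; not part of Nutch.
final class EmptyAnchorLinks {

  /** Hypothetical holder for a link target and its (possibly empty) anchor text. */
  static final class Link {
    final String href;
    final String anchor;

    Link(String href, String anchor) {
      this.href = href;
      this.anchor = anchor;
    }
  }

  /** Collects one Link per <a href="..."> element, including those without anchor text. */
  static List<Link> extract(String xhtml) {
    try {
      DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
      NodeList anchors = doc.getElementsByTagName("a");
      List<Link> links = new ArrayList<>();
      for (int i = 0; i < anchors.getLength(); i++) {
        Element a = (Element) anchors.item(i);
        String href = a.getAttribute("href");
        if (href.isEmpty()) {
          continue; // no target, so nothing to link to
        }
        // getTextContent() ignores comment nodes, so an element that is empty
        // and one that holds only a comment both give "" here. The link is
        // kept in either case instead of being dropped.
        String anchor = a.getTextContent().trim();
        links.add(new Link(href, anchor));
      }
      return links;
    } catch (Exception e) {
      throw new IllegalArgumentException("input is not well-formed XHTML", e);
    }
  }

  public static void main(String[] args) {
    String page = "<html><body>"
        + "<a href=\"g\"></a>"
        + "<a href=\"g1\"> <!--whitespace--> </a>"
        + "<a href=\"g2\">text</a>"
        + "</body></html>";
    for (Link link : extract(page)) {
      System.out.println(link.href + " -> \"" + link.anchor + "\"");
    }
  }
}
```

In this sketch the links g and g1 both come out with an empty anchor string, which is the case the second test snippet in the report is meant to cover.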