[ https://issues.apache.org/jira/browse/TIKA-648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089506#comment-13089506 ]
Ken Krugler commented on TIKA-648:
----------------------------------

I think this should be closed, and an improvement request made against TagSoup.

The issue is that TagSoup currently closes the open <a> tag when it hits the <div>. Instead, it could hold onto that markup until it sees something else that indicates it's time to assume a missing closing </a>. Then, when it does see the </a>, it could emit the text while dropping the <div> tags.

I know, pretty ugly, but I think that's how browsers handle it.

> Parsing HTML anchors with embedded div faulty
> ---------------------------------------------
>
>                 Key: TIKA-648
>                 URL: https://issues.apache.org/jira/browse/TIKA-648
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.9
>            Reporter: Markus Jelsma
>             Fix For: 1.0
>
>
> Using Nutch with Tika 0.9 I cannot extract both outlinks from a given page [1].
> This is because Tika doesn't return the document with the anchor text embedded,
> and Nutch skips empty anchors when collecting outlinks.
> The raw HTML is:
> <a href="#"><div>bla 1</div></a>
> <a href="#">bla 2</a>
> But the parsed HTML with tika-app-1.0-SNAPSHOT.jar -h test.html is:
> <a shape="rect" href="#"/>bla 1
> <a shape="rect" href="#">bla 2</a>
> [1]: http://people.apache.org/~markus/test.html
> Also described on the Tika user list:
> http://search.lucidimagination.com/search/document/e74d7e72fd61543a/parsing_html_anchors_with_embedded_div_faulty
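As a rough illustration of the buffering approach described in the comment, here is a minimal, standalone Java sketch. It is not TagSoup code and uses no TagSoup or Tika API: the class name DeferredAnchorClose, the normalize method, and the regex tokenizer are all hypothetical simplifications. It assumes well-delimited tags and only treats <div> as the block element to drop. While an <a> is open, text is buffered and <div> tags are discarded. The anchor is emitted when its </a> arrives, or when a new <a> or the end of input implies a missing </a>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of a "deferred close" for anchors; not part of TagSoup or Tika.
final class DeferredAnchorClose {

    // A token is either a tag (group 1 = "/" for end tags, group 2 = tag name)
    // or a run of text. This is a deliberate simplification of real HTML lexing.
    private static final Pattern TOKEN =
            Pattern.compile("<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>|[^<]+");

    static String normalize(String html) {
        StringBuilder out = new StringBuilder();
        StringBuilder anchorBody = new StringBuilder(); // content held while <a> is open
        String openAnchor = null;                       // the pending <a ...> start tag

        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            String token = m.group();
            String name = m.group(2);
            boolean isTag = name != null;
            boolean isEnd = isTag && !m.group(1).isEmpty();

            if (openAnchor == null) {
                if (isTag && !isEnd && name.equalsIgnoreCase("a")) {
                    openAnchor = token;          // start holding markup
                } else {
                    out.append(token);           // pass everything else through
                }
            } else if (!isTag) {
                anchorBody.append(token);        // keep the anchor text
            } else if (name.equalsIgnoreCase("div")) {
                // Drop <div> and </div> instead of closing the anchor.
            } else if (name.equalsIgnoreCase("a")) {
                // </a> closes normally; a new <a> implies a missing </a>.
                flush(out, openAnchor, anchorBody);
                openAnchor = isEnd ? null : token;
            } else {
                anchorBody.append(token);        // keep other markup inside the anchor
            }
        }
        if (openAnchor != null) {
            flush(out, openAnchor, anchorBody);  // end of input implies a missing </a>
        }
        return out.toString();
    }

    private static void flush(StringBuilder out, String startTag, StringBuilder body) {
        out.append(startTag).append(body).append("</a>");
        body.setLength(0);
    }
}
```

Calling DeferredAnchorClose.normalize on the two raw HTML lines from the report keeps "bla 1" inside the first anchor rather than leaving the anchor empty, which is the behavior Nutch needs to collect both outlinks.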