[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351224#comment-16351224 ]
NW Brad commented on TIKA-2562:
-------------------------------

I was doing some research on this today, and this may not be a function of Tika. I think it is probably the SAXTransformerFactory (javax.xml.transform) that is making the change. At least, I could not find any code in Tika that does it directly. But everything I ran through the SAXTransformerFactory had its empty elements converted to self-closing tags, as shown below:

    <a href="http://www.google.com"></a>
becomes
    <a href="http://www.google.com"/>

and

    <p></p>
becomes
    <p/>

From an XML standpoint the converted syntax is correct, but the self-closing anchor tag, while valid XML, does not appear to work correctly as HTML in the current versions of Chrome and Firefox. So converting HTML via Tika in this situation generates bad HTML for the examples I have.

I believe the SAXTransformerFactory is also deleting the <div> around the "empty" anchor tag, since a div around nothing may not be considered relevant. At least that is what I speculate...

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -----------------------------------------------------------------
>
>                 Key: TIKA-2562
>                 URL: https://issues.apache.org/jira/browse/TIKA-2562
>             Project: Tika
>          Issue Type: Bug
>          Components: gui, parser, server
>    Affects Versions: 1.17
>            Reporter: NW Brad
>            Priority: Major
>         Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in an HTML document that is parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html http://localhost:9998/tika --header "Accept: text/html"
> sent:
> <div>
> <a href="http://www.google.com">http://www.google.com</a>
> </div>
> received back:
> <a shape="rect" href="http://www.google.com">http://www.google.com</a>
>
> The divs are gone and a shape attribute has been added.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
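
The self-closing behavior described in the comment above can be reproduced outside Tika with a small sketch. It assumes the JDK's default javax.xml.transform implementation and uses a plain identity Transformer rather than Tika's own handler chain. The class name SelfClosingDemo and the helper serialize are made up for illustration and are not part of Tika. The sketch serializes the same markup twice, once with the output method set to "xml" and once with "html", to show that the choice of output method is what decides whether an empty element is collapsed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class; not taken from the Tika code base.
class SelfClosingDemo {

    // Runs well-formed markup through an identity transform and
    // serializes it with the given output method ("xml" or "html").
    static String serialize(String markup, String method) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, method);
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(markup)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String input =
            "<div><a href=\"http://www.google.com\"></a><p></p></div>";
        // The "xml" method is expected to collapse the empty elements.
        System.out.println("xml:  " + serialize(input, "xml"));
        // The "html" method is expected to keep explicit end tags.
        System.out.println("html: " + serialize(input, "html"));
    }
}
```

In this sketch the <div> is preserved by both output methods, so the missing <div> and the added shape attribute reported in the issue would have to come from some other stage of the pipeline. The sketch does not try to identify which one.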