[ https://issues.apache.org/jira/browse/TIKA-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16350634#comment-16350634 ]
NW Brad edited comment on TIKA-2562 at 2/2/18 4:51 PM:
-------------------------------------------------------

Thanks. I checked it out and tagsoup is definitely adding the shape. I tried parsing the file using the tagsoup command line, and tagsoup added the shape. However, it appears that the <div> removal is coming from Tika.

Tagsoup parse results:
<div>
<a shape="rect" href="http://www.google.com">http://www.google.com</a>
</div>

Tika parse results:
<a shape="rect" href="http://www.google.com">http://www.google.com</a>

The div is gone...

I also noted another parsing problem that comes from Tika, not tagsoup, when dealing with hidden anchors/hyperlinks:

original:
<a href="http://www.google.com"></a>

Tagsoup results:
<a shape="rect" href="http://www.google.com"></a>

Tika results:
<a shape="rect" href="http://www.google.com"/>

Tika seems to alter the anchor by removing the end tag and turning the element into an empty-element tag. This occurs on other tags as well, most commonly <p></p> becoming <p/>. This may not seem to be a big deal, but with anchors it causes a problem in Chrome and Firefox: the anchor style bleeds into the content immediately following the anchor.

Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs?

Thanks.

was (Author: nwbrad):
Thanks. I checked it out and tagsoup is definitely adding the shape. I tried parsing the file using tagsoup command line, and tagsoup is definitely the shape. However, it appears that the <div> removal is coming from tika.

Tagsoup parse results:
<div>
<a shape="rect" href="http://www.google.com">http://www.google.com</a>
</div>

Tika parse results:
<a shape="rect" href="http://www.google.com">http://www.google.com</a>

The div is gone...
I also noted another parsing problem that comes from Tika, not tagsoup, when dealing with hidden anchors/hyperlinks:

original:
<a href="http://www.google.com"></a>

Tagsoup results:
<a shape="rect" href="http://www.google.com"></a>

Tika results:
<a shape="rect" href="http://www.google.com"/>

Tika seems to alter the anchor by removing the end tag and turning the element into an empty-element tag. This occurs on other tags as well, most commonly <p></p> becoming <p/>. This may not seem to be a big deal, but with anchors it causes a problem in Chrome and Firefox: the anchor style bleeds into the content immediately following the anchor.

Is there a way in Tika to turn off this feature? If not, do you know where in the code this occurs?

Thanks.

> tika server parse HTML removes DIVs around hyperlink & adds shape
> -----------------------------------------------------------------
>
>                 Key: TIKA-2562
>                 URL: https://issues.apache.org/jira/browse/TIKA-2562
>             Project: Tika
>          Issue Type: Bug
>          Components: gui, parser, server
>    Affects Versions: 1.17
>            Reporter: NW Brad
>            Priority: Major
>         Attachments: tika_adds_shape_to_hyperlink.html
>
>
> Hyperlinks in an HTML document that is parsed via tika server:
> curl -X PUT --upload-file tika_adds_shape_to_hyperlink.html http://localhost:9998/tika --header "Accept: text/html"
> sent:
> <div>
> <a href="http://www.google.com">http://www.google.com</a>
> </div>
> received back:
> <a shape="rect" href="http://www.google.com">http://www.google.com</a>
>
> Divs are gone and a shape has been added

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)