[ https://issues.apache.org/jira/browse/TIKA-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774107#comment-16774107 ]
Tim Allison commented on TIKA-2755:
-----------------------------------

~s/^[\r\n]|[\r\n]$//

~s/[\r\n]{2}/\n/

But seriously, I don't. Those are artifacts of the xhtml->text conversion. Maybe take a look in what we're doing in the ToTextHandler?

> Allow Tika to skip extraction of <img> tags in HTML
> ---------------------------------------------------
>
>                 Key: TIKA-2755
>                 URL: https://issues.apache.org/jira/browse/TIKA-2755
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.19.1
>            Reporter: Harinder
>            Priority: Major
>         Attachments: TestForImageTag.html
>
>
> We are using Tika Server to extract text from HTML files. Tika extracts the
> alt text of image tags present in HTML files as _[image: this is the alt text
> of the image]_. This ends up in Solr and shows up in the results when we
> generate document summaries at query time (via Solr’s highlight
> functionality).
> If you PUT the attached HTML file to /tika, it will return the following
> response
> {code:java}
> [image: Return to the homepage]
> This is a test{code}
> It would be nice to have just this instead
> {code:java}
> This is a test {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
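
For anyone who wants to apply the two substitutions quoted in the comment on the client side, here is a minimal Java sketch. It is only an illustration: the class name TikaTextCleanup and the method clean are hypothetical and not part of Tika, and the sketch replaces every match, whereas a Perl-style s/// without the g flag would replace only the first one.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: post-processes text returned by the server.
class TikaTextCleanup {
    // Java form of s/^[\r\n]|[\r\n]$// : a newline character at the very start or end of the text
    private static final Pattern EDGE_NEWLINE = Pattern.compile("^[\\r\\n]|[\\r\\n]$");
    // Java form of s/[\r\n]{2}/\n/ : two consecutive newline characters
    private static final Pattern DOUBLE_NEWLINE = Pattern.compile("[\\r\\n]{2}");

    static String clean(String text) {
        // Assumption: replace all matches rather than only the first one.
        String withoutEdges = EDGE_NEWLINE.matcher(text).replaceAll("");
        return DOUBLE_NEWLINE.matcher(withoutEdges).replaceAll("\n");
    }
}
```

Calling TikaTextCleanup.clean on the response body drops a leading or trailing newline and collapses each pair of newline characters into one. It does not touch the [image: ...] markers themselves.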