Simon Blandford created SOLR-9178:
-------------------------------------

             Summary: ExtractingRequestHandler doesn't strip HTML and adds metadata tags to indexed body
                 Key: SOLR-9178
                 URL: https://issues.apache.org/jira/browse/SOLR-9178
             Project: Solr
          Issue Type: Bug
          Components: update
    Affects Versions: 6.0.1
         Environment: java version "1.8.0_91" 64 bit
Linux Mint 17, 64 bit
            Reporter: Simon Blandford

Starting environment:
solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 directory. The file, test.html, is downloaded from https://wiki.apache.org/solr/UsingMailingLists.

Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore
3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" -F "content/tutorial=@test.html"
4) curl http://localhost:8983/solr/mycore/select?q=information

Expected result:
An HTML->Text version of the document is indexed and returned in the <response> body.

Actual result:
The full HTML, with the angle brackets removed, is indexed along with other unwanted metadata in the <response> body, including fragments of CSS and JavaScript that were in the source document.

Head of response body below...

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">0</int><lst name="params"><str name="q">information</str></lst></lst><result name="response" numFound="1" start="0"><doc><str name="id">doc1</str><arr name="attr_stream_size"><str>20440</str></arr><arr name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr name="attr_stream_content_type"><str>text/html</str></arr><arr name="attr_stream_name"><str>test.html</str></arr><arr name="attr_stream_source_info"><str>content/tutorial</str></arr><arr name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr name="attr_content_encoding"><str>UTF-8</str></arr><arr name="attr_robots"><str>index,nofollow</str></arr><arr name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr name="attr_content_type"><str>text/html; charset=utf-8</str></arr><arr name="attr_content"><str> stylesheet text/css utf-8 all /wiki/modernized/css/common.css stylesheet text/css utf-8 screen /wiki/modernized/css/screen.css stylesheet text/css utf-8 print /wiki/modernized/css/print.css stylesheet text/css utf-8 
projection /wiki/modernized/css/projection.css alternate Solr Wiki: UsingMailingLists /solr/UsingMailingLists?diffs=1&show_att=1&action=rss_rc&unique=0&page=UsingMailingLists&ddiffs=1 application/rss+xml Start /solr/FrontPage Alternate Wiki Markup /solr/UsingMailingLists?action=raw Alternate print Print View /solr/UsingMailingLists?action=print Search /solr/FindPage Index /solr/TitleIndex Glossary /solr/WordIndex Help /solr/HelpOnFormatting stream_size 20440 X-Parsed-By org.apache.tika.parser.DefaultParser X-Parsed-By org.apache.tika.parser.html.HtmlParser stream_content_type text/html stream_name test.html stream_source_info content/tutorial dc:title UsingMailingLists - Solr Wiki Content-Encoding UTF-8 robots index,nofollow Content-Type text/html; charset=utf-8 UsingMailingLists - Solr Wiki header application/x-www-form-urlencoded get searchform /solr/UsingMailingLists hidden action fullsearch hidden context 180 searchinput Search: text searchinput value 20 searchFocus(this) searchBlur(this) searchChange(this) searchChange(this) Search submit titlesearch titlesearch Titles Search Titles submit fullsearch fullsearch Text Search Full Text text/javascript <!--// Initialize search form var f = document.getElementById('searchform'); f.getElementsByTagName('label')[0].style.display = 'none'; var e = document.getElementById('searchinput'); searchChange(e); searchBlur(e); //--> logo rect /solr/FrontPage Solr Wiki
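
To check for the symptom described in this report without reading the response by eye, a small script along the following lines might help. It is only a sketch, not part of Solr: it assumes the XML response format shown in this report and the field name attr_content from the fmap.content parameter in step 3. The marker strings and the function names (content_values, find_residue, fetch_response) are made up for illustration, with the markers picked from the output in this report.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Illustrative markers only: strings from the reported output that would not
# be expected in a plain-text rendering of the page body.
RESIDUE_MARKERS = ("text/css", "text/javascript", "X-Parsed-By", "stream_size")


def content_values(response_xml, field="attr_content"):
    """Return the string values stored under `field` for every <doc>."""
    root = ET.fromstring(response_xml)
    values = []
    for doc in root.iter("doc"):
        for node in doc:
            if node.get("name") != field:
                continue
            if node.tag == "arr":
                # Multi-valued field: <arr name="..."><str>..</str></arr>
                values.extend(child.text or "" for child in node)
            else:
                # Single-valued field: <str name="...">..</str>
                values.append(node.text or "")
    return values


def find_residue(response_xml, field="attr_content"):
    """Return the sorted list of markers that occur in the field's values."""
    text = " ".join(content_values(response_xml, field))
    return sorted(marker for marker in RESIDUE_MARKERS if marker in text)


def fetch_response(url):
    """Fetch a Solr select response as bytes (needs a running Solr)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read()
```

With Solr running as in the steps to reproduce, passing the result of fetch_response("http://localhost:8983/solr/mycore/select?q=information&wt=xml") to find_residue() should give a non-empty list if the CSS, JavaScript and metadata fragments were indexed into the body field, and an empty list if they were not.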