[jira] [Updated] (SOLR-9178) ExtractingRequestHandler doesn't strip HTML and adds metadata to content body

Simon Blandford (JIRA) Thu, 02 Jun 2016 00:40:32 -0700

     [ https://issues.apache.org/jira/browse/SOLR-9178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Simon Blandford updated SOLR-9178:
----------------------------------
    Affects Version/s: 5.0
          Description: 
Starting environment:
solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 directory.
The file, test.html, is downloaded from 
https://wiki.apache.org/solr/UsingMailingLists.

Affected versions: 4.10.3 is the last working version. In 4.10.4, some HTML 
comments and JavaScript break through. Versions 5.0 and later show the full 
symptoms described.

Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore

3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
     -F "content/tutorial=@test.html"

4) curl "http://localhost:8983/solr/mycore/select?q=information"
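
For readers who want to see what each query parameter in step 3 contributes, here is a minimal Python sketch that rebuilds the same URL. The function name build_extract_url and the BASE constant are hypothetical names introduced for illustration; the host, port, core name, and parameter values are copied from the steps above. It only constructs the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Assumption: local Solr on the default port, with the core created in step 2.
BASE = "http://localhost:8983/solr/mycore/update/extract"


def build_extract_url(doc_id: str) -> str:
    """Return the /update/extract URL from step 3 for the given document id."""
    params = [
        ("literal.id", doc_id),            # literal value for the id field
        ("uprefix", "attr_"),              # prefix for fields not defined in the schema
        ("fmap.content", "attr_content"),  # rename the extracted "content" field
        ("commit", "true"),                # commit so step 4 can find the document
    ]
    return BASE + "?" + urlencode(params)


print(build_extract_url("doc1"))
```

With fmap.content=attr_content, the extracted body should end up in the attr_content field, which is the field shown in the response excerpt further down.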

Expected result: an HTML->Text version of the document is indexed and appears 
in the content body of the <response>.
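
To make "HTML->Text" concrete, the following is a rough Python sketch of the kind of output expected: tags dropped, script and style bodies skipped, visible text kept. It is not how Tika or the ExtractingRequestHandler is implemented; the class TextOnly, the function html_to_text, and the sample markup are all hypothetical and use only the Python standard library.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect visible text and skip the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical, heavily shortened stand-in for test.html.
sample = (
    "<html><head><title>UsingMailingLists - Solr Wiki</title>"
    '<link rel="stylesheet" type="text/css" href="/wiki/modernized/css/common.css">'
    '<script type="text/javascript">'
    "var f = document.getElementById('searchform');</script></head>"
    "<body><p>Some information.</p></body></html>"
)
print(html_to_text(sample))
```

In this sketch neither the stylesheet link attributes nor the script source reach the output, which is the behaviour the report expects from the indexed content.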

Actual result: The full HTML, with only the angle brackets removed, is indexed 
in the content body along with unwanted metadata, including fragments of CSS 
and JavaScript from the source document. 
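
One way to check for this symptom automatically is to scan the returned content string for tell-tale markers. The sketch below is an assumption-laden illustration: the marker list is taken from the response excerpt in this report and is not exhaustive, and find_leaks is a hypothetical helper, not part of Solr or Tika.

```python
from typing import List

# Markers copied from the response excerpt in this report (illustrative only).
LEAK_MARKERS = (
    "X-Parsed-By",
    "stream_content_type",
    "stream_source_info",
    "text/javascript",
    "document.getElementById",
    "stylesheet text/css",
)


def find_leaks(content: str) -> List[str]:
    """Return the markers that occur in an indexed content string."""
    return [marker for marker in LEAK_MARKERS if marker in content]


# Fragment of the attr_content value reported in this issue.
bad = (
    " stylesheet text/css utf-8 all /wiki/modernized/css/common.css   "
    "stream_size 20440  \n X-Parsed-By org.apache.tika.parser.DefaultParser  "
)
print(find_leaks(bad))
```

An empty list would suggest the content body is free of these particular fragments; it does not prove the extraction is otherwise correct.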

Head of response body below...

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">0</int><lst name="params"><str 
name="q">information</str></lst></lst><result name="response" numFound="1" 
start="0"><doc><str name="id">doc1</str><arr 
name="attr_stream_size"><str>20440</str></arr><arr 
name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr
 name="attr_stream_content_type"><str>text/html</str></arr><arr 
name="attr_stream_name"><str>test.html</str></arr><arr 
name="attr_stream_source_info"><str>content/tutorial</str></arr><arr 
name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr 
name="attr_content_encoding"><str>UTF-8</str></arr><arr 
name="attr_robots"><str>index,nofollow</str></arr><arr 
name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr 
name="attr_content_type"><str>text/html; charset=utf-8</str></arr><arr 
name="attr_content"><str> 
 
 stylesheet text/css utf-8 all /wiki/modernized/css/common.css   stylesheet 
text/css utf-8 screen /wiki/modernized/css/screen.css   stylesheet text/css 
utf-8 print /wiki/modernized/css/print.css   stylesheet text/css utf-8 
projection /wiki/modernized/css/projection.css   alternate Solr Wiki: 
UsingMailingLists 
/solr/UsingMailingLists?diffs=1&amp;show_att=1&amp;action=rss_rc&amp;unique=0&amp;page=UsingMailingLists&amp;ddiffs=1
 application/rss+xml   Start /solr/FrontPage   Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw   Alternate print Print View 
/solr/UsingMailingLists?action=print   Search /solr/FindPage   Index 
/solr/TitleIndex   Glossary /solr/WordIndex   Help /solr/HelpOnFormatting   
stream_size 20440  
 X-Parsed-By org.apache.tika.parser.DefaultParser  
 X-Parsed-By org.apache.tika.parser.html.HtmlParser  
 stream_content_type text/html  
 stream_name test.html  
 stream_source_info content/tutorial  
 dc:title UsingMailingLists - Solr Wiki  
 Content-Encoding UTF-8  
 robots index,nofollow  
 Content-Type text/html; charset=utf-8  
 UsingMailingLists - Solr Wiki 
 
 

 header 

 application/x-www-form-urlencoded get searchform /solr/UsingMailingLists 
 
 hidden action fullsearch  
 hidden context 180  
 searchinput Search: 
 text searchinput value  20 searchFocus(this) searchBlur(this) 
searchChange(this) searchChange(this) Search  
 submit titlesearch titlesearch Titles Search Titles  
 submit fullsearch fullsearch Text Search Full Text  
 

 

 text/javascript 
&lt;!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//--&gt;
 

 logo  rect /solr/FrontPage Solr Wiki  


  was:
Starting environment:
solr-6.0.1.tgz is downloaded and extracted. We are in the solr-6.0.1 directory.
The file, test.html, is downloaded from 
https://wiki.apache.org/solr/UsingMailingLists.

Steps to reproduce:
1) bin/solr start
2) bin/solr create -c mycore

3) curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
     -F "content/tutorial=@test.html"

4) curl "http://localhost:8983/solr/mycore/select?q=information"

Expected result: an HTML->Text version of the document is indexed and appears 
in the content body of the <response>.

Actual result: The full HTML, with only the angle brackets removed, is indexed 
in the content body along with unwanted metadata, including fragments of CSS 
and JavaScript from the source document. 

Head of response body below...

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader"><int name="status">0</int><int 
name="QTime">0</int><lst name="params"><str 
name="q">information</str></lst></lst><result name="response" numFound="1" 
start="0"><doc><str name="id">doc1</str><arr 
name="attr_stream_size"><str>20440</str></arr><arr 
name="attr_x_parsed_by"><str>org.apache.tika.parser.DefaultParser</str><str>org.apache.tika.parser.html.HtmlParser</str></arr><arr
 name="attr_stream_content_type"><str>text/html</str></arr><arr 
name="attr_stream_name"><str>test.html</str></arr><arr 
name="attr_stream_source_info"><str>content/tutorial</str></arr><arr 
name="attr_dc_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr 
name="attr_content_encoding"><str>UTF-8</str></arr><arr 
name="attr_robots"><str>index,nofollow</str></arr><arr 
name="attr_title"><str>UsingMailingLists - Solr Wiki</str></arr><arr 
name="attr_content_type"><str>text/html; charset=utf-8</str></arr><arr 
name="attr_content"><str> 
 
 stylesheet text/css utf-8 all /wiki/modernized/css/common.css   stylesheet 
text/css utf-8 screen /wiki/modernized/css/screen.css   stylesheet text/css 
utf-8 print /wiki/modernized/css/print.css   stylesheet text/css utf-8 
projection /wiki/modernized/css/projection.css   alternate Solr Wiki: 
UsingMailingLists 
/solr/UsingMailingLists?diffs=1&amp;show_att=1&amp;action=rss_rc&amp;unique=0&amp;page=UsingMailingLists&amp;ddiffs=1
 application/rss+xml   Start /solr/FrontPage   Alternate Wiki Markup 
/solr/UsingMailingLists?action=raw   Alternate print Print View 
/solr/UsingMailingLists?action=print   Search /solr/FindPage   Index 
/solr/TitleIndex   Glossary /solr/WordIndex   Help /solr/HelpOnFormatting   
stream_size 20440  
 X-Parsed-By org.apache.tika.parser.DefaultParser  
 X-Parsed-By org.apache.tika.parser.html.HtmlParser  
 stream_content_type text/html  
 stream_name test.html  
 stream_source_info content/tutorial  
 dc:title UsingMailingLists - Solr Wiki  
 Content-Encoding UTF-8  
 robots index,nofollow  
 Content-Type text/html; charset=utf-8  
 UsingMailingLists - Solr Wiki 
 
 

 header 

 application/x-www-form-urlencoded get searchform /solr/UsingMailingLists 
 
 hidden action fullsearch  
 hidden context 180  
 searchinput Search: 
 text searchinput value  20 searchFocus(this) searchBlur(this) 
searchChange(this) searchChange(this) Search  
 submit titlesearch titlesearch Titles Search Titles  
 submit fullsearch fullsearch Text Search Full Text  
 

 

 text/javascript 
&lt;!--// Initialize search form
var f = document.getElementById('searchform');
f.getElementsByTagName('label')[0].style.display = 'none';
var e = document.getElementById('searchinput');
searchChange(e);
searchBlur(e);
//--&gt;
 

 logo  rect /solr/FrontPage Solr Wiki  



> ExtractingRequestHandler doesn't strip HTML and adds metadata to content body
> -----------------------------------------------------------------------------
>
>                 Key: SOLR-9178
>                 URL: https://issues.apache.org/jira/browse/SOLR-9178
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 5.0, 6.0.1
>         Environment: java version "1.8.0_91" 64 bit
> Linux Mint 17, 64 bit
>            Reporter: Simon Blandford



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
