[jira] [Commented] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception
[ https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090625#comment-15090625 ]

Sebastian Nagel commented on NUTCH-2198:

Tried to reproduce the Solr exception by indexing one of the JPEGs shown in the log snippet (ciencia11.jpg).
* the Solr exception is not caused by this image (or Solr 4.10.4 is safe)
* however, the indexed rawcontent is modified. E.g., the 4 leading bytes are stripped:
{noformat}
% od -tcx1 ciencia11.jpg | head -2
000 377 330 377 341  \v   /   E   x   i   f  \0  \0   M   M  \0   *
     ff  d8  ff  e1  0b  2f  45  78  69  66  00  00  4d  4d  00  2a
{noformat}
vs.
{noformat}
% curl -s 'http://localhost:8983/solr/collection1/select?q=url%3A%22http%3A%2F%2Flocalhost%2Fnutch%2Ftest%2Fciencia11.jpg%22&wt=json&indent=true'
{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"url:\"http://localhost/nutch/test/ciencia11.jpg\"",
      "indent":"true",
      "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
      {
        "tstamp":"1970-01-01T00:00:00Z",
        "rawcontent":"#11;/Exif#0;#0;MM#0;*#0;#0;#0;#8;#0; ...
{noformat}
We need a different mechanism to index HTML or binary content -- as binary field, converting it to Base64, etc. Forcing a string conversion by a platform-dependent charset and then stripping some (but not all!) binary characters away is surely no proper solution.
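The Base64 option mentioned above can be sketched as follows. This is only a minimal illustration, not the fix for this issue: the class name `RawContentEncoding` and its helper methods are hypothetical, the sketch assumes Java 8 or later for `java.util.Base64`, and the sample bytes are the 16 bytes shown in the `od` output. For contrast, it also pushes the same bytes through a UTF-8 String, which does not round-trip.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class, not part of Nutch.
class RawContentEncoding {

  /** Encodes raw bytes as Base64 text (ASCII only, so valid in any string field). */
  public static String toBase64(byte[] raw) {
    return Base64.getEncoder().encodeToString(raw);
  }

  /** Restores the original bytes from Base64 text. */
  public static byte[] fromBase64(String encoded) {
    return Base64.getDecoder().decode(encoded);
  }

  /** Decodes the bytes as UTF-8 text and encodes them again; lossy for binary data. */
  public static byte[] viaUtf8String(byte[] raw) {
    String text = new String(raw, StandardCharsets.UTF_8);
    return text.getBytes(StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // First 16 bytes of ciencia11.jpg as shown by od.
    byte[] header = {
      (byte) 0xff, (byte) 0xd8, (byte) 0xff, (byte) 0xe1, 0x0b, 0x2f, 0x45, 0x78,
      0x69, 0x66, 0x00, 0x00, 0x4d, 0x4d, 0x00, 0x2a
    };

    String encoded = toBase64(header);
    System.out.println("Base64: " + encoded);
    System.out.println("Base64 round trip intact: "
        + Arrays.equals(header, fromBase64(encoded)));
    System.out.println("UTF-8 String round trip intact: "
        + Arrays.equals(header, viaUtf8String(header)));
  }
}
```

Base64 text is pure ASCII, so it survives a string field unchanged, at the cost of roughly one third more space than the raw bytes.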
> Indexing binary content by index-html causes Solr Exception
> ---
>
> Key: NUTCH-2198
> URL: https://issues.apache.org/jira/browse/NUTCH-2198
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 2.3.1
> Reporter: Sebastian Nagel
> Fix For: 2.4
>
>
> (reported by [~kalanya] in NUTCH-2168)
> If raw binary is indexed using the plugin index-html this may cause an
> exception in Solr:
> {noformat}
> 2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
> 2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
> 2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
> java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
> {noformat}
> The index-html plugin tries to treat any raw content as readable content
> converting it to a String based on the platform-dependent charset (cf.
> [Scanner API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
> {code:title=HtmlIndexingFilter.java}
> Scanner scanner = new Scanner(arrayInputStream);
> scanner.useDelimiter("\\Z");//To read all scanner content in one String
> String data = "";
> if (scanner.hasNext()) {
>   data = scanner.next();
> }
> doc.add("rawcontent", StringUtil.cleanField(data));
> {code}
> The field "rawcontent" is of type "string":
> {code:xml|title=conf/schema.xml}
>
>
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15090622#comment-15090622 ]

Hudson commented on NUTCH-2168:
---

SUCCESS: Integrated in Nutch-nutchgora #1545 (See [https://builds.apache.org/job/Nutch-nutchgora/1545/])
NUTCH-2168 Parse-tika fails to retrieve parser (snagel: [http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1723851])
* 2.x/CHANGES.txt
* 2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java

> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.1
> Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO parse.ParserJob - Parsing http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN parse.ParseUtil - Unable to successfully parse content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.
[jira] [Updated] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception
[ https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2198:
---
Description:
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
{noformat}
The index-html plugin tries to treat any raw content as readable content converting it to a String based on the platform-dependent charset (cf. [Scanner API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
  data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}
The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
{code}

was:
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
{noformat}
The index-html plugin tries to treat any raw content as readable content converting it to a String based on the platform-dependent charset (cf. [Scanner API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
  data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}
The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
{code}
[jira] [Resolved] (NUTCH-2168) Parse-tika fails to retrieve parser
[ https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2168.
Resolution: Fixed

Committed to 2.x, r1723851.
Opened NUTCH-2198 to track the problem when indexing the raw binary content using the plugin index-html.
Thanks, [~lewismc] and [~kalanya], for the review!
[jira] [Created] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception
Sebastian Nagel created NUTCH-2198:
--
Summary: Indexing binary content by index-html causes Solr Exception
Key: NUTCH-2198
URL: https://issues.apache.org/jira/browse/NUTCH-2198
Project: Nutch
Issue Type: Bug
Components: indexer
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
Fix For: 2.4

(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char #137317, byte #139263)
{noformat}
The index-html plugin tries to treat any raw content as readable content converting it to a String based on the platform-dependent charset (cf. [Scanner API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z");//To read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
  data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}
The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
{code}
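The charset dependence described in this issue can be reproduced outside Nutch. The sketch below is an illustration under stated assumptions, not code from the plugin: the class `ScannerDecodingDemo` and the method `readAll` are hypothetical, the charset name is passed explicitly to stand in for different platform defaults, and the sample bytes are the first 16 bytes of ciencia11.jpg as reported in the comments on this issue.

```java
import java.io.ByteArrayInputStream;
import java.util.Scanner;

// Hypothetical demo class, not part of Nutch.
class ScannerDecodingDemo {

  // First 16 bytes of the JPEG (assumption: taken from the od output in the comments).
  static final byte[] JPEG_HEADER = {
    (byte) 0xff, (byte) 0xd8, (byte) 0xff, (byte) 0xe1, 0x0b, 0x2f, 0x45, 0x78,
    0x69, 0x66, 0x00, 0x00, 0x4d, 0x4d, 0x00, 0x2a
  };

  /**
   * Reads the whole byte array as one String with the same Scanner idiom
   * as the plugin, but with an explicit charset instead of the platform default.
   */
  public static String readAll(byte[] raw, String charsetName) {
    try (Scanner scanner = new Scanner(new ByteArrayInputStream(raw), charsetName)) {
      scanner.useDelimiter("\\Z"); // read all content as a single token
      return scanner.hasNext() ? scanner.next() : "";
    }
  }

  public static void main(String[] args) {
    String utf8 = readAll(JPEG_HEADER, "UTF-8");
    String latin1 = readAll(JPEG_HEADER, "ISO-8859-1");

    // Bytes that are not valid UTF-8 are decoded to the replacement character U+FFFD.
    System.out.println("UTF-8 result contains U+FFFD: " + (utf8.indexOf('\uFFFD') >= 0));
    // The same bytes yield a different String under a different charset.
    System.out.println("Same text under both charsets: " + utf8.equals(latin1));
  }
}
```

Because the decoded text differs by charset, whatever is later stripped from it also differs, which matches the observation that the indexed rawcontent no longer corresponds to the original bytes.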