[ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]

Piotr Kosiorowski closed NUTCH-91:
----------------------------------

    Fix Version: 0.7.2-dev
                 0.8-dev
     Resolution: Fixed

Committed with small extension. Thanks.

> empty encoding causes exception
> -------------------------------
>
>          Key: NUTCH-91
>          URL: http://issues.apache.org/jira/browse/NUTCH-91
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Michael Nebel
>      Fix For: 0.7.2-dev, 0.8-dev
>
> I found some sites, where the header says: "Content-Type: text/html; charset=". This causes an exception in the HtmlParser. My suggestion:
>
> Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> ===================================================================
> --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	(revision 279397)
> +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	(working copy)
> @@ -120,7 +120,7 @@
>        byte[] contentInOctets = content.getContent();
>        InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>        String encoding = StringUtil.parseCharacterEncoding(contentType);
> -      if (encoding!=null) {
> +      if (encoding!=null && !"".equals(encoding)) {
>          metadata.put("OriginalCharEncoding", encoding);
>          if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
>            metadata.put("CharEncodingForConversion", encoding);
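
For readers who want to try the guard from the patch outside of Nutch, here is a minimal standalone sketch. It is not the Nutch code: the class CharsetGuard and the methods parseCharset and usableCharset are hypothetical names made up for this illustration, and the header parsing is a simplified stand-in, not the behavior of StringUtil.parseCharacterEncoding. It only shows the idea of treating an empty "charset=" value the same way as a missing one.

```java
import java.nio.charset.Charset;

// Hypothetical illustration only; none of these names come from Nutch.
class CharsetGuard {

    private static final String KEY = "charset=";

    /**
     * Simplified stand-in for a header parser: returns the raw value that
     * follows "charset=" (possibly the empty string), or null if the
     * parameter is absent.
     */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        int start = -1;
        // Case-insensitive search for the parameter name.
        for (int i = 0; i + KEY.length() <= contentType.length(); i++) {
            if (contentType.regionMatches(true, i, KEY, 0, KEY.length())) {
                start = i + KEY.length();
                break;
            }
        }
        if (start < 0) {
            return null;
        }
        String value = contentType.substring(start);
        int end = value.indexOf(';');
        if (end >= 0) {
            value = value.substring(0, end);
        }
        value = value.trim();
        // Strip one pair of surrounding double quotes, if present.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return value;
    }

    /**
     * Returns a charset name that is safe to use, or null when the header
     * has no charset, an empty charset, or one this JVM does not support.
     */
    static String usableCharset(String contentType) {
        String encoding = parseCharset(contentType);
        // Same shape as the guard in the patch: null and "" are both "no encoding".
        if (encoding != null && !"".equals(encoding)) {
            try {
                if (Charset.isSupported(encoding)) {
                    return encoding;
                }
            } catch (IllegalArgumentException e) {
                // Covers IllegalCharsetNameException for syntactically bad names.
                return null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(usableCharset("text/html; charset="));
        System.out.println(usableCharset("text/html; charset=utf-8"));
        System.out.println(usableCharset("text/html"));
    }
}
```

With this shape the caller has a single "no usable encoding" case (null) and can fall back to whatever default or detection step it already uses, instead of handling the empty string separately.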