[ http://issues.apache.org/jira/browse/NUTCH-91?page=all ]
     
Piotr Kosiorowski closed NUTCH-91:
----------------------------------

    Fix Version: 0.7.2-dev
                 0.8-dev
     Resolution: Fixed

Committed with a small extension. Thanks.

> empty encoding causes exception
> -------------------------------
>
>          Key: NUTCH-91
>          URL: http://issues.apache.org/jira/browse/NUTCH-91
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Michael Nebel
>      Fix For: 0.7.2-dev, 0.8-dev
>
> I found some sites where the header says: "Content-Type: text/html; charset=".
> This causes an exception in the HtmlParser. My suggestion:
> Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
> ===================================================================
> --- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (revision 279397)
> +++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java  (working copy)
> @@ -120,7 +120,7 @@
>        byte[] contentInOctets = content.getContent();
>        InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
>        String encoding = StringUtil.parseCharacterEncoding(contentType);
> -      if (encoding!=null) {
> +      if (encoding!=null && !"".equals(encoding)) {
>          metadata.put("OriginalCharEncoding", encoding);
>          if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
>            metadata.put("CharEncodingForConversion", encoding);
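
The guard in the patch above can be tried in isolation with the sketch below. It is only an illustration under stated assumptions: the class EncodingGuard and the methods parseCharset and usableEncoding are hypothetical names, and parseCharset is a simplified stand-in for StringUtil.parseCharacterEncoding, whose real behaviour is not shown in this message.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EncodingGuard {

    // Matches the charset parameter of a Content-Type value, case-insensitively.
    private static final Pattern CHARSET =
            Pattern.compile("charset=([^;]*)", Pattern.CASE_INSENSITIVE);

    // Hypothetical stand-in for StringUtil.parseCharacterEncoding (assumption):
    // returns the text after "charset=", which may be empty, or null when the
    // header carries no charset parameter at all.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).trim() : null;
    }

    // Applies the same condition as the patched line: the encoding is used
    // only when it is neither null nor the empty string.
    static String usableEncoding(String contentType) {
        String encoding = parseCharset(contentType);
        if (encoding != null && !"".equals(encoding)) {
            return encoding;
        }
        return null; // the caller falls back to its default handling
    }

    public static void main(String[] args) {
        System.out.println(usableEncoding("text/html; charset="));      // prints "null"
        System.out.println(usableEncoding("text/html; charset=utf-8")); // prints "utf-8"
        System.out.println(usableEncoding("text/html"));                // prints "null"
    }
}
```

With this check, a header such as "Content-Type: text/html; charset=" is treated the same way as a header with no charset parameter, so the empty string never reaches the alias lookup.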

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
