[ https://issues.apache.org/jira/browse/NUTCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-632.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s: 1.0.0

I am closing this issue as "Won't Fix" because TextParser now uses the new 
EncodingDetector class.

> Bug in TextParser with encoding
> -------------------------------
>
>                 Key: NUTCH-632
>                 URL: https://issues.apache.org/jira/browse/NUTCH-632
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>         Environment: Any
>            Reporter: Antony Bowesman
>             Fix For: 1.0.0
>
>
> If a Content object is created with the Content-Type
>     text/plain; charset="windows-1251"
> the Content object discards the charset parameter.  As a result, when
> TextParser calls
>     String encoding = StringUtil.parseCharacterEncoding(content.getContentType());
> it always gets null, because the contentType stored in the Content object no
> longer contains the charset parameter.  The code has changed a lot since 0.9, so
> I am not sure whether this is still a problem, but I made a fix that simply
> saves the charset in Content with
>     if (this.contentType.startsWith("text/"))
>         this.charset = StringUtil.parseCharacterEncoding(contentType);
> so that TextParser just calls
>     String encoding = content.getCharset();
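
For illustration, the failure mode described in the report can be sketched in a few self-contained lines. The class CharsetSketch and its methods parseCharset and baseType are hypothetical stand-ins written for this example; they are not Nutch's StringUtil or Content code, and the real parsing rules in Nutch may differ. The sketch only assumes that the stored content type has had its parameters stripped, as the report states.

```java
import java.util.Locale;

public class CharsetSketch {

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Returns the media type with all parameters removed (assumed to be what gets stored). */
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        return (semi < 0 ? contentType : contentType.substring(0, semi)).trim();
    }

    public static void main(String[] args) {
        String header = "text/plain; charset=\"windows-1251\"";
        // Parsing the full header finds the charset.
        System.out.println(parseCharset(header));
        // Parsing the stripped type finds nothing, which matches the reported symptom.
        System.out.println(parseCharset(baseType(header)));
    }
}
```

Parsing the charset before the parameters are discarded, as the reporter's patch does, avoids the null result in the second call.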

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
