[ https://issues.apache.org/jira/browse/NUTCH-632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-632.
-------------------------------

       Resolution: Won't Fix
    Fix Version/s: 1.0.0

I am closing this issue as "Won't Fix" because TextParser now uses the new EncodingDetector class.

> Bug in TextParser with encoding
> -------------------------------
>
>                 Key: NUTCH-632
>                 URL: https://issues.apache.org/jira/browse/NUTCH-632
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.9.0
>         Environment: Any
>            Reporter: Antony Bowesman
>             Fix For: 1.0.0
>
>
> If a Content object is created with the following Content-Type:
>   text/plain; charset="windows-1251"
> the Content object discards the charset parameter. As a result, when the
> TextParser calls
>   String encoding = StringUtil.parseCharacterEncoding(content.getContentType());
> it always gets null because the contentType stored in the Content object no
> longer contains the charset string. The code has changed a lot from 0.9, so
> I am not sure if this is still a problem, but I made a fix that simply saves
> charset in Content with
>   if (this.contentType.startsWith("text/"))
>     this.charset = StringUtil.parseCharacterEncoding(contentType);
> and TextParser just calls
>   String encoding = content.getCharset();

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
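
For readers following the quoted report, here is a minimal, self-contained sketch of the idea it describes: read the charset parameter from the raw Content-Type value before that parameter is discarded, and keep it only for text/* types. The class and method names (CharsetParamSketch, parseCharset, charsetForText) are hypothetical and are not part of Nutch. The sketch does not reproduce StringUtil.parseCharacterEncoding or the EncodingDetector class mentioned in the resolution.

```java
import java.util.Locale;

// Hypothetical illustration only; not Nutch code.
class CharsetParamSketch {

    /** Returns the charset parameter of a Content-Type value, or null if absent. */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type, separated by ';'.
        // Assumption: parameter values do not themselves contain ';'.
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (!name.equals("charset")) {
                continue;
            }
            String value = param.substring(eq + 1).trim();
            // Strip optional surrounding double quotes, as in charset="windows-1251".
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            return value.isEmpty() ? null : value;
        }
        return null;
    }

    /** Mirrors the condition in the report: keep a charset only for text/* types. */
    static String charsetForText(String contentType) {
        if (contentType != null && contentType.startsWith("text/")) {
            return parseCharset(contentType);
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetForText("text/plain; charset=\"windows-1251\""));
        System.out.println(charsetForText("text/plain"));
        System.out.println(charsetForText("image/png; charset=utf-8"));
    }
}
```

In this sketch the first call yields windows-1251 and the other two yield null. The point is that the value has to be captured from the full header, because once the stored content type has been reduced to the bare media type there is nothing left for a later parse to find.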