[
https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sami Siren resolved NUTCH-161.
------------------------------
Resolution: Fixed
I just committed a fix for this, thanks KuroSaka!
> Change Plain text parser to use parser.character.encoding.default property
> for fall back encoding
> -------------------------------------------------------------------------------------------------
>
> Key: NUTCH-161
> URL: https://issues.apache.org/jira/browse/NUTCH-161
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Environment: any
> Reporter: KuroSaka TeruHiko
> Assigned To: Sami Siren
> Priority: Minor
> Fix For: 1.0.0
>
>
> The value of the property parser.character.encoding.default is used as a
> fallback character encoding (charset) when the HTML parser cannot find the
> charset information in the HTTP Content-Type header or in a META HTTP-EQUIV
> tag. But the plain text parser behaves differently. It just uses the system
> encoding (the Java VM's file.encoding property, which in turn derives from
> the OS and the locale of the environment from which the JVM was spawned),
> so the same document can be decoded differently on different machines. To
> guarantee consistent behavior, the plain text parser should use the value
> of the same property.
> Though not tested, these changes in
> ./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java
> should do it:
> Insert this statement in the class definition:
> private static String defaultCharEncoding =
> NutchConf.get().get("parser.character.encoding.default", "windows-1252");
> Replace this:
> text = new String(content.getContent()); // use default encoding
> with this:
> text = new String(content.getContent(), defaultCharEncoding); // use the configured fallback encoding
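For readers following along, here is a minimal standalone sketch of the fallback idea described in the issue. It is not the committed patch and does not touch NutchConf or TextParser; the class and method names (FallbackDecoding, resolve, decode) are hypothetical, and the configured value is simply passed in as a string. One thing worth noting: new String(byte[], String) declares the checked UnsupportedEncodingException, so this sketch resolves the name to a Charset first and falls back to windows-1252 (the default from the proposal above) if the name is missing or unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

public class FallbackDecoding {
    // Last-resort charset, mirroring the "windows-1252" default in the proposal.
    static final String LAST_RESORT = "windows-1252";

    // Resolve a configured charset name; fall back if it is null, malformed, or unsupported.
    static Charset resolve(String configuredName) {
        if (configuredName != null) {
            try {
                return Charset.forName(configuredName);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Ignore and use the last-resort charset below.
            }
        }
        return Charset.forName(LAST_RESORT);
    }

    // Decode raw bytes with an explicit charset instead of the platform default.
    static String decode(byte[] content, String configuredName) {
        return new String(content, resolve(configuredName));
    }

    public static void main(String[] args) {
        // "cafe" with an e-acute as the last letter, encoded in windows-1252.
        byte[] bytes = {0x63, 0x61, 0x66, (byte) 0xE9};
        System.out.println(decode(bytes, "windows-1252"));
        System.out.println(decode(bytes, "no-such-charset"));
    }
}
```

Because the charset is named explicitly, the result no longer depends on the locale the JVM was started in, which is the consistency the issue asks for.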
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers