Sebastian Nagel created NUTCH-2773:
--------------------------------------
Summary: SegmentReader (-dump or -get): show HTML content as UTF-8
Key: NUTCH-2773
URL: https://issues.apache.org/jira/browse/NUTCH-2773
Project: Nutch
Issue Type: Improvement
Components: segment
Affects Versions: 1.16
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Fix For: 1.17
SegmentReader dumps and the output shown by -get are first converted to Java
strings and then written using UTF-8 as the output encoding. The HTML page
content is held by the container class "Content" as byte[], and if the original
page encoding is a charset other than UTF-8, the output of SegmentReader may
look garbled. The reader could use the encoding already detected by the parser
(if available) and try to properly recode the HTML page content to UTF-8.
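
The proposed recoding could look roughly like the sketch below. It is not the
actual patch: the class ContentRecoder and its methods are hypothetical names
used for illustration only, and how SegmentReader would obtain the charset name
detected by the parser is left out. The sketch assumes the detected charset
arrives as a plain string (possibly null) and falls back to UTF-8 when the name
is missing or unknown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: recodes raw page bytes to UTF-8.
class ContentRecoder {

  /**
   * Decode the raw page bytes with the charset detected by the parser.
   * Falls back to UTF-8 if no charset was detected or the name is unusable.
   */
  static String decode(byte[] raw, String detectedCharset) {
    Charset charset = StandardCharsets.UTF_8;
    if (detectedCharset != null) {
      try {
        charset = Charset.forName(detectedCharset);
      } catch (IllegalArgumentException e) {
        // Covers both illegal and unsupported charset names:
        // keep the UTF-8 fallback.
      }
    }
    return new String(raw, charset);
  }

  /** Recode the raw page bytes to UTF-8 for output. */
  static byte[] recodeToUtf8(byte[] raw, String detectedCharset) {
    return decode(raw, detectedCharset).getBytes(StandardCharsets.UTF_8);
  }
}
```

With this approach a page stored as ISO-8859-1 bytes would be decoded with
ISO-8859-1 and written as UTF-8, instead of being decoded with the wrong
charset. Bytes that are malformed for the chosen charset are replaced by the
decoder's replacement character rather than causing an error.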