[ https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249465#comment-15249465 ]
Federico Bonelli commented on NUTCH-1785:
-----------------------------------------

I'm experiencing charset issues with this patch, probably related to what Sebastian Nagel remarked:

bq. conversion via {code} new String(content.getContent()) {code} is needless if base64 is true

I will now try to base64-encode the content.getContent() byte array directly. Still, I am wondering about the current code, which converts from byte[] to String and back to byte[] before base64 encoding:

{code:java}
String binary = new String(content.getContent());

// optionally encode as base64
if (base64) {
  binary = Base64.encodeBase64String(StringUtils.getBytesUtf8(binary));
}
{code}

What was the initial intent behind this?

> Ability to index raw content
> ----------------------------
>
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured indexing back-end. Since Content is never read, a plugin is out of the question and therefore we need to force IndexJob to process Content as well.
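
For reference, here is a minimal sketch of how the byte[] to String to byte[] round trip in the comment can alter content before base64 encoding. It is not the patch code. It uses java.util.Base64 from the JDK in place of Commons Codec, pins the charset to UTF-8 (the patch relies on the platform default, so the actual behaviour may differ per machine), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class RawContentBase64 {

  // Mirrors the round trip in the comment: byte[] -> String -> byte[] -> base64.
  // Assumption: UTF-8 is used for decoding; the patch uses the platform default.
  public static String encodeViaString(byte[] raw) {
    String text = new String(raw, StandardCharsets.UTF_8);
    return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
  }

  // Encodes the raw bytes directly, without any charset interpretation.
  public static String encodeDirect(byte[] raw) {
    return Base64.getEncoder().encodeToString(raw);
  }

  public static byte[] decode(String base64) {
    return Base64.getDecoder().decode(base64);
  }

  public static void main(String[] args) {
    // "caf" followed by 0xE9 ("é" in ISO-8859-1), which is not valid UTF-8 here.
    byte[] raw = {'c', 'a', 'f', (byte) 0xE9};

    // The invalid byte is replaced with U+FFFD during decoding, so the
    // bytes that come back out differ from the original content.
    System.out.println(Arrays.equals(raw, decode(encodeViaString(raw)))); // false

    // Direct encoding preserves the content byte for byte.
    System.out.println(Arrays.equals(raw, decode(encodeDirect(raw)))); // true
  }
}
```

Under these assumptions, any fetched content that is not valid in the decoding charset (binary documents, or pages in another encoding) is changed by the String round trip, whereas encoding the byte array directly leaves it intact.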