[jira] [Commented] (NUTCH-1785) Ability to index raw content

Federico Bonelli (JIRA) Wed, 20 Apr 2016 01:09:37 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249465#comment-15249465
 ]


Federico Bonelli commented on NUTCH-1785:
-----------------------------------------

I'm experiencing charset issues with this patch, probably due to Sebastian 
Nagel's remark:
bq. conversion via {code} new String(content.getContent()) {code} is needless 
if base64 is true

I will now try to base64-encode the content.getContent() byte array directly, 
but I was wondering about the initial intent behind converting from byte[] to 
String and back to byte[] before base64 encoding.

{code:java}
String binary = new String(content.getContent());

// optionally encode as base64
if (base64) {
        binary = Base64.encodeBase64String(StringUtils.getBytesUtf8(binary));
}
{code}

What was the initial intent behind this?
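
To illustrate the kind of charset problem I suspect, here is a minimal standalone sketch. It is not the patch code: it uses java.util.Base64 from the JDK instead of commons-codec, the class and method names (RawContentBase64, encodeViaString, encodeDirect) are made up for this example, and the decode step uses an explicit UTF-8 charset to stand in for a platform whose default charset is UTF-8 (new String(byte[]) uses the platform default). The sample input is a byte sequence that is not valid UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class RawContentBase64 {

    // Mirrors the round trip in the snippet above: bytes -> String -> UTF-8 bytes -> base64.
    // UTF-8 is passed explicitly so the result does not depend on the platform default.
    static String encodeViaString(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8);
        return Base64.getEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Encodes the fetched bytes as they are, with no charset decoding in between.
    static String encodeDirect(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        // 'c', 'a', 'f', then 0xE9 (e-acute in ISO-8859-1), which is malformed as UTF-8.
        byte[] raw = {0x63, 0x61, 0x66, (byte) 0xE9};

        byte[] viaString = Base64.getDecoder().decode(encodeViaString(raw));
        byte[] direct = Base64.getDecoder().decode(encodeDirect(raw));

        // The round trip replaces the malformed byte, so the original bytes are lost.
        System.out.println("via String preserves bytes: " + Arrays.equals(raw, viaString));
        System.out.println("direct preserves bytes:     " + Arrays.equals(raw, direct));
    }
}
```

If this matches what happens in the indexer, passing content.getContent() straight to Base64.encodeBase64String(byte[]) should avoid the problem when base64 is true, since no charset is involved at any point.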

> Ability to index raw content
> ----------------------------
>
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
> NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a 
> configured indexing back-end. Since Content is never read, a plugin is out of 
> the question and therefore we need to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
