[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1785:
---------------------------------

    Attachment: NUTCH-1785-trunk.patch

* the command option is -addBinaryContent
* the field is binaryContent
* for Solr, the field is passed through stripNonCharCodePoint for obvious reasons
* added a -base64 option so users can index real binary content (images, mp3, 
whatever) and not just plain (X)HTML

This seems to work fine now: no weird exceptions, not enabled by default (I had 
a wrong boolean), and both the base64 and non-base64 modes work.
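
For illustration only, here is a minimal standalone sketch of the two ways the 
raw bytes could end up as a field value. It is not the code from the patch: 
the class BinaryContentSketch and the methods toFieldValue and stripInvalid 
are hypothetical names, and the filter below simply keeps the characters that 
XML 1.0 allows, which is an assumption about what the stripNonCharCodePoint 
step is meant to achieve.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BinaryContentSketch {

    // Hypothetical stand-in for the stripping step: keeps only code points
    // that XML 1.0 permits (an assumption, not the actual Nutch logic).
    static String stripInvalid(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            boolean allowed = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (allowed) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // With base64 set, the bytes are encoded as-is, so any payload survives.
    // Without it, the bytes are decoded as UTF-8 text (assumed encoding)
    // and cleaned before being handed to the index.
    static String toFieldValue(byte[] raw, boolean base64) {
        if (base64) {
            return Base64.getEncoder().encodeToString(raw);
        }
        return stripInvalid(new String(raw, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] html = "<p>hi\0</p>".getBytes(StandardCharsets.UTF_8);
        byte[] image = {(byte) 0x89, 'P', 'N', 'G'};
        System.out.println(toFieldValue(html, false)); // NUL character removed
        System.out.println(toFieldValue(image, true)); // base64 text
    }
}
```

The text path loses anything that is not valid text, which is why a separate 
base64 path is useful for images and other non-text content.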

> Ability to index raw content
> ----------------------------
>
>                 Key: NUTCH-1785
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1785
>             Project: Nutch
>          Issue Type: New Feature
>          Components: indexer
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.9
>
>         Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
> NUTCH-1785-trunk.patch
>
>
> Some use-cases require Nutch to actually write the raw content to a configured 
> indexing back-end. Since Content is never read, a plugin is out of the 
> question and therefore we need to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
