I am not sure what you mean. The raw content, including the HTML, is stored on
disk by default. Each segment has a content directory containing just that. But
I don't know what you mean by "the markup itself as indexed".
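
To look at that stored raw content, something along these lines may help. It is
only a sketch: the segment name and the output directory are made-up
placeholders, and it assumes you run it from the Nutch 1.x runtime directory
where bin/nutch lives. The -no* switches tell readseg to skip every part of the
segment except the content directory.

```shell
# Hypothetical paths: replace with a real segment from your own crawl.
SEGMENT="crawl/segments/20131113025700"
OUTDIR="segment_dump"

if [ -x bin/nutch ]; then
  # Dump only the raw fetched content (HTML included); skip the
  # fetch, generate, and parse parts of the segment.
  bin/nutch readseg -dump "$SEGMENT" "$OUTDIR" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
else
  # Assumption: the script is run from the Nutch runtime directory.
  echo "bin/nutch not found; run this from your Nutch runtime directory" >&2
fi
```

The dump should end up as a plain-text file inside the output directory, with
the fetched markup of each URL intact.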
-----Original message-----
> From:Reyes, Mark <[email protected]>
> Sent: Wednesday 13th November 2013 2:57
> To: [email protected]
> Subject: Preserve HTML that is being crawled from Nutch?
>
> Is there a way to preserve the HTML that is being crawled from Nutch 1.7?
>
> Specifically, instead of normalizing the information that is crawled into a
> long string value then assigning that to the ‘content’ key (if viewing in
> JSON), I’d like to see the markup itself as indexed.
>
> Thanks,
> Mark