RE: Preserve HTML that is being crawled from Nutch?

Markus Jelsma Wed, 13 Nov 2013 01:52:19 -0800

I am not sure what you mean. The raw content, including the HTML, is stored on 
disk by default: each segment has a content directory containing just that. But 
I don't know what you mean by "markup as indexed".
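
To look at the raw content that a segment holds, something like the sketch 
below may help. It uses the readseg tool from Nutch 1.x to dump only the 
content directory of one segment. The segment path and the output directory 
name are hypothetical placeholders, and it assumes you run it from the Nutch 
runtime directory where bin/nutch lives.

```shell
# Hypothetical paths; replace with a real segment from your crawl.
SEGMENT="crawl/segments/20131113015219"
OUT="segment_dump"

if [ -x bin/nutch ]; then
  # Dump the segment, skipping every part except the content directory,
  # so the output holds the raw fetched content (including the HTML).
  bin/nutch readseg -dump "$SEGMENT" "$OUT" \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
else
  echo "bin/nutch not found; run this from the Nutch runtime directory" >&2
fi
```

The dump should end up as plain text under the output directory, with one 
record per fetched URL.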
 
-----Original message-----
> From:Reyes, Mark <[email protected]>
> Sent: Wednesday 13th November 2013 2:57
> To: [email protected]
> Subject: Preserve HTML that is being crawled from Nutch?
> 
> Is there a way to preserve the HTML that is being crawled from Nutch 1.7?
> 
> Specifically, instead of normalizing the information that is crawled into a 
> long string value then assigning that to the ‘content’ key (if viewing in 
> JSON), I’d like to see the markup itself as indexed.
> 
> Thanks,
> Mark
