Preserve HTML that is being crawled from Nutch?

Reyes, Mark Tue, 12 Nov 2013 17:58:27 -0800

Is there a way to preserve the HTML of the pages that Nutch 1.7 crawls?

Specifically, instead of having the crawled page flattened into one long 
string and assigned to the ‘content’ key (as seen when viewing the index in 
JSON), I’d like the markup itself to be indexed.


Thanks,
Mark

