Hrishi,

The best solution depends on what you want to do with the HTML data.

Regarding your option (1): there is no need to change the Nutch code, as
Nutch already stores the fetched data in the content subdirectory of each
segment, as Content objects in a MapFile. A better option would be to write
a small MapReduce program and specify the content subdir of the segment(s)
as its input. These are standard Hadoop MapFiles, which can be read with
Hadoop's SequenceFileInputFormat, so your map function will receive Content
objects. Being implemented as MapReduce, this will be distributed and gives
you the possibility to implement any specific processing you want on the
map (or reduce) side.
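
To give an idea, here is a minimal sketch of such a job. It is only an
illustration and makes a few assumptions: a recent Hadoop with the
org.apache.hadoop.mapreduce API, the Nutch jar on the job classpath, and
pages that can be decoded as UTF-8. The class name HtmlDump, the "html"
content-type check and the two command-line arguments are my own choices,
not anything defined by Nutch.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.NutchConfiguration;

// Hypothetical map-only job: reads <segment>/content and writes (url, html) pairs.
public class HtmlDump {

  public static class HtmlMapper extends Mapper<Text, Content, Text, Text> {
    @Override
    protected void map(Text url, Content content, Context context)
        throws IOException, InterruptedException {
      String type = content.getContentType();
      // Assumption: anything whose content type mentions "html" is wanted.
      if (type == null || !type.contains("html")) {
        return;
      }
      // Assumption: the raw bytes are UTF-8; real pages may need charset detection.
      context.write(url, new Text(content.getContent()));
    }
  }

  public static void main(String[] args) throws Exception {
    // args[0] = segment directory, args[1] = output directory (both hypothetical)
    Configuration conf = NutchConfiguration.create();
    Job job = Job.getInstance(conf, "html-dump");
    job.setJarByClass(HtmlDump.class);

    // The content subdir holds MapFiles; SequenceFileInputFormat reads their data files.
    FileInputFormat.addInputPath(job, new Path(args[0], "content"));
    job.setInputFormatClass(SequenceFileInputFormat.class);

    job.setMapperClass(HtmlMapper.class);
    job.setNumReduceTasks(0); // map-only: no reduce step is needed for a plain dump

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Any processing specific to your application would go in the map function
in place of the simple write, and you can add a reducer if you need to
group or aggregate the pages.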

If you are not too familiar with Hadoop and writing Map Reduce code, I
recommend Tom White's excellent book (http://www.hadoopbook.com/).

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com


2009/5/26 Hrishikesh Agashe <[email protected]>

> Hi,
>
> After doing a bit of research it seems that there are two ways to get HTML
> data out from Nutch:
> 1. Change Nutch code to dump HTML data as it crawls
> 2. Use "readseg" command after crawling finishes and segments are
> generated.
>
> Is this correct?
>
> If so, I would like to know what approach is better. Specifically, in case
> of Nutch on Hadoop, does "readseg" operate in distributed way or it just
> operates on a single machine? If readseg just works on one machine, I don't
> think it's feasible if segment sizes are large. In that case first approach
> is better.
>
> Also, can anyone share their experiences for doing large crawls (1000s of
> websites) and extracting out HTML data?
>
> Thanks,
> --Hrishi
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>
