Hi,

After doing a bit of research, it seems there are two ways to get raw HTML data 
out of Nutch:
1. Modify the Nutch code to dump HTML data as it crawls.
2. Use the "readseg" command after the crawl finishes and segments are generated.

Is this correct?

If so, I would like to know which approach is better. Specifically, when 
running Nutch on Hadoop, does "readseg" operate in a distributed way, or does 
it run only on a single machine? If readseg runs on just one machine, it 
doesn't seem feasible for large segments; in that case the first approach 
would be better.
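For reference, here is a sketch of what option 2 might look like on a Nutch 1.x layout. The segment directory name and output path below are hypothetical placeholders; the `-no*` flags come from the `readseg` usage message and suppress the parts of the segment other than the fetched content:

```shell
# Dump the raw fetched content (which includes the HTML) of one segment
# to plain-text files under dump_out/. The segment timestamp directory
# and output path are placeholders -- adjust to your crawl layout.
bin/nutch readseg -dump crawl/segments/20240101000000 dump_out \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
```

Note that `-nocontent` is deliberately omitted so the content records (the HTML) are kept in the dump.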

Also, can anyone share their experience doing large crawls (thousands of 
websites) and extracting the HTML data?

Thanks,
--Hrishi

