Hi,

After doing a bit of research, it seems there are two ways to get HTML data out of Nutch:

1. Modify the Nutch code to dump HTML data as it crawls.
2. Use the "readseg" command after crawling finishes and segments are generated.
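For context on option 2, this is roughly the invocation I have in mind; the segment path and output directory below are placeholders, and the exact set of -no* flags may differ between Nutch versions:

```shell
# Dump only the raw fetched content (i.e. the HTML) of one segment,
# suppressing the other segment parts. Paths are examples only.
bin/nutch readseg -dump crawl/segments/20240101000000 dump_out \
    -nofetch -nogenerate -noparse -noparsedata -noparsetext
```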
Is this correct? If so, I would like to know which approach is better. Specifically, when running Nutch on Hadoop, does "readseg" operate in a distributed way, or does it run on a single machine? If "readseg" runs on only one machine, I don't think it's feasible for large segments; in that case the first approach would be better.

Also, can anyone share their experience doing large crawls (thousands of websites) and extracting the HTML data?

Thanks,
--Hrishi
