Hi there, I think I have it more or less thought out, but just in case I missed something, I would like to check with more experienced people.
I have set everything up to crawl our intranet with Nutch 0.7. I create the initial index with something like:

    bin/nutch crawl $MY_URL_FILE -dir $MY_CRAWL_DIR -depth X -topN Y

Then, periodically (daily?), I maintain that index with either:

- the "Maintenance Shell Script" from "Nutch - The Java Search Engine - Nutch Wiki", http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine, or
- the script from "IntranetRecrawl - Nutch Wiki", http://wiki.apache.org/nutch/IntranetRecrawl

Both seem to be more or less equivalent. After running one of those, one would restart the web application.

Then it is recommended to remove the whole $MY_CRAWL_DIR every now and then (every few months) and start all over. To do so, one could build the new crawl directory under a different name, then stop the web application, remove the old directory, rename the new one into its place, and start the web application again. (Rough sketches of both the daily recrawl and the directory swap are in the P.S. below, in case that makes the question clearer.)

Would that be more or less correct? Is there any particular reason to prefer one maintenance script over the other? I guess the recommended intervals for the recrawling and the full rebuild depend on the site, but is there any recommendation for a medium-sized intranet?

Also, in order to pick up the latest news, would you recommend configuring a separate, more frequent recrawl for the "news section" of the web site, and then making the whole-site recrawl less frequent?

Any advice is welcome.

Thanks,
D.
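
P.S. To make the "daily maintenance" part concrete, this is roughly how I was planning to drive it from cron. The script name recrawl.sh, its arguments, the paths and the schedule are placeholders of my own; the real invocation would come from whichever of the two wiki scripts I end up using.

    # crontab entry for the user that owns the Nutch install -- just a sketch;
    # recrawl.sh and its argument are placeholders, the exact invocation
    # depends on which of the two wiki scripts is chosen.
    # min hour dom mon dow  command
    30 2 * * * /opt/nutch/bin/recrawl.sh /opt/nutch/crawl >> /var/log/nutch-recrawl.log 2>&1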
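P.P.S. And this is the "remove everything and start over" procedure I described, sketched as a shell script. It assumes the search webapp runs in Tomcat; NUTCH_HOME, TOMCAT_HOME and the file locations are placeholders for my setup, and X / Y stand in for the same -depth / -topN values as in the initial crawl.

    #!/bin/sh
    # Sketch of the periodic full rebuild: crawl into a fresh directory while
    # the webapp keeps serving the old index, then swap directories during a
    # short stop. NUTCH_HOME and TOMCAT_HOME are placeholders for my install.

    NUTCH_HOME=/opt/nutch
    TOMCAT_HOME=/opt/tomcat
    MY_URL_FILE=$NUTCH_HOME/urls.txt
    MY_CRAWL_DIR=$NUTCH_HOME/crawl
    NEW_CRAWL_DIR=$MY_CRAWL_DIR.new

    # 1. Build a complete new index under a different name
    #    (X and Y as in the initial crawl).
    $NUTCH_HOME/bin/nutch crawl $MY_URL_FILE -dir $NEW_CRAWL_DIR -depth X -topN Y

    # 2. Stop the webapp, swap the crawl directories, start it again.
    $TOMCAT_HOME/bin/catalina.sh stop
    rm -rf $MY_CRAWL_DIR
    mv $NEW_CRAWL_DIR $MY_CRAWL_DIR
    $TOMCAT_HOME/bin/catalina.sh start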
