Hi there, I think I have it more or less thought out, but just in case I missed something, I would like to check with more experienced people.
I have set everything up to crawl our intranet with Nutch 0.7. I create the initial index with something like:

    bin/nutch crawl $MY_URL_FILE -dir $MY_CRAWL_DIR -depth X -topN Y

Then, periodically (daily?), I maintain that index with either:

- the "Maintenance Shell Script" from "Nutch - The Java Search Engine - Nutch Wiki", http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine, or
- the script from "IntranetRecrawl - Nutch Wiki", http://wiki.apache.org/nutch/IntranetRecrawl

Both seem to be more or less equivalent. After running one of those, one would restart the web application.

Then it is recommended to remove the whole $MY_CRAWL_DIR every now and then (every few months) and start all over. To do so, one could build the new crawl directory under a different name, then stop the web application, remove the old directory, rename the new one into its place, and start the web application again. (Rough sketches of both the daily recrawl and the directory swap are in the P.S. below, in case that makes the question clearer.)

Would that be more or less correct? Is there any particular reason to prefer one maintenance script over the other? I guess the recommended intervals for the recrawling and the full rebuild depend on the site, but is there any recommendation for a medium-sized intranet?

Also, in order to pick up the latest news, would you recommend configuring a separate, more frequent recrawl for the "news section" of the web site, and then making the whole-site recrawl less frequent?

Any advice is welcome.

Thanks,
D.
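
P.S. To make the "daily maintenance" part concrete, this is roughly how I was planning to drive it from cron. The script name recrawl.sh, its arguments, the paths and the schedule are placeholders of my own; the real invocation would come from whichever of the two wiki scripts I end up using.

    # crontab entry for the user that owns the Nutch install -- just a sketch;
    # recrawl.sh and its argument are placeholders, the exact invocation
    # depends on which of the two wiki scripts is chosen.
    # min hour dom mon dow  command
    30 2 * * * /opt/nutch/bin/recrawl.sh /opt/nutch/crawl >> /var/log/nutch-recrawl.log 2>&1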
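P.P.S. And this is the "remove everything and start over" procedure I described, sketched as a shell script. It assumes the search webapp runs in Tomcat; NUTCH_HOME, TOMCAT_HOME and the file locations are placeholders for my setup, and X / Y stand in for the same -depth / -topN values as in the initial crawl.

    #!/bin/sh
    # Sketch of the periodic full rebuild: crawl into a fresh directory while
    # the webapp keeps serving the old index, then swap directories during a
    # short stop. NUTCH_HOME and TOMCAT_HOME are placeholders for my install.

    NUTCH_HOME=/opt/nutch
    TOMCAT_HOME=/opt/tomcat
    MY_URL_FILE=$NUTCH_HOME/urls.txt
    MY_CRAWL_DIR=$NUTCH_HOME/crawl
    NEW_CRAWL_DIR=$MY_CRAWL_DIR.new

    # 1. Build a complete new index under a different name
    #    (X and Y as in the initial crawl).
    $NUTCH_HOME/bin/nutch crawl $MY_URL_FILE -dir $NEW_CRAWL_DIR -depth X -topN Y

    # 2. Stop the webapp, swap the crawl directories, start it again.
    $TOMCAT_HOME/bin/catalina.sh stop
    rm -rf $MY_CRAWL_DIR
    mv $NEW_CRAWL_DIR $MY_CRAWL_DIR
    $TOMCAT_HOME/bin/catalina.sh start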
