Hi,

There is a wonderful discussion on the Heritrix mailing list. I couldn't help forwarding some of it here, and I hope it is useful for Nutch.
---------------------------------------------------------------------------------------------------------
Dennis Hotson wrote:
> I'm just wondering whether anyone has written a filter or module to do
> incremental crawling.

You've seen the AdaptiveRevisitingFrontier Frontier? It's described in outline here, http://crawler.archive.org/articles/user_manual.html#arf, and in detail here: http://vefsofnun.bok.hi.is/thesis/ar.pdf.

> What I mean is something that will do a HEAD request on pages and then
> only fetch the actual content if the page has been updated (newer last-
> modified date or similar). This technique saves a lot of bandwidth and
> can speed up crawling for sites that aren't updated very often.
>
> I've written a proof of concept filter class that does this (well
> actually, it's not quite working yet).

How does your filter work?

St.Ack

> If somebody else has already solved this problem it would save me a lot
> of effort. Thanks! :D
>
> Cheers,
> Dennis

Regards
/Jack
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
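
For anyone who wants to experiment with the idea Dennis describes, here is a minimal Python sketch of the HEAD-then-fetch check. It is not his filter and not Heritrix or Nutch code; the function names `is_modified` and `fetch_if_updated`, the `opener` parameter, and the policy of refetching whenever the Last-Modified header is missing or unparseable are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def is_modified(last_modified_header, last_crawled):
    """Return True if the page should be refetched.

    last_modified_header: value of the Last-Modified header, or None.
    last_crawled: timezone-aware datetime of the previous fetch, or None.
    """
    if last_crawled is None or not last_modified_header:
        # No crawl history or no header: be conservative and refetch.
        return True
    try:
        modified = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        # Unparseable date: refetch rather than risk serving stale content.
        return True
    if modified.tzinfo is None:
        # Treat a date without a usable zone as UTC (assumption).
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_crawled


def fetch_if_updated(url, last_crawled, opener=urlopen):
    """Send a HEAD request first; download the body only if it looks newer.

    Returns the body as bytes, or None when the page appears unchanged.
    `opener` is a hypothetical hook so the network call can be swapped out.
    """
    with opener(Request(url, method="HEAD")) as response:
        header = response.headers.get("Last-Modified")
    if not is_modified(header, last_crawled):
        return None  # Unchanged since the last crawl: skip the download.
    with opener(Request(url, method="GET")) as response:
        return response.read()
```

A crawler would call `fetch_if_updated(url, last_crawled)` with the time of its previous visit and skip the page when it gets `None` back. A related option is a conditional GET with an `If-Modified-Since` header, which lets the server answer 304 Not Modified and saves the second round trip.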
