Hi, I'm currently working with the Nutch 0.8.1 crawler for a university project. I need to completely crawl an intranet website.
Problem: a full crawl takes time and resources. I have read many posts on this mailing list about incremental crawling, and I confess I don't understand everything.

At http://wiki.apache.org/nutch/Automating_Fetches_with_Python there is a Python script for incremental updates, but it is based on the DB_unfetched flag. Since I crawl ALL of my intranet site, I shouldn't have any unfetched pages, so this script should do nothing. Am I wrong?

At http://issues.apache.org/jira/browse/NUTCH-61 there is a patch for the 0.8.1 release of Nutch which allows crawling only updated pages, but I don't understand how it works or how to use the Nutch crawler after applying it. In addition, the patch is announced as untested and unstable.

So the question is: is it actually possible to use the Nutch crawler to download and index only the pages (HTML, PDF, DOC, etc.) that have been updated since the last crawl (based on the HTTP protocol), and if so, how?

Thanks,
Bonardo Pascal
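For context, the HTTP mechanism the question alludes to is the conditional GET: on a re-crawl, the client resends the Last-Modified timestamp saved from the previous fetch in an If-Modified-Since header, and a 304 Not Modified response means the page can be skipped without downloading or re-indexing it. This is a minimal illustrative sketch of that logic, not Nutch code; the function names and the in-memory cache are assumptions for the example.

```python
# Sketch of the conditional-GET logic behind HTTP-based incremental
# crawling (hypothetical helper names; not part of Nutch).

def build_request_headers(url, last_fetch):
    """Build headers for a conditional GET.

    last_fetch is an assumed in-memory map of URL -> the Last-Modified
    value recorded during the previous crawl.
    """
    headers = {}
    if url in last_fetch:
        headers["If-Modified-Since"] = last_fetch[url]
    return headers

def needs_refetch(status_code):
    """304 means the server's copy is unchanged and the page can be
    skipped; any other status (e.g. 200) means it must be re-fetched."""
    return status_code != 304

# Example: a page fetched last crawl gets a conditional header;
# a never-seen page gets a plain unconditional GET.
last_fetch = {"http://intranet/index.html": "Tue, 01 Aug 2006 10:00:00 GMT"}
print(build_request_headers("http://intranet/index.html", last_fetch))
print(build_request_headers("http://intranet/new-page.html", last_fetch))
print(needs_refetch(304), needs_refetch(200))
```

Note that this scheme only helps when the intranet's web server actually emits Last-Modified headers and honors If-Modified-Since; dynamically generated pages often do not.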
