Hi, I'm currently working with the Nutch 0.8.1 crawler for a university project. I need to completely crawl an intranet website.
Problem: a full crawl takes time and resources. I have read many posts on this mailing list about incremental crawling, and I confess I don't understand everything.

At http://wiki.apache.org/nutch/Automating_Fetches_with_Python there is a Python script for incremental updates, but it is based on the DB_unfetched flag. Since I crawl ALL of my intranet site, I shouldn't have any unfetched pages, so this script should do nothing. Am I wrong?

At http://issues.apache.org/jira/browse/NUTCH-61 there is a patch for the 0.8.1 release of Nutch which allows crawling only updated pages, but I don't understand how it works or how to use the Nutch crawler after applying it. In addition, the patch is announced as untested and unstable.

So the question is: is it actually possible to use the Nutch crawler to download and index only the pages (HTML, PDF, DOC, etc.) that have been updated since the last crawl (based on the HTTP protocol), and if so, how?

Thanks,
Bonardo Pascal
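For context, the HTTP mechanism the question alludes to is the conditional GET: on a re-crawl, the client resends the Last-Modified timestamp saved from the previous fetch in an If-Modified-Since header, and a 304 Not Modified response means the page can be skipped without downloading or re-indexing it. This is a minimal illustrative sketch of that logic, not Nutch code; the function names and the in-memory cache are assumptions for the example.

```python
# Sketch of the conditional-GET logic behind HTTP-based incremental
# crawling (hypothetical helper names; not part of Nutch).

def build_request_headers(url, last_fetch):
    """Build headers for a conditional GET.

    last_fetch is an assumed in-memory map of URL -> the Last-Modified
    value recorded during the previous crawl.
    """
    headers = {}
    if url in last_fetch:
        headers["If-Modified-Since"] = last_fetch[url]
    return headers

def needs_refetch(status_code):
    """304 means the server's copy is unchanged and the page can be
    skipped; any other status (e.g. 200) means it must be re-fetched."""
    return status_code != 304

# Example: a page fetched last crawl gets a conditional header;
# a never-seen page gets a plain unconditional GET.
last_fetch = {"http://intranet/index.html": "Tue, 01 Aug 2006 10:00:00 GMT"}
print(build_request_headers("http://intranet/index.html", last_fetch))
print(build_request_headers("http://intranet/new-page.html", last_fetch))
print(needs_refetch(304), needs_refetch(200))
```

Note that this scheme only helps when the intranet's web server actually emits Last-Modified headers and honors If-Modified-Since; dynamically generated pages often do not.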
