Hi,

I'd like to be able to immediately remove certain pages from Nutch (index, 
crawldb, linkdb...).
The scenario is that I'm using Nutch to index a single site or a set of
internal sites.  Once in a while, editors remove a page from the site.
When that happens, I want to update at least the index, and ideally the
crawldb and linkdb as well, so that people searching the index don't see the
missing page in their results, click through, and hit a 404.

I don't think there is a "direct" way to do this with Nutch, is there?
If there really is no direct way to do this, I was thinking I'd just put the
URL of the recently removed page into the next fetchlist and then somehow
get Nutch to remove that page/URL immediately once the fetch returns a 404.
How does that sound?

Is there a way to configure Nutch to delete a page after it gets a 404 for it
even just once?  I thought I saw a setting for that somewhere a few weeks
ago, but now I can't find it.

Thanks,
Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
