Andrzej Bialecki wrote:
> [EMAIL PROTECTED] wrote:
>> Hi Enis,
>>
>> Right, I can easily delete the page from the Lucene index, though I'd
>> prefer to follow the Nutch protocol and avoid messing something up by
>> touching the index directly. However, I don't want that page to
>> re-appear in one of the subsequent fetches. Well, it won't
>> re-appear, because it will remain missing, but it would be great to
>> be able to tell Nutch to "forget it" "from everywhere". Is that
>> doable? I could read and re-write the *Db Maps, but that's a lot of
>> IO... just to get a couple of URLs erased. I'd prefer a friendly
>> persuasion where Nutch flags a given page as "forget this page as
>> soon as possible" and it just happens later on.
>
> Somehow you need to flag those pages, and keep track of them, so they
> have to remain in CrawlDb.
>
> The simplest way to do this is, I think, through the scoring filter API:
> you can add your own filter which, during the updatedb operation, flags
> unwanted URLs (by putting a piece of metadata in the CrawlDatum), and
> then during the generate step checks this metadata and returns a
> generator score of Float.MIN_VALUE - which means the page will never
> be selected for fetching as long as there are other unfetched pages.
>
> You can also modify the Generator to completely skip such flagged pages.

Maybe we should permanently remove the URLs that failed fetching k times from the CrawlDb, during the updatedb operation. Since the web is highly dynamic, there can be as many gone sites as new sites (or slightly fewer). As far as I know, once a URL is entered into the CrawlDb it will stay there with one of the possible states: STATUS_DB_UNFETCHED, STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED. Am I right?
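The flag-then-sort-last idea above could be sketched in plain Java. This is a minimal, self-contained illustration of the logic only, not Nutch's actual ScoringFilter plugin API: the class name GoneUrlFilter, the failure counter, and the MAX_RETRIES constant are all hypothetical. A real implementation would put the flag in the CrawlDatum's metadata during updatedb and read it back in generatorSortValue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the "flag gone pages, sort them last" idea.
// Not the real Nutch ScoringFilter interface - names are illustrative.
public class GoneUrlFilter {

    // Number of failed fetches before a URL is flagged as "forget me" (assumed k).
    static final int MAX_RETRIES = 3;

    private final Map<String, Integer> failures = new HashMap<>();
    private final Map<String, Boolean> flagged = new HashMap<>();

    // Simulates the updatedb step: count consecutive fetch failures
    // and flag the URL once it has failed MAX_RETRIES times.
    public void updateDbScore(String url, boolean fetchFailed) {
        if (fetchFailed) {
            int n = failures.merge(url, 1, Integer::sum);
            if (n >= MAX_RETRIES) {
                flagged.put(url, true);
            }
        } else {
            failures.remove(url); // a successful fetch resets the counter
        }
    }

    // Simulates the generate step: flagged pages get Float.MIN_VALUE
    // (the smallest positive float), so they sort below every normally
    // scored page and are never selected while unfetched pages remain.
    public float generatorSortValue(String url, float initSort) {
        return Boolean.TRUE.equals(flagged.get(url)) ? Float.MIN_VALUE : initSort;
    }
}
```

Note that Java's Float.MIN_VALUE is the smallest positive float, not the most negative one, which is exactly why it works here: it sorts below any real page score without needing a special sentinel.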
This way Otis's case will also be resolved.

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
