[EMAIL PROTECTED] wrote:
> Hi Enis,
>
> Right, I can easily delete the page from the Lucene index, though I'd
> prefer to follow the Nutch protocol and avoid messing something up by
> touching the index directly. However, I don't want that page to
> re-appear in one of the subsequent fetches. Well, it won't re-appear,
> because it will remain missing, but it would be great to be able to
> tell Nutch to "forget it" from everywhere. Is that doable? I could
> read and re-write the *Db maps, but that's a lot of I/O just to get a
> couple of URLs erased. I'd prefer a friendly persuasion where Nutch
> flags a given page as "forget this page as soon as possible" and it
> just happens later on.
Somehow you need to flag those pages and keep track of them, so they have to remain in the CrawlDb. The simplest way to do this is, I think, through the scoring filter API: you can add your own filter which, during the updatedb operation, flags unwanted URLs (by putting a piece of metadata in the CrawlDatum). Then, during the generate step, the filter checks this metadata and returns a generator sort value of Float.MIN_VALUE, which means the page will never be selected for fetching as long as there are other unfetched pages. You can also modify the Generator to completely skip such flagged pages.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
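The flag-and-skip flow above can be sketched in plain Java. This is only an illustration: the stand-in `CrawlDatum` class and method names mimic the shape of Nutch's real `ScoringFilter` hooks (`updateDbScore`, `generatorSortValue`) but are not the actual interface, and the `FORGET_KEY` metadata key and `isUnwanted` policy are invented for the example; check your Nutch version's `org.apache.nutch.scoring.ScoringFilter` for the real signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of the "flag in updatedb, demote in generate" idea.
// Real Nutch plugins implement org.apache.nutch.scoring.ScoringFilter and
// store metadata on org.apache.nutch.crawl.CrawlDatum; these stand-in types
// only illustrate the flow.
public class ForgetPageFilter {

    // Stand-in for CrawlDatum's metadata map.
    static class CrawlDatum {
        Map<String, String> metadata = new HashMap<>();
    }

    // Hypothetical metadata key used to mark pages to "forget".
    static final String FORGET_KEY = "forget";

    // updatedb step: stamp unwanted URLs so the flag persists in the CrawlDb.
    static void updateDbScore(String url, CrawlDatum datum) {
        if (isUnwanted(url)) {
            datum.metadata.put(FORGET_KEY, "true");
        }
    }

    // generate step: flagged pages get the lowest possible sort value, so
    // they are never selected while other unfetched pages remain.
    static float generatorSortValue(CrawlDatum datum, float initSort) {
        if ("true".equals(datum.metadata.get(FORGET_KEY))) {
            return Float.MIN_VALUE;
        }
        return initSort;
    }

    // Placeholder policy; in practice this would read a URL list or patterns.
    static boolean isUnwanted(String url) {
        return url.contains("/private/");
    }

    public static void main(String[] args) {
        CrawlDatum flagged = new CrawlDatum();
        updateDbScore("http://example.com/private/page", flagged);
        System.out.println(generatorSortValue(flagged, 1.0f));

        CrawlDatum normal = new CrawlDatum();
        updateDbScore("http://example.com/ok", normal);
        System.out.println(generatorSortValue(normal, 1.0f));
    }
}
```

Note that Java's `Float.MIN_VALUE` is the smallest *positive* float (about 1.4E-45), not the most negative one; it works here because it still sorts below any normal page score.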
