[EMAIL PROTECTED] wrote:
> Hi Enis,
>
> Right, I can easily delete the page from the Lucene index, though I'd
> prefer to follow the Nutch protocol and avoid messing something up by
> touching the index directly. However, I don't want that page to
> re-appear in one of the subsequent fetches. Well, it won't re-appear,
> because it will remain missing, but it would be great to be able to
> tell Nutch to "forget it" from everywhere. Is that doable? I could
> read and re-write the *Db maps, but that's a lot of I/O just to get a
> couple of URLs erased. I'd prefer a friendly persuasion where Nutch
> flags a given page as "forget this page as soon as possible" and it
> just happens later on.
Somehow you need to flag those pages and keep track of them, so they have to remain in the CrawlDb. The simplest way to do this is, I think, through the scoring filter API: you can add your own filter which, during the updatedb operation, flags unwanted URLs (by putting a piece of metadata in the CrawlDatum). Then, during the generate step, the filter checks this metadata and returns a generator sort value of Float.MIN_VALUE, which means the page will never be selected for fetching as long as there are other unfetched pages. You can also modify the Generator to completely skip such flagged pages.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
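The flag-and-skip flow above can be sketched in plain Java. This is only an illustration: the stand-in `CrawlDatum` class and method names mimic the shape of Nutch's real `ScoringFilter` hooks (`updateDbScore`, `generatorSortValue`) but are not the actual interface, and the `FORGET_KEY` metadata key and `isUnwanted` policy are invented for the example; check your Nutch version's `org.apache.nutch.scoring.ScoringFilter` for the real signatures.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of the "flag in updatedb, demote in generate" idea.
// Real Nutch plugins implement org.apache.nutch.scoring.ScoringFilter and
// store metadata on org.apache.nutch.crawl.CrawlDatum; these stand-in types
// only illustrate the flow.
public class ForgetPageFilter {

    // Stand-in for CrawlDatum's metadata map.
    static class CrawlDatum {
        Map<String, String> metadata = new HashMap<>();
    }

    // Hypothetical metadata key used to mark pages to "forget".
    static final String FORGET_KEY = "forget";

    // updatedb step: stamp unwanted URLs so the flag persists in the CrawlDb.
    static void updateDbScore(String url, CrawlDatum datum) {
        if (isUnwanted(url)) {
            datum.metadata.put(FORGET_KEY, "true");
        }
    }

    // generate step: flagged pages get the lowest possible sort value, so
    // they are never selected while other unfetched pages remain.
    static float generatorSortValue(CrawlDatum datum, float initSort) {
        if ("true".equals(datum.metadata.get(FORGET_KEY))) {
            return Float.MIN_VALUE;
        }
        return initSort;
    }

    // Placeholder policy; in practice this would read a URL list or patterns.
    static boolean isUnwanted(String url) {
        return url.contains("/private/");
    }

    public static void main(String[] args) {
        CrawlDatum flagged = new CrawlDatum();
        updateDbScore("http://example.com/private/page", flagged);
        System.out.println(generatorSortValue(flagged, 1.0f));

        CrawlDatum normal = new CrawlDatum();
        updateDbScore("http://example.com/ok", normal);
        System.out.println(generatorSortValue(normal, 1.0f));
    }
}
```

Note that Java's `Float.MIN_VALUE` is the smallest *positive* float (about 1.4E-45), not the most negative one; it works here because it still sorts below any normal page score.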
