Andrzej Bialecki wrote:
> [EMAIL PROTECTED] wrote:
>> Hi Enis,
>>
>
>> Right, I can easily delete the page from the Lucene index, though I'd
>> prefer to follow the Nutch protocol and avoid messing something up by
>> touching the index directly.  However, I don't want that page to
>> re-appear in one of the subsequent fetches.  Well, it won't
>> re-appear, because it will remain missing, but it would be great to
>> be able to tell Nutch to "forget it" "from everywhere".  Is that
>> doable? I could read and re-write the *Db Maps, but that's a lot of
>> IO... just to get a couple of URLs erased.  I'd prefer a friendly
>> persuasion where Nutch flags a given page as "forget this page as
>> soon as possible" and it just happens later on.
>
> Somehow you need to flag those pages, and keep track of them, so they 
> have to remain in the CrawlDb.
>
> The simplest way to do this is, I think, through the scoring filter API 
> - you can add your own filter which, during the updatedb operation, 
> flags unwanted URLs (by putting a piece of metadata in the CrawlDatum), 
> and then during the generate step checks this metadata and returns 
> generateScore = Float.MIN_VALUE - which means the page will never be 
> selected for fetching as long as there are other unfetched pages.
>
> You can also modify the Generator to completely skip such flagged pages.
>
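
For reference, such a filter might look roughly like the sketch below. This
is a minimal sketch only, assuming the 0.9-era ScoringFilter interface
(where the generate-time hook is called generatorSortValue); the metadata
key "_forget_" is made up, and the remaining interface methods
(injectedScore, initialScore, etc.) are omitted here - a real plugin would
implement them as simple pass-throughs:

import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.scoring.ScoringFilterException;

public class ForgetPagesFilter {

  // Hypothetical marker key stored in the CrawlDatum metadata.
  private static final Text FORGET_KEY = new Text("_forget_");

  // Runs during updatedb: flag pages whose status says they are gone,
  // so that later steps can recognize them.
  public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum,
      List inlinked) throws ScoringFilterException {
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      datum.getMetaData().put(FORGET_KEY, new Text("1"));
    }
  }

  // Runs during generate: push flagged pages to the bottom of the sort
  // order so they are never selected while other pages remain.
  public float generatorSortValue(Text url, CrawlDatum datum, float initSort)
      throws ScoringFilterException {
    if (datum.getMetaData().containsKey(FORGET_KEY)) {
      return Float.MIN_VALUE;
    }
    return initSort;
  }
}

The filter would then be activated through plugin.includes like any other
scoring filter plugin.
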
Maybe we should permanently remove the URLs that failed fetching k times 
from the CrawlDb during the updatedb operation. Since the web is highly 
dynamic, there can be as many gone sites as new sites (or slightly fewer). 
As far as I know, once a URL is entered into the CrawlDb it stays there 
with one of the possible states: STATUS_DB_UNFETCHED, 
STATUS_DB_FETCHED, STATUS_DB_GONE, STATUS_LINKED. Am I right?

This way, Otis's case would also be resolved.
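
A pruning pass along those lines could be a simple map-only job over the
CrawlDb, something like the sketch below. The class name, the job wiring
and the threshold k are made up, and I'm assuming the generics-enabled
org.apache.hadoop.mapred API; only CrawlDatum and its status/retry
accessors are existing Nutch API:

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.nutch.crawl.CrawlDatum;

public class PruneGoneUrls extends MapReduceBase
    implements Mapper<Text, CrawlDatum, Text, CrawlDatum> {

  private static final int K = 3; // hypothetical retry threshold

  public void map(Text url, CrawlDatum datum,
      OutputCollector<Text, CrawlDatum> output, Reporter reporter)
      throws IOException {
    // Gone and already retried K times: drop the record entirely
    // instead of carrying it in the CrawlDb forever.
    if (datum.getStatus() == CrawlDatum.STATUS_DB_GONE
        && datum.getRetriesSinceFetch() >= K) {
      return; // not collected, so it disappears from the new CrawlDb
    }
    output.collect(url, datum);
  }
}

Writing the surviving records to a new CrawlDb directory and swapping it in
would keep the operation atomic, the same way updatedb already does.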

