I understand that "mergedb ... -filter" can be used to remove links that do not pass the active URLFilters. However, mergedb operates on the whole crawldb and can be very slow. Is there a way of enforcing URL filtering at updatedb time, so that unfetchable links never enter the database in the first place?
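For reference, this is roughly what I run today (the paths are just placeholders for my own crawl directories):

    bin/nutch mergedb crawl/crawldb_filtered crawl/crawldb -filter

What I am hoping for is something along the lines of a -filter switch on updatedb itself, if such a thing exists or is planned, e.g.:

    bin/nutch updatedb crawl/crawldb crawl/segments/20070601123456 -filter

so that the URLFilters are applied to newly discovered links before they are written to the crawldb, instead of pruning the whole database afterwards.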
I have a similar issue with links that result in HTTP timeouts. How can I get rid of them, so that they don't periodically come back and slow down my fetching?

Thanks in advance,
Enzo
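P.S. The only related settings I have found so far are http.timeout and db.fetch.retry.max in nutch-default.xml. I am not sure whether overriding them actually keeps the timed-out URLs from being re-generated, but this is what I have been experimenting with in nutch-site.xml (the value of 1 is just a guess on my part):

    <property>
      <name>db.fetch.retry.max</name>
      <value>1</value>
      <description>Lower the retry count so that URLs which keep timing out
      are given up on sooner (my assumption, not verified).</description>
    </property>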
