I understand that "mergedb ... -filter" can be used to remove links that do not pass the active URLFilters. However, mergedb operates on the whole crawldb and can be very slow. Is there a way of enforcing filtering at updatedb time, so that unfetchable links never enter the database in the first place?
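For reference, what I run today is roughly the following (the crawldb paths are just examples):

  bin/nutch mergedb crawldb_filtered crawldb -filter

What I am hoping for is the equivalent of that -filter behaviour applied while updatedb merges in the segments, so the filtering cost is paid only on the newly discovered links rather than on the entire database each time.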

I have a similar issue with links that result in HTTP timeouts. How can I get rid of them, so that they don't come back periodically and slow down my fetching?
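(I assume the relevant settings on my side are http.timeout and db.fetch.retry.max, configured in nutch-site.xml along the lines of:

  <property>
    <name>db.fetch.retry.max</name>
    <value>3</value>
  </property>

but as far as I can tell tuning those only changes how the URLs are retried and marked; it does not actually remove them from the crawldb, which is what I am after.)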

Thanks in advance,

Enzo
