I notice that URL filtering is done in the output format during parsing.
Wouldn't it be better to filter when updating the crawlDb?
Sure, it would require some more disk space, but since parsing
is done during fetching, it may improve fetching speed.
Stefan
On 08.03.2006 at 18:53, Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
Don't generate URLs that don't pass URLFilters.
Just to be clear, this is to support folks changing their filters
while they're crawling, right? We already filter before we put
things into the db, so we're filtering twice now, no? If so, then
perhaps there should be an option to disable this second filtering
for folks who don't change their filters?
Doug
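
To make the suggested option concrete, here is a minimal sketch of a generate step that can skip the second filtering pass. It is not the actual Nutch code: the class GenerateFilterSketch, the UrlFilter interface, the selectForFetch method, and the filterOnGenerate flag are all hypothetical names made up for illustration. The only assumption about the filter is that it returns null for a rejected URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class GenerateFilterSketch {

    /** Hypothetical stand-in for a URL filter: returns the URL if accepted, null if rejected. */
    interface UrlFilter {
        String filter(String url);
    }

    /**
     * Selects URLs for fetching. When filterOnGenerate is false, the filter is
     * never consulted and every URL already in the db is passed through.
     */
    static List<String> selectForFetch(List<String> dbUrls, UrlFilter filter,
                                       boolean filterOnGenerate) {
        List<String> selected = new ArrayList<>();
        for (String url : dbUrls) {
            // Second filtering pass: only useful if the filters changed
            // after the URL was first put into the db.
            if (filterOnGenerate && filter.filter(url) == null) {
                continue;
            }
            selected.add(url);
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> dbUrls = Arrays.asList(
                "http://example.com/index.html",
                "http://example.com/logo.gif");
        // Hypothetical filter added mid-crawl: reject image URLs.
        UrlFilter noGifs = url -> url.endsWith(".gif") ? null : url;

        System.out.println(selectForFetch(dbUrls, noGifs, true));
        System.out.println(selectForFetch(dbUrls, noGifs, false));
    }
}
```

With the flag off, the per-URL filter call is skipped entirely, which is the saving for folks who never change their filters during a crawl.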
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net