I notice that URL filtering is done in the output format during parsing. Wouldn't it be better to filter while updating the crawlDb? Sure, it would require some more disk space, but since parsing is done during fetching, it might improve fetching speed.

Stefan

On 08.03.2006 at 18:53, Doug Cutting wrote:

[EMAIL PROTECTED] wrote:
Don't generate URLs that don't pass URLFilters.

Just to be clear, this is to support folks changing their filters while they're crawling, right? We already filter before we put things into the db, so we're filtering twice now, no? If so, then perhaps there should be an option to disable this second filtering for folks who don't change their filters?

Doug
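
The option Doug suggests could look roughly like the sketch below. It is only an illustration: the class GenerateFilterSketch, the method selectForFetch, and the filterEnabled flag are hypothetical names made up here, not Nutch classes or configuration properties, and a plain Predicate stands in for the URL filters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; none of these names are Nutch classes or options.
class GenerateFilterSketch {

    // Returns the URLs to put on the fetch list. When filterEnabled is false,
    // candidates are passed through unchanged, on the assumption that they
    // were already filtered when they were added to the db.
    static List<String> selectForFetch(List<String> candidates,
                                       Predicate<String> urlFilter,
                                       boolean filterEnabled) {
        if (!filterEnabled) {
            return new ArrayList<>(candidates);
        }
        List<String> accepted = new ArrayList<>();
        for (String url : candidates) {
            if (urlFilter.test(url)) {
                accepted.add(url);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // A URL that was accepted into the db before the filter rules changed.
        List<String> inDb = List.of("http://example.com/a", "http://example.com/b.gif");
        Predicate<String> noImages = url -> !url.endsWith(".gif");

        // Second filtering on: catches URLs that no longer pass the filters.
        System.out.println(selectForFetch(inDb, noImages, true));
        // Second filtering off: cheaper for folks who never change their filters.
        System.out.println(selectForFetch(inDb, noImages, false));
    }
}
```

With the flag off, the generate step skips the second pass entirely, which is the trade-off raised above: it only pays off for crawls whose filters stay fixed.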



---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net

