Stefan Groschupf wrote:
I notice that filtering URLs is done in the output format until parsing. Wouldn't it be better to filter them until updating the crawlDb?

"Until" == "during" ?

As you observed, doing it at this stage saves space in segment data, and in consequence saves on processing time (no CPU/IO needed to process useless data, throw away junk as soon as possible).
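To illustrate the point, here is a minimal sketch of filtering outlinks before they are written out. It is not the actual Nutch code: the class EarlyUrlFilter, the ACCEPT predicate and the filterOutlinks method are hypothetical names made up for this example, and the acceptance rule is only a stand-in for whatever URL filters are configured.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class EarlyUrlFilter {

    // Hypothetical rule: keep only http(s) URLs that do not point at images.
    public static final Predicate<String> ACCEPT = url ->
        (url.startsWith("http://") || url.startsWith("https://"))
            && !url.matches(".*\\.(gif|jpg|png)$");

    // Drop rejected outlinks before they are stored, so later stages
    // never spend CPU or I/O on them.
    public static List<String> filterOutlinks(List<String> outlinks,
                                              Predicate<String> accept) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (accept.test(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://example.com/page.html",
            "mailto:someone@example.com",
            "http://example.com/logo.gif");
        // Only the first URL passes the filter and would be written out.
        System.out.println(filterOutlinks(outlinks, ACCEPT));
    }
}
```

Filtering later, when updating the crawlDb, would give the same final set of URLs, but the two rejected entries would have been written and read back at least once in between.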

Sure, it would require some more disk space, but since parsing is done until fetching it may improve fetching speed.

Parsing is not always done at the fetching stage (Fetcher.parsing == false).

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

