the solutions for bot-traps and refetching in OC could probably be
combined into one.
1) Refetching will look at the FetcherOutput of the last
run and queue the URLs by domain name
(for HTTP/1.1), as your FetcherThread does.
2) We might just count the number of URLs within
Hi Kelvin:
1) bot-traps problem for OC
If we keep a crawl depth for each starting host, it
seems that the crawl is guaranteed to terminate
(we can decrement the depth value each time an outlink
falls in the same host domain).
Let me know if my thinking is wrong.
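
To make the depth idea concrete, here is a minimal sketch in Python. It is not OC or Nutch code: crawl_with_depth_budget, get_outlinks and calendar_trap are hypothetical names made up for illustration, and skipping off-host links is an assumption of the sketch, not something decided in this thread.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_with_depth_budget(seed, get_outlinks, max_depth):
    """Breadth-first crawl of one starting host with a per-host depth budget."""
    seed_host = urlparse(seed).hostname
    queue = deque([(seed, max_depth)])  # (url, remaining depth)
    seen = {seed}
    visited = []
    while queue:
        url, remaining = queue.popleft()
        visited.append(url)
        if remaining == 0:
            continue  # budget exhausted: do not follow outlinks
        for link in get_outlinks(url):
            # Assumption: off-host links are simply skipped in this sketch.
            if urlparse(link).hostname != seed_host or link in seen:
                continue
            seen.add(link)
            queue.append((link, remaining - 1))  # same host: decrement depth
    return visited


def calendar_trap(url):
    """Hypothetical bot trap: every page links to a 'next' page, forever."""
    prefix, _, n = url.rpartition("/")
    return [f"{prefix}/{int(n) + 1}", "http://other.example/0"]
```

Calling crawl_with_depth_budget("http://trap.example/0", calendar_trap, 3) visits pages /0 through /3 and then stops, even though the trap never runs out of links.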
2) refetching
If OC's
Michael,
On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote:
Hi Kelvin:
1) bot-traps problem for OC
If we keep a crawl depth for each starting host, it seems that
the crawl is guaranteed to terminate (we can decrement the depth
value each time an outlink falls in the same host