fetcher improvements (was: Re: 0.8 much slower than 0.7)

Sami Siren Tue, 01 Aug 2006 09:05:29 -0700

Stefan Groschupf wrote:

Hi,
I have some code using a queue-based mechanism and Java NIO.
In my tests it is 4 times faster than the existing fetcher.
But:
+ I need to fix some more bugs
+ we need to refactor the robots.txt part since it is not usable outside the http protocols yet.


IMO, the politeness code should also be taken out of http
and made protocol independent.
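
To illustrate what protocol-independent politeness could look like, here is a minimal sketch. The class name HostPoliteness and its method are hypothetical and are not existing Nutch code; the idea is only that the per-host delay is tracked in one place that any protocol plugin could consult before contacting a host.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: remembers when each host may be
// contacted next, without knowing which protocol performs the fetch.
final class HostPoliteness {
    private final long delayMillis;
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostPoliteness(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Returns true and reserves the host if it may be contacted at nowMillis;
    // returns false if the host was contacted less than delayMillis ago.
    synchronized boolean tryAcquire(String host, long nowMillis) {
        Long next = nextAllowed.get(host);
        if (next != null && nowMillis < next) {
            return false;
        }
        nextAllowed.put(host, nowMillis + delayMillis);
        return true;
    }
}
```

A fetcher thread would call tryAcquire(host, System.currentTimeMillis()) and put the URL back on its queue when the call returns false.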

+ the fetcher does not support pluggable protocols - only http.

I see two ways to go.
Refactor the existing robots.txt parser and handling, but this is a big change.

We should do the refactoring, because it would also greatly benefit the current fetcher if we could schedule fetching of robots.txt before we try to get the content itself, e.g. fetch the robots.txt of the first 100 or so sites and after that start fetching content and unseen robots.txts for sites still on the queue (just an example).
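
A rough sketch of that scheduling idea, under stated assumptions: RobotsFirstScheduler, schedule and prefetchHosts are made-up names for illustration, not Nutch APIs, and a real fetcher would work on queues rather than a flat list. It plans robots.txt fetches for the first few distinct sites up front, then interleaves content with robots.txt for sites not yet seen.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not Nutch code: builds a fetch order in which
// every site's robots.txt comes before any content from that site.
final class RobotsFirstScheduler {

    static List<String> schedule(List<String> urls, int prefetchHosts) {
        Set<String> seenSites = new HashSet<>();
        List<String> plan = new ArrayList<>();

        // Pass 1: robots.txt for the first prefetchHosts distinct sites.
        for (String url : urls) {
            if (seenSites.size() >= prefetchHosts) {
                break;
            }
            String root = siteRoot(url);
            if (seenSites.add(root)) {
                plan.add(root + "/robots.txt");
            }
        }

        // Pass 2: content, with robots.txt for a still unseen site placed
        // right before the first content URL of that site.
        for (String url : urls) {
            String root = siteRoot(url);
            if (seenSites.add(root)) {
                plan.add(root + "/robots.txt");
            }
            plan.add(url);
        }
        return plan;
    }

    // Scheme plus authority, e.g. "http://a.example" (assumes absolute URLs).
    static String siteRoot(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getAuthority();
    }
}
```

With prefetchHosts set to 100 this matches the example above: the first 100 sites get their robots.txt fetched before any content is requested, and later sites get theirs when they are first reached in the queue.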

Or, as I may prefer, reimplement robots.txt parsing and handling; this would require some more time from me.
In general we should move this discussion to nutch-dev since there are more side effects we should discuss.


Now we have it here.

The new fetcher should be an alternative and we should not just remove the old fetcher.


+1

--
 Sami Siren
