Stefan Groschupf wrote:
Hi,
I have some code using a queue-based mechanism and Java NIO.
In my tests it is 4 times faster than the existing fetcher.
But:
+ I need to fix some more bugs
+ we need to refactor the robots.txt part, since it is not yet usable
outside the HTTP protocol.
IMO the politeness code should also be taken out of http
and made protocol independent.
+ the fetcher does not support pluggable protocols, only HTTP.
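
To make the "protocol independent politeness" idea concrete, here is a minimal sketch, not the existing fetcher code. It assumes politeness means one fixed delay per host, keyed only on the host name so that http and ftp URLs for the same host share a queue. The class name PolitenessQueues and its methods are hypothetical.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: per-host politeness that knows nothing about HTTP. */
public class PolitenessQueues {
    private final long delayMillis;
    // One FIFO queue per host; insertion order is kept so polling is predictable.
    private final Map<String, Queue<URI>> queues = new LinkedHashMap<>();
    // Earliest time (in millis) at which each host may be contacted again.
    private final Map<String, Long> nextAllowed = new HashMap<>();

    public PolitenessQueues(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    /** Queues a URI under its host, whatever the scheme is. */
    public void add(URI uri) {
        String host = uri.getHost() == null ? "" : uri.getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(uri);
    }

    /** Returns a URI whose host may be contacted at time 'now', or null if none is ready. */
    public URI poll(long now) {
        for (Map.Entry<String, Queue<URI>> e : queues.entrySet()) {
            Queue<URI> q = e.getValue();
            if (q.isEmpty()) {
                continue;
            }
            long ready = nextAllowed.getOrDefault(e.getKey(), 0L);
            if (now >= ready) {
                // Block this host until the delay has passed.
                nextAllowed.put(e.getKey(), now + delayMillis);
                return q.poll();
            }
        }
        return null;
    }
}
```

The clock is passed in as a parameter, so a fetcher thread would call poll(System.currentTimeMillis()) and hand the result to whichever protocol plugin matches the URI scheme.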
I see two ways to go.
Refactor the existing robots.txt parser and handler, but this is a big
change.
We should do the refactoring, because it would also greatly benefit the
current fetcher if we could schedule fetching of robots.txt before we try
to get the content itself, e.g. fetch robots.txt for the first 100 sites
and after that start fetching content and unseen robots.txt files for sites
still on the queue (just an example).
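
A rough sketch of that scheduling idea, under stated assumptions: the fetch list is a plain list of http URLs, a "site" is scheme plus authority, and robots.txt lives at the site root. RobotsFirstPlanner and its methods are hypothetical names, not part of the current fetcher.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: order a fetch list so robots.txt comes before content. */
public class RobotsFirstPlanner {

    /**
     * Phase 1 schedules robots.txt for the first 'warmupSites' distinct sites.
     * Phase 2 schedules content, inserting robots.txt for any site not yet
     * covered right before that site's first content URL.
     */
    public static List<URI> plan(List<URI> content, int warmupSites) {
        // Distinct sites in the order they first appear in the fetch list.
        Set<String> sites = new LinkedHashSet<>();
        for (URI u : content) {
            sites.add(siteOf(u));
        }

        List<URI> plan = new ArrayList<>();
        Set<String> scheduled = new HashSet<>();

        for (String site : sites) {
            if (scheduled.size() >= warmupSites) {
                break;
            }
            plan.add(URI.create(site + "/robots.txt"));
            scheduled.add(site);
        }

        for (URI u : content) {
            String site = siteOf(u);
            if (scheduled.add(site)) {
                // First time we see this site: its robots.txt goes first.
                plan.add(URI.create(site + "/robots.txt"));
            }
            plan.add(u);
        }
        return plan;
    }

    /** Assumption: a site is identified by scheme plus authority. */
    static String siteOf(URI u) {
        return u.getScheme() + "://" + u.getAuthority();
    }
}
```

With the example above, calling plan(urls, 100) would put up to 100 robots.txt fetches at the head of the list, then the content, with robots.txt for later sites mixed in just before they are needed.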
Or, as I might prefer, reimplement robots.txt parsing and handling; this
would require some more time from me.
In general we should move this discussion to nutch-dev, since there
are more side effects we should discuss.
Now we have it here.
The new fetcher should be an alternative and we should not just remove
the old fetcher.
+1
--
Sami Siren