I'm running Nutch with the crawl command. So I guess I should proceed
step by step and use the separate commands (inject, generate, fetch,
update, etc.) instead, right?

On 6/20/07, Emmanuel JOKE <[EMAIL PROTECTED]> wrote:
Hi Guys,

I have a cluster of 2 machines. I tried to crawl a website which
contains over 1M pages. I noticed that it takes a few days to complete
the crawl. The logs report 0.5 pages/s at 200 kb/s, which seems very
slow. I would like to try Fetcher2; I guess it might improve the
performance.

It might be a stupid question, but I'm wondering how to set up my Nutch
to use Fetcher2 instead of Fetcher.
Could you help me understand?

Are you running Nutch with the 'crawl' command, with separate commands
(inject, generate, fetch, etc.), or something else?

If you are running separate commands, all you have to do is change
fetch to fetch2.
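
For reference, one step-by-step round with fetch2 swapped in might look roughly like the sketch below. The directory names (crawl/crawldb, crawl/segments, urls) and the helper function names are assumptions for illustration, and the segment lookup with ls only works on a local filesystem; on a DFS-backed cluster the newest segment would have to be found another way. The sketch stops at updatedb and leaves out the link inversion and indexing steps that the crawl command also runs.

```shell
#!/bin/sh
# Sketch of a step-by-step crawl round using fetch2 instead of fetch.
# Assumed layout (not from this thread): <crawl_dir>/crawldb and
# <crawl_dir>/segments on a local filesystem.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher script

# Seed the crawldb from a directory of URL list files (run once).
inject_seeds() {
    $NUTCH inject "$1/crawldb" "$2"
}

# Run one generate / fetch2 / updatedb round.
crawl_round() {
    crawl_dir="$1"
    $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
    # Segment names are timestamps, so the last entry is the newest one.
    segment=$(ls -d "$crawl_dir"/segments/* | tail -1)
    $NUTCH fetch2 "$segment" || return 1
    $NUTCH updatedb "$crawl_dir/crawldb" "$segment"
}

# Example usage:
#   inject_seeds crawl urls
#   crawl_round crawl
```

Calling crawl_round repeatedly corresponds to increasing the depth of the one-shot crawl command, one round per level.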


Besides, what is the usual setting for fetcher.server.delay? I was told
that we should set this property to 1 second, but I can see in
nutch-default.xml that it is set to 5. What is the best value to gain
performance while staying polite enough?

That's kind of between you and the server you are fetching from, but I
wouldn't recommend a delay lower than 5 seconds.
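
If you do want to set the delay explicitly, a minimal sketch of an override in conf/nutch-site.xml (which takes precedence over nutch-default.xml) could look like this. The value shown simply restates the 5-second default mentioned above, and the description text is my own wording rather than the one shipped with Nutch.

```xml
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Delay in seconds; 5.0 matches the default discussed in this thread. -->
    <value>5.0</value>
    <description>Seconds the fetcher waits between successive requests
    to the same server.</description>
  </property>
</configuration>
```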


Any other tricks to gain performance are welcome.

E



--
Doğacan Güney

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
