[Nutch-general] Re: Faster UpdateDB

Gal Nitzan Sun, 02 Oct 2005 11:06:56 -0700

Hi Jon,

If I understand correctly, that means 26 regex matcher calls for each URL!


Gal

Jon Shoberg wrote:

I'm using the whole web crawl strategy with Nutch 0.7.

I have 26 statements in regex-urlfilter.txt.

There are 6 regexes in regex-normalize.xml.

-j


Andy Liu wrote:

If you have a lot of regular expressions in your crawl-urlfilter.txt file,
that's probably what's making updatedb so slow. If you're just filtering
against a list of domains, I believe there's a new domain URL filter that
was just added to JIRA which caches domain names and speeds things up
considerably.

Andy
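
To illustrate the difference Andy describes, here is a minimal Python sketch (not Nutch code, and not the filter from JIRA). It assumes a regex filter that tries its rules in order and lets the first match decide, and compares it with a lookup in a set of allowed domains. The rule list, the example domains, and the function names are all hypothetical. The set-based version only covers the domain allow-list part, not suffix rules such as rejecting images.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rules in the spirit of a regex URL filter file:
# '+' accepts, '-' rejects, and the first matching rule decides.
RULES = [
    ("-", re.compile(r"^(file|ftp|mailto):")),
    ("-", re.compile(r"\.(gif|jpg|png|css|zip|exe)$", re.IGNORECASE)),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.org/")),
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")),
    ("-", re.compile(r".")),  # catch-all: reject everything else
]


def regex_filter(url):
    """Return (accepted, regex_calls) for one URL."""
    for calls, (sign, pattern) in enumerate(RULES, start=1):
        if pattern.search(url):
            return sign == "+", calls
    return False, len(RULES)


# Hypothetical allow-list; each membership test is one hash lookup.
ALLOWED_DOMAINS = frozenset({"example.org", "example.com"})


def domain_filter(url):
    """Accept a URL if its host or any parent domain is in the allow-list."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    return any(
        ".".join(labels[i:]) in ALLOWED_DOMAINS for i in range(len(labels))
    )


if __name__ == "__main__":
    for url in ("http://www.example.com/page.html", "http://other.net/"):
        accepted, calls = regex_filter(url)
        print(url, accepted, calls, domain_filter(url))
```

In this sketch a rejected URL falls through every rule before the catch-all fires, so the cost per URL grows with the number of rules, while the domain lookup cost depends only on the number of labels in the host name.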

On 9/30/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:

You can try experimenting with the settings in nutch-config.xml.
More open file streams, more cache for sorting, and things like that may
help, but they may also crash the system because of too many open files
(under Unix this limit can be configured).
HTH
Stefan
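
Before raising any open-file settings, it may help to check the per-process limit Stefan mentions. The following is a small sketch using Python's standard `resource` module, which is available on Unix only; the helper names are made up for this example, and it is not tied to any Nutch setting.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {describe(soft)}, hard limit: {describe(hard)}")
```

The soft limit is the one a process actually runs into; it can be raised up to the hard limit without special privileges.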

On 30.09.2005 at 18:31, Jon Shoberg wrote:

Calling UpdateDB for my segments (500K) is pretty slow, as a
relative observation.

Aside from bigger hardware, is there anything that can be done to
speed up the update process? Can multiple segments update the DB
at the same time?

Any optimizations or suggested usages?

thanks
-j




