[Nutch-general] Re: Faster UpdateDB

Andy Liu Sun, 02 Oct 2005 13:46:29 -0700

If you have a lot of regular expressions in your crawl-urlfilter.txt file,
that's probably what's making updatedb so slow, since each URL has to be
tested against them. If you're just filtering against a list of domains, I
believe a new domain URL filter was just added to JIRA; it caches domain
names and speeds things up considerably.
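
To illustrate why a cached domain list can beat a long list of regular
expressions, here is a minimal sketch. It is not the filter from JIRA; the
class name DomainFilterSketch and its accept method are hypothetical and
only show the general idea: keep the allowed domains in a hash set and look
up the host and its parent domains, rather than running every pattern
against every URL.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; this is NOT the Nutch filter from JIRA.
class DomainFilterSketch {
    // Allowed domains, lower-cased, held in memory for constant-time lookup.
    private final Set<String> allowed = new HashSet<>();

    DomainFilterSketch(List<String> domains) {
        for (String d : domains) {
            allowed.add(d.toLowerCase(Locale.ROOT));
        }
    }

    // Returns true if the URL's host is an allowed domain or a subdomain of one.
    boolean accept(String url) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // unparsable URL: reject
        }
        if (host == null) {
            return false; // no host component: reject
        }
        host = host.toLowerCase(Locale.ROOT);
        // Check the host, then each parent domain:
        // www.example.com -> example.com -> com
        while (true) {
            if (allowed.contains(host)) {
                return true;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return false;
            }
            host = host.substring(dot + 1);
        }
    }

    public static void main(String[] args) {
        DomainFilterSketch f =
            new DomainFilterSketch(List.of("example.com", "apache.org"));
        System.out.println(f.accept("http://www.example.com/page.html"));
        System.out.println(f.accept("http://other.net/"));
    }
}
```

The cost per URL here depends on the number of labels in the host name, not
on the number of allowed domains, whereas a regex list is typically tried
pattern by pattern.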


Andy

On 9/30/05, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> You can try experimenting with the settings in nutch-config.xml.
> Allowing more open file streams or more cache for sorting may help,
> but may also crash the system because of too many open files (under Unix
> the limit can be configured).
> HTH
> Stefan
>
> On 30.09.2005 at 18:31, Jon Shoberg wrote:
>
> > Calling UpdateDB for my segments (500K) is pretty slow, as a
> > relative observation.
> >
> > Aside from bigger hardware, is there anything that can be done to
> > speed up the update process? Can multiple segments update the DB
> > at the same time?
> >
> > Any optimizations or suggested usages?
> >
> > thanks
> > -j
> >
> >
> >
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
>
>
>
>


--
Andy Liu
[EMAIL PROTECTED]
(301) 873-8458
