Re: nutch crawl command takes 98% of cpu

Julien Nioche Fri, 28 Jan 2011 06:01:56 -0800

That's assuming that the problem comes from the parsing.
Alex, can you either run jstack on the process to see what it is hanging on
or do as Chris suggested?
Note that it is not recommended to upgrade to Tika 0.8 if you want to
process PDF docs because of an issue which will be resolved in the next Tika
release.
Another solution - if the problem comes from flv files and you are not
interested in them - is to add a URLFilter which will prevent such files
from being fetched.
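
As a rough illustration of the filtering idea only, here is a minimal
standalone Java sketch. The class name FlvSuffixFilter and the regular
expression are hypothetical and not part of Nutch; the only thing borrowed
from Nutch is the URLFilter convention of returning the URL to accept it and
null to reject it. In practice the stock URL filter plugins are usually
configured through their rule files rather than by writing new code.

```java
import java.util.regex.Pattern;

// Hypothetical standalone sketch, not a real Nutch plugin class.
final class FlvSuffixFilter {
    // Matches a URL whose path ends in ".flv", optionally followed by
    // a query string or fragment. Case-insensitive, so ".FLV" also matches.
    private static final Pattern FLV =
            Pattern.compile("\\.flv([?#].*)?$", Pattern.CASE_INSENSITIVE);

    // Returns the URL unchanged if it may be fetched, or null to reject it.
    static String filter(String url) {
        if (url == null) {
            return null;
        }
        return FLV.matcher(url).find() ? null : url;
    }

    public static void main(String[] args) {
        // Rejected: prints "null"
        System.out.println(filter("http://example.com/video.flv"));
        // Accepted: prints the URL unchanged
        System.out.println(filter("http://example.com/index.html"));
    }
}
```

The same check wired into a real plugin would have to implement the Nutch
URLFilter interface and be enabled in the plugin configuration, which this
sketch does not attempt.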


Julien

On 28 January 2011 03:32, Alexis <[email protected]> wrote:

> Hi,
>
> I ran into the same issue as well with Nutch 1.2. You could fix it by
> upgrading the version of tika parser to at least 0.8. The lib can be
> found in the plugins/parse-tika/ directory of your Nutch release.
>
> This has already been mentioned twice in the mailing-list: See
> http://lucene.472066.n3.nabble.com/Full-CPU-usage-td1976780.html
>
> I hope this will help you out.
>
> Alexis
>
> On Fri, Jan 28, 2011 at 1:01 AM, Chris Woolum <[email protected]>
> wrote:
> > If you are looking at the tasktracker control panel, what does it show?
> > The link is http://localhost:50030
> >
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Thursday, January 27, 2011 3:01 PM
> > To: [email protected]
> > Subject: nutch crawl command takes 98% of cpu
> >
> > Hello,
> >
> > I run the crawl command with -depth 7 -topN -1 on my Linux box with 1.5 Mbps
> > internet, AMD 3.1 GHz processor, 4 GB memory, Fedora Linux 14, Nutch 1.2.
> > After 1-2 days Nutch takes 98% of CPU. My seed file includes about 3500
> > domains and I put fetch.external links to false.
> >
> > Is this normal? If not, what can be done to improve it?
> >
> > Thanks.
> > Alex.
> >
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
