Version:  Nutch 0.9 (but this applies to just about all versions)

I'm really in a bind.

Is anyone crawling from within a web application, or is everyone
running Nutch using the shell scripts provided?  I am trying to write
a web application around the Nutch crawling facilities, but there
seem to be huge memory issues when doing this.  The container
(Tomcat 5.5.17 with 1.5 gigs of memory allocated, and 128K on the
stack) runs out of memory in less than an hour.  When profiling
version 0.7.2 we can see a pool of objects that grows steadily but
never gets garbage collected.  Even when the crawl is finished, these
objects just hang around forever, until we get the wonderful:
java.lang.OutOfMemoryError: PermGen space.  I updated the application
to use Nutch 0.9 and the problem got about 80x worse (it used to run
for about 16 hours, now it runs out of memory in 20 minutes).  We
were using 5 concurrent crawlers, meaning we have Crawl.main()
running 5 times within the application.
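
For anyone who wants to watch the same growth without a profiler, here
is a minimal sketch that dumps the used bytes of every JVM memory pool
through the standard java.lang.management API.  The class and method
names (MemoryPoolSnapshot, usedBytesByPool) are made up for this
example, and the pool names reported vary by JVM and collector, so the
permanent generation may show up under slightly different labels.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch or Tomcat.
public class MemoryPoolSnapshot {

    /** Returns the bytes currently used in each JVM memory pool, keyed by pool name. */
    public static Map<String, Long> usedBytesByPool() {
        Map<String, Long> result = new LinkedHashMap<String, Long>();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage != null) { // getUsage() returns null if the pool is no longer valid
                result.put(pool.getName(), usage.getUsed());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print one line per pool; call this before and after a crawl to compare.
        for (Map.Entry<String, Long> e : usedBytesByPool().entrySet()) {
            System.out.println(e.getKey() + ": " + (e.getValue() / 1024) + " KB used");
        }
    }
}
```

Logging this snapshot after each crawl finishes should show whether the
permanent generation keeps climbing while the other pools get collected.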

The current design is to have an event in the system fire off a
crawler (currently it just calls org.apache.nutch.crawl.Crawl.main()).
This has caused nothing but grief.  We need to have several crawlers
running concurrently, and we don't want large 'batch' jobs.  The
requirement is to crawl a domain as it comes into the system, not to
wait hours or days for a job to run.
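
To make the design concrete, here is a rough sketch of the dispatching
side, assuming a fixed pool of worker threads.  CrawlDispatcher,
CrawlTask, onDomainArrived and shutdownAndWait are names invented for
this example; CrawlTask is only a stand-in for whatever runs the crawl,
and no Nutch classes are used so the sketch stands on its own.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical dispatcher: one crawl per incoming domain, bounded concurrency.
public class CrawlDispatcher {

    /** Stand-in for the code that actually performs a crawl (an assumption, not a Nutch API). */
    public interface CrawlTask {
        void crawl(String domain) throws Exception;
    }

    private final ExecutorService pool;
    private final CrawlTask task;

    public CrawlDispatcher(int maxConcurrentCrawls, CrawlTask task) {
        // At most maxConcurrentCrawls crawls run at once; extra domains wait in the queue.
        this.pool = Executors.newFixedThreadPool(maxConcurrentCrawls);
        this.task = task;
    }

    /** Called when a domain enters the system; returns at once while the crawl runs on a pool thread. */
    public Future<?> onDomainArrived(final String domain) {
        return pool.submit(new Runnable() {
            public void run() {
                try {
                    task.crawl(domain);
                } catch (Exception e) {
                    // Surfaces through Future.get() as an ExecutionException.
                    throw new RuntimeException("Crawl failed for " + domain, e);
                }
            }
        });
    }

    /** Stops accepting new domains and waits for queued and running crawls to finish. */
    public boolean shutdownAndWait(long timeout, TimeUnit unit) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeout, unit);
    }
}
```

In our application the CrawlTask would be the piece that calls
org.apache.nutch.crawl.Crawl.main() with the arguments for that domain,
and the pool size would be 5.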

Has anyone else attempted to run the crawl in this manner?  Have you
run into the same problems?  Does controlling the fetcher and the
other components needed for crawling directly, rather than going
through Crawl.main(), solve this issue?  From what I have seen, there
is nothing in the org.apache.nutch.crawl.Crawl class itself that
would cause such a memory leak.  It must be somewhere further down in
the code.

Since Nutch handles so much of its own threading, could that be causing the problem?

I am not sure if I should x-post this to the dev group or not.

Anyway, thanks.

Briggs



-- 
"Conscious decisions by conscious minds are what make reality real"

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
