Version: Nutch 0.9 (but this applies to just about all versions)

I'm really in a bind.
Is anyone crawling from within a web application, or is everyone running Nutch using the shell scripts provided? I am trying to write a web application around the Nutch crawling facilities, but there seem to be huge memory issues when doing this. The container (Tomcat 5.5.17 with 1.5 GB of memory allocated, and 128K on the stack) runs out of memory in less than an hour.

When profiling version 0.7.2 we can see a pool of objects that grows constantly but never gets garbage collected. Even when the crawl is finished, these objects just hang around forever, until we get the wonderful: java.lang.OutOfMemoryError: PermGen space. I updated the application to use Nutch 0.9 and the problem got much worse (it used to run for about 16 hours; now it runs out of memory in 20 minutes).

We were using 5 concurrent crawlers, meaning we have Crawl.main running 5 times within the application. The current design is/was to have an event happen within the system, which fires off a crawler (currently it just calls org.apache.nutch.crawl.Crawl.main()). This has caused nothing but grief. We need to have several crawlers running concurrently, and we don't want large 'batch' jobs. The requirement is to crawl a domain as it comes into the system, not to wait hours or days for a job to run.

Has anyone else attempted to run the crawl in this manner? Have you run into the same problems? Does controlling the fetcher and all the other instances needed for crawling directly, instead of going through Crawl.main(), solve this issue? From what I have seen in the past, there is nothing in org.apache.nutch.crawl.Crawl itself that would cause such a memory leak, so it must be somewhere deeper in the code. Since Nutch handles so much of its own threading, could that be causing the problem?

I am not sure if I should x-post this to the dev group or not. Anyway, thanks.
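P.S. To make the design concrete, here is a minimal sketch of the event-driven dispatch described above. It is not our actual code: CrawlDispatcher, CrawlEntryPoint, onDomainEvent and shutdownAndWait are hypothetical names made up for this example, and the "-dir" argument is what I believe the crawl command accepts, so treat it as an assumption. The fixed pool size stands in for our 5 concurrent crawlers.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical dispatcher: one crawl per incoming domain event, bounded concurrency. */
public class CrawlDispatcher {

    /** Stand-in for the crawl entry point (in our case, Crawl.main(String[])). */
    public interface CrawlEntryPoint {
        void run(String[] args) throws Exception;
    }

    private final ExecutorService pool;
    private final CrawlEntryPoint entryPoint;

    public CrawlDispatcher(int maxConcurrentCrawls, CrawlEntryPoint entryPoint) {
        // At most maxConcurrentCrawls crawls run at once; further events queue up.
        this.pool = Executors.newFixedThreadPool(maxConcurrentCrawls);
        this.entryPoint = entryPoint;
    }

    /** Called when a new domain enters the system. */
    public void onDomainEvent(final String seedDir, final String outputDir) {
        // Assumed argument layout: <seed url dir> -dir <output dir>
        final String[] args = { seedDir, "-dir", outputDir };
        pool.submit(new Runnable() {
            public void run() {
                try {
                    entryPoint.run(args);
                } catch (Exception e) {
                    // A failed crawl should not take down the worker thread.
                    e.printStackTrace();
                }
            }
        });
    }

    /** Stops accepting events and waits for running and queued crawls to finish. */
    public boolean shutdownAndWait(long timeoutSeconds) throws InterruptedException {
        pool.shutdown();
        return pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
    }
}
```

In the web application, the entry point passed to the constructor would simply delegate to org.apache.nutch.crawl.Crawl.main(args). Every crawl therefore runs inside the same JVM (and the same class loader) as the container, which is where the objects described above accumulate.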
Briggs
-- "Conscious decisions by conscious minds are what make reality real"
