Plugin repository cache can lead to memory leak
-----------------------------------------------

                 Key: NUTCH-356
                 URL: http://issues.apache.org/jira/browse/NUTCH-356
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.8
            Reporter: Enrico Triolo
         Attachments: NutchTest.java, patch.txt

While trying to solve a problem I reported a while ago (see NUTCH-314), I 
found out that the problem is actually related to the plugin cache used in 
the PluginRepository class.
As I said in NUTCH-314, I think I somehow 'force' the way Nutch is meant to 
work, since I need to frequently submit new URLs and append their contents to 
the index. I don't (and can't) have a urls.txt file listing all the URLs I'm 
going to fetch; instead I recreate it each time a new URL is submitted.
So I think that most of the time you won't have problems using Nutch as-is, 
since the problem I found occurs only if Nutch is used in a way similar to 
mine.
To simplify testing I'm attaching a class that does something similar to 
what I need. It fetches and indexes some sample URLs; to avoid complaints from 
webmasters I left the sample URL list empty, so you should modify the source 
code and add some URLs.
Then you only have to run it and watch its memory consumption with top. In my 
experience I get an OutOfMemoryError after a couple of minutes, but this 
clearly depends on your heap settings and on the plugins you are using (I'm 
using 
'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').

The problem is bound to the PluginRepository 'singleton' instance, since it 
never gets released. It seems that some class holds a reference to it, and 
that class is itself never released because it is cached somewhere in the 
configuration.
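
To illustrate the suspected mechanism, here is a minimal, self-contained sketch 
of a static cache keyed by a configuration object. It is not the Nutch code: 
SketchConfig, SketchRepository and PluginCacheLeakSketch are hypothetical 
names, and the assumption is simply that each fetch/index run builds a fresh 
configuration, so each run adds an entry that nothing ever removes.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a configuration object created once per run. */
final class SketchConfig {
    // Uses identity-based equals/hashCode inherited from Object, so two
    // configurations with identical content are still different cache keys.
}

/** Hypothetical stand-in for a heavyweight plugin repository. */
final class SketchRepository {
    // Simulates the memory held by loaded plugins (assumption for the sketch).
    private final byte[] payload = new byte[1024];

    // Static map with strong references: entries are never evicted.
    private static final Map<SketchConfig, SketchRepository> CACHE = new HashMap<>();

    /** Returns the cached repository for conf, creating it on first use. */
    static synchronized SketchRepository get(SketchConfig conf) {
        return CACHE.computeIfAbsent(conf, c -> new SketchRepository());
    }

    static synchronized int cacheSize() {
        return CACHE.size();
    }
}

final class PluginCacheLeakSketch {
    /** Simulates n runs, each with a new configuration; returns the cache size. */
    static int simulateRuns(int n) {
        for (int i = 0; i < n; i++) {
            SketchConfig conf = new SketchConfig(); // fresh configuration per run
            SketchRepository.get(conf);             // cache gains one entry
        }
        return SketchRepository.cacheSize();
    }

    public static void main(String[] args) {
        System.out.println("cached repositories: " + simulateRuns(5));
    }
}
```

In this sketch the cache grows by one repository per run, which matches the 
steadily increasing memory use described above; whether the real cache behaves 
exactly this way is an assumption.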

So I modified PluginRepository's 'get' method so that it never uses the 
cache and always returns a new instance (see the attached patch). 
With this change memory consumption stays stable and I no longer get OOM 
errors. Clearly this is not a real solution, since I guess it carries a 
significant performance cost, but for the moment it works.
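
The workaround can be sketched roughly as follows. This is not the attached 
patch: RunConfig, UncachedRepository and UncachedWorkaroundSketch are 
hypothetical names, and the construction counter exists only to make the cost 
of skipping the cache visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a configuration object. */
final class RunConfig { }

/** Hypothetical repository whose get() bypasses any cache. */
final class UncachedRepository {
    // Counts constructions so the price of the workaround can be observed.
    private static final AtomicInteger CREATED = new AtomicInteger();

    private UncachedRepository(RunConfig conf) {
        CREATED.incrementAndGet(); // real code would load the plugins here
    }

    /** Always builds a new instance; no static field keeps it reachable. */
    static UncachedRepository get(RunConfig conf) {
        return new UncachedRepository(conf);
    }

    static int created() {
        return CREATED.get();
    }
}

final class UncachedWorkaroundSketch {
    public static void main(String[] args) {
        RunConfig conf = new RunConfig();
        UncachedRepository a = UncachedRepository.get(conf);
        UncachedRepository b = UncachedRepository.get(conf);
        System.out.println("same instance: " + (a == b));
        System.out.println("instances created: " + UncachedRepository.created());
    }
}
```

Because nothing static refers to the returned instances, they can be garbage 
collected once the caller drops them, but every call pays the full 
construction cost, even for the same configuration.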



        
