Hi all,

in my application I often need to run the inject -> generate -> ... -> index loop multiple times, since users can 'suggest' new web pages to be crawled and indexed. I also need to enable the language-identifier plugin.
Everything seems to work correctly, but after some time I get an OutOfMemoryError. The elapsed time itself isn't the issue: I noticed that the problem arises when users submit many URLs (~100). As I said, a new loop is performed for each submitted URL (similar to the one in the Crawl.main method).

Using a profiler (specifically, the NetBeans profiler) I found out that a new LanguageIdentifier instance is created for each submitted URL and never released. With the memory inspector tool I can see as many instances of LanguageIdentifier and NGramProfile$NGramEntry as the number of fetched pages, each of them occupying about 180 KB. Forcing garbage collection doesn't release much memory.

LanguageIdentifier has a static class variable 'identifier' that is never used; reading through the code, it seems the original idea was to implement a singleton pattern. So, to limit memory usage, I implemented a static getInstance method and modified the LanguageIndexingFilter class to use the singleton. Since I was still getting strange results from the profiler, I added a println message to getInstance so I could actually monitor singleton creation. It turns out that the singleton is re-instantiated every time! I can't really understand why this is happening; maybe it's something related to Hadoop internals?
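For reference, here is the gist of my change (a simplified sketch, not my exact diff; I'm assuming the constructor that takes a Hadoop Configuration, and the field name in the filter is just illustrative):

  // Inside org.apache.nutch.analysis.lang.LanguageIdentifier
  import org.apache.hadoop.conf.Configuration;

  public class LanguageIdentifier {

    // The previously unused static field, now holding the shared instance.
    private static LanguageIdentifier identifier;

    public static synchronized LanguageIdentifier getInstance(Configuration conf) {
      if (identifier == null) {
        // Debug trace to see how often the instance is actually created:
        // I expected this to print once, but it prints again for every page.
        System.out.println("LanguageIdentifier: creating singleton instance");
        identifier = new LanguageIdentifier(conf);
      }
      return identifier;
    }

    // ... existing constructor and the rest of the class unchanged ...
  }

And in LanguageIndexingFilter I replaced the direct construction with:

  // instead of: this.languageIdentifier = new LanguageIdentifier(conf);
  this.languageIdentifier = LanguageIdentifier.getInstance(conf);

Cheers,
Enrico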