Hi all, in my application I often need to run the
inject -> generate -> ... -> index loop, since users can 'suggest'
new web pages to be crawled and indexed.
I also need to enable the language identifier plugin.

Everything seems to work correctly, but after some time I get an
OutOfMemoryError. Actually, elapsed time isn't the real factor: I
noticed that the problem arises when a user submits many URLs
(~100). As I said, for each submitted URL a new loop is performed
(similar to the one in the Crawl.main method).
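
For reference, the per-URL cycle is structured roughly like this (a
minimal sketch; the run* helpers below are hypothetical placeholders
for the corresponding Nutch steps, not actual Nutch API calls):

// Hypothetical driver illustrating the per-URL loop described above.
public class SuggestedUrlCrawler {

  public void crawl(String suggestedUrl) throws Exception {
    runInject(suggestedUrl);        // add the suggested URL to the crawldb
    String segment = runGenerate(); // generate a fetch list (segment)
    runFetch(segment);              // fetch and parse the new pages
    runUpdateDb(segment);           // merge results back into the crawldb
    runIndex(segment);              // index; LanguageIdentifier runs here
  }

  // Placeholder implementations; each wraps the real Nutch step.
  private void runInject(String url) throws Exception { /* ... */ }
  private String runGenerate() throws Exception { return "segment"; }
  private void runFetch(String segment) throws Exception { /* ... */ }
  private void runUpdateDb(String segment) throws Exception { /* ... */ }
  private void runIndex(String segment) throws Exception { /* ... */ }
}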

Using a profiler (specifically, the NetBeans profiler) I found that
a new LanguageIdentifier instance is created for each submitted URL
and never released. With the memory inspector tool I can see as many
LanguageIdentifier and NGramProfile$NGramEntry instances as there
are fetched pages, each occupying about 180 KB. Forcing garbage
collection doesn't release much memory.

LanguageIdentifier has a static class variable 'identifier' that is
never used; reading through the code, it seems the original idea was
to implement a singleton pattern.
So, to limit memory usage, I implemented a static getInstance method
and modified the LanguageIndexingFilter class so that it uses the
singleton.
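
The accessor is essentially the following (a sketch; I assume a
no-arg constructor here, but the real LanguageIdentifier constructor
may take extra arguments depending on the Nutch version):

public class LanguageIdentifier {

  // The pre-existing static field that was never used:
  private static LanguageIdentifier identifier;

  // Lazily create and cache a single shared instance.
  public static synchronized LanguageIdentifier getInstance() {
    if (identifier == null) {
      identifier = new LanguageIdentifier();
    }
    return identifier;
  }
}

LanguageIndexingFilter then calls LanguageIdentifier.getInstance()
instead of constructing a new LanguageIdentifier per document.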
Since I was still getting some strange results from the profiler, I
added a println message to the getInstance method to monitor
singleton creation directly. It turns out that the singleton is
re-instantiated every time!
I can't really understand why this is happening; maybe it is
something related to Hadoop internals?
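
A diagnostic I plan to try (a sketch using only standard JDK calls):
also print the class's ClassLoader and the JVM name inside
getInstance. If each 'singleton' creation reports a different loader
or JVM, the class is simply being re-loaded (e.g. by Nutch's plugin
classloader or by a separate Hadoop task JVM), so a static field can
never be shared between creations:

import java.lang.management.ManagementFactory;

  // Same accessor as above, instrumented to identify the loader/JVM.
  public static synchronized LanguageIdentifier getInstance() {
    if (identifier == null) {
      System.out.println("creating singleton; loader="
          + LanguageIdentifier.class.getClassLoader()
          + " jvm=" + ManagementFactory.getRuntimeMXBean().getName());
      identifier = new LanguageIdentifier();
    }
    return identifier;
  }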

Cheers,
Enrico
