Jens Elkner <[email protected]> writes: > On Sat, Jul 02, 2011 at 06:21:07PM +0200, Knut Anders Hatlen wrote: >> Lubos Kosco <[email protected]> writes: >> >> > Great job guys, looks awesome :) >> > such optimizations would be needed for other analyzers as well ... >> > e.g. in one of the Dougs pictures there were other analyzers taking up >> > a lot of memory too ... >> >> Right, the way the analyzers work currently is sub-optimal in many ways >> when it comes to memory usage: > > Yes. After looking at your changes I was wondering, what lucenene is > doing at all/why it doesn't need all the stuff in one big chunk etc.. > Finally I came to the conclusion, that it should be possible to get > get the thing down to use much less memory and even to utilize > CPU-cores/-threads much better and cutting processing time down to ~50% > (at least if there is an idle core/thread available and indexing time is > about the time for file analyzing/parsing)... > >> 1) They're optimized to minimize reading from disk. But because we do >> two passes per file (one to add it to the Lucene index and one to write >> its xref to disk), the entire file needs to be kept in memory to avoid >> reading it again in the second pass. > > Yes. I also wondered, why it is necessary to keep xref in memory. If we > look at it a little bit closer, addFile and getDocument gives a hint (see > http://src.iws.cs.ovgu.de/source/xref/opengrok/src/org/opensolaris/opengrok/index/IndexDatabase.java#addFile > http://src.iws.cs.ovgu.de/source/xref/opengrok/src/org/opensolaris/opengrok/analysis/AnalyzerGuru.java#getDocument > ). IMHO we don't need a 2nd getGenre() and than write the xrefs, but > pass an appropriate writer like fa.analyze(doc, in, xrefOut). Than > the analyzer can write out the xref stuff immediately and doesn't need > to keep a copy ...
Yes, for JarAnalyzer I think that would work, since it generates the
xref piece by piece, interleaved with the tokenizers. I'm not sure it
would help in the general case, though, since the tokenizer is
typically implemented separately from the xref generator, and they both
need the file contents, so we somehow need to read the contents of the
stream twice. That could be done either as today (by keeping all the
contents of the stream in a buffer, at the expense of higher memory
usage), or by opening two FileReaders (possibly slower because of more
I/O). Yet another alternative is to enhance the xref generators with
the capability to tokenize the stream, so that a single invocation of
the xref generator would produce both the list of symbols and the xref.
That would be a bigger task, since we'd have to rewrite all the xref
generators. Also, I think the current separation is cleaner, since much
of the xref code is messy enough even without giving it more
responsibilities.

>> 2) Each analyzer instance preserves its read buffer for reuse.
>> Because of (1) the read buffer will grow to fit the largest file the
>> analyzer has seen. So once a big file is seen, the read buffer will
>> grow and never shrink again.
>
> Yepp, a reset() method could solve this problem, but:
>
>> 3) The analyzers are cached in thread locals, so the total size of
>> the read buffers may grow up to (number of indexer threads * number
>> of analyzer classes * size of the largest file).
>
> Exactly. Today object creation is pretty fast, and since no complex
> computation is required to set up an analyzer, IMHO there is no need
> at all to complicate it artificially. So I think caching doesn't buy
> much, and all the thread-local stuff (incl. env) is IMHO wrong
> anyway, i.e. it may break things.
>
>> Some possible experiments we could do and see how they affect both
>> memory consumption and indexer performance:
>>
>> a) Read the file from disk again when generating the xref. Then the
>> analysis phase doesn't need to make the read buffer fit the entire
>> file, so we avoid these enormous read buffers.
>>
>> b) Remove the caching of analyzer instances in thread locals.
>> Instead generate a fresh instance when needed. Then the
>> unnecessarily big read buffers can be garbage collected.
>
> Yepp. But I think we can do even better: at least for JarAnalyzer a
> threaded solution seems to be more appropriate, i.e. piping the data
> into the indexer via another thread.
>
> The key is Lucene's DocFieldProcessorPerThread#processDocument() - see
> http://src.iws.cs.ovgu.de/source/xref/lucene-3.0.3/src/java/org/apache/lucene/index/DocFieldProcessorPerThread.java#237
> Here we can see that each field is processed one after the other, and
> processing basically involves obtaining the String|Reader|TokenStream
> associated with the field and indexing it (see
> http://src.iws.cs.ovgu.de/source/xref/lucene-3.0.3/src/java/org/apache/lucene/index/DocInverterPerField.java#61).
>
> So if we manage to get the "full" field (aka fulltext) processed
> first, we do not need to store the fullText at all, just the refs and
> defs lists for the token streams processed later. And as already
> said, if the jar analyzer writes out the xrefs immediately, we do not
> need to store the xrefs either. Both memory hoggers vanish.
>
> Doing so is IMHO pretty easy:
> 1) put the body of JarAnalyzer:analyze in a thread, which fills the
>    fullText, refs and defs stores and writes out the xrefs
> 2) if fullText gets processed before refs and defs, synchronization
>    is needed for fullText only. Here one may use either a
>    PipedReader/Writer or, IMHO better, a LinkedBlockingDeque
> 3) for refs and defs one may use a List<List<String>> to avoid too
>    much growing (as now, many little chunks instead of a single big
>    one)
>
> The tricky part is: Lucene expects the fields to always have the same
> order.
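[The PipedReader/Writer variant of step 2 above can be sketched like
this. It is a minimal stand-alone demo, not OpenGrok or Lucene code:
the producer thread stands in for the analyzer filling the fullText
store, and the consumer loop stands in for Lucene pulling the field's
Reader; the PipeDemo class name is made up for illustration.]

```java
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;

public class PipeDemo {
    public static String produceAndConsume(String fullText)
            throws IOException, InterruptedException {
        PipedWriter writer = new PipedWriter();
        PipedReader reader = new PipedReader(writer);

        // Producer thread: in the proposed design this would be the
        // analyzer writing the full text while it also emits the
        // refs/defs lists and the xref.
        Thread producer = new Thread(() -> {
            try (PipedWriter w = writer) {
                w.write(fullText);
            } catch (IOException ignored) {
                // Demo only; real code should propagate this.
            }
        });
        producer.start();

        // Consumer: reads from the other end of the pipe, the way the
        // indexer would consume the "full" field's Reader.
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        reader.close();
        producer.join();
        return sb.toString();
    }
}
```

The pipe has a fixed internal buffer, so the producer blocks once it
gets ahead of the consumer; only the in-flight chunk is held in memory
rather than the whole file.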
> At one point they've chosen to ensure that automatically by
> re-sorting the fields by their name before they start to process
> them. So either we need to ship Lucene with a slightly modified
> DocFieldProcessorPerThread, or rename the field to 'Full', which
> possibly needs adjustments wrt. searching.
>
> Finally, I guess this would be doable for single files as well and
> would also avoid problems with really big files (e.g. the last bug
> filed wrt. an XML file)...
>
> What do you think?

I'm not sure. It might be the right thing to do for the JarAnalyzer, as
it's in many ways a different beast than the other analyzers. For
example, the problem with analyzing big XML files seems to be limited
to the expansion of the read buffer to hold the entire file. Xrefs are
not kept in memory in that analyzer, or in most other analyzers. I'd
try to solve the general problem with large files in the analyzers
first, and then revisit the jar analyzer later and see if it's still a
problem when that's done.

-- 
Knut Anders

_______________________________________________
opengrok-dev mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opengrok-dev
