On Sat, Jul 02, 2011 at 06:21:07PM +0200, Knut Anders Hatlen wrote:
> Lubos Kosco <[email protected]> writes:
> 
> > Great job guys, looks awesome :)
> > such optimizations would be needed for other analyzers as well ...
> > e.g. in one of the Dougs pictures there were other analyzers taking up
> > a lot of memory too ...
> 
> Right, the way the analyzers work currently is sub-optimal in many ways
> when it comes to memory usage:

Yes. After looking at your changes I was wondering what Lucene is
doing at all / why it doesn't need all the stuff in one big chunk etc.
Finally I came to the conclusion that it should be possible to get
the thing down to use much less memory, to utilize
CPU cores/threads much better, and to cut processing time down to ~50%
(at least if there is an idle core/thread available and indexing time is
dominated by file analyzing/parsing)...
  
> 1) They're optimized to minimize reading from disk. But because we do
> two passes per file (one to add it to the Lucene index and one to write
> its xref to disk), the entire file needs to be kept in memory to avoid
> reading it again in the second pass.

Yes. I also wondered why it is necessary to keep the xref in memory. If we
look at it a little bit closer, addFile and getDocument give a hint (see
http://src.iws.cs.ovgu.de/source/xref/opengrok/src/org/opensolaris/opengrok/index/IndexDatabase.java#addFile
http://src.iws.cs.ovgu.de/source/xref/opengrok/src/org/opensolaris/opengrok/analysis/AnalyzerGuru.java#getDocument
). IMHO we don't need a 2nd getGenre() and then write the xrefs;
instead we could pass an appropriate writer like fa.analyze(doc, in, xrefOut).
Then the analyzer can write out the xref stuff immediately and doesn't need
to keep a copy ...
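To make the idea concrete, here is a minimal sketch of what such a
single-pass analyze() could look like. The class and method names are
hypothetical (they only mirror the proposed fa.analyze(doc, in, xrefOut)
signature, not the actual OpenGrok API), and the xref format is
deliberately trivial:

```java
import java.io.*;

// Hypothetical sketch: analyze() gets the xref writer passed in, so the
// xref is emitted while the input is read -- one pass, no full-file buffer.
public class StreamingAnalyzer {

    public void analyze(BufferedReader in, Writer xrefOut) throws IOException {
        String line;
        int lineNo = 1;
        while ((line = in.readLine()) != null) {
            // here the real analyzer would hand tokens to Lucene ...
            // ... and write the xref for this line immediately:
            xrefOut.write(lineNo++ + ": " + line + "\n");
        }
        xrefOut.flush();
    }

    public static void main(String[] args) throws IOException {
        StringWriter xref = new StringWriter();
        new StreamingAnalyzer().analyze(
            new BufferedReader(new StringReader("foo\nbar")), xref);
        System.out.print(xref);
    }
}
```

The point is only that the writer is a parameter: once the caller owns
the output destination, the analyzer has no reason to keep a copy.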

> 2) Each analyzer instance preserves its read buffer for reuse. Because
> of (1) the read buffer will grow to fit the largest file the analyzer
> has seen. So once a big file is seen, the read buffer will grow and
> never shrink again.

Yepp, a reset() method could solve this problem, but:

> 3) The analyzers are cached in thread locals, so the total size of the
> read buffers may grow up to (number of indexer threads * number of
> analyzer classes * size of the largest file).

Exactly. Today object creation is pretty fast, and since no complex
computation is required to set up an analyzer, IMHO there is no need at all
to complicate it artificially. So I think caching doesn't buy much,
and all the thread-local stuff (incl. env) is IMHO wrong anyway, i.e.
may break things.
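The retention problem is easy to demonstrate with plain JDK classes. In
this toy sketch (the Analyzer class and its grow-only buffer are made up
for illustration, they only model the behaviour described above), the
thread-local instance pins its largest-ever buffer for the thread's
lifetime, while a fresh instance per file lets the buffer be collected:

```java
// Toy model of the memory-retention issue: a thread-local analyzer keeps
// its grown read buffer alive for the thread's lifetime; a fresh instance
// per file makes the big buffer garbage as soon as the file is done.
public class AnalyzerCaching {
    static class Analyzer {
        char[] buf = new char[0];
        void read(int fileSize) {                 // grow-only, as today
            if (buf.length < fileSize) buf = new char[fileSize];
        }
    }

    // today: one cached instance per thread, buffer never shrinks
    static final ThreadLocal<Analyzer> CACHED =
            ThreadLocal.withInitial(Analyzer::new);

    public static void main(String[] args) {
        CACHED.get().read(100_000);               // one big file ...
        CACHED.get().read(1024);                  // ... later files still pay
        System.out.println("cached buffer: " + CACHED.get().buf.length);

        // proposed: fresh instance per file -> only as big as needed
        Analyzer fresh = new Analyzer();
        fresh.read(1024);
        System.out.println("fresh buffer: " + fresh.buf.length);
    }
}
```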

> Some possible experiments we could do and see how they affect both
> memory consumption and indexer performance:
> 
> a) Read the file from disk again when generating the xref. Then the
> analysis phase doesn't need to make the read buffer fit the entire file,
> so we avoid these enormous read buffers.
> 
> b) Remove the caching of analyzer instances in thread locals. Instead
> generate a fresh instance when needed. Then the unnecessarily big read
> buffers can be garbage collected.

Yepp. But I think we can do even better: at least for JarAnalyzer
a threaded solution seems more appropriate, i.e. piping the data
into the indexer via another thread.
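A minimal sketch of the piping idea, using only JDK classes (the
"indexer" side is just a stand-in loop, not the real Lucene consumer):
a producer thread writes the extracted text into a PipedWriter while
the consumer reads from the matching PipedReader, so nothing is held
beyond the pipe's small internal buffer.

```java
import java.io.*;

// Producer/consumer via a pipe: the analyzer thread streams its output,
// the indexer thread consumes it as it arrives.
public class PipedIndexing {
    public static void main(String[] args) throws Exception {
        PipedWriter out = new PipedWriter();
        PipedReader in = new PipedReader(out);

        Thread producer = new Thread(() -> {
            try (Writer w = out) {
                w.write("class Foo { }");         // analyzer output
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        });
        producer.start();

        // consumer (stand-in for the indexer): read the stream as it arrives
        StringBuilder indexed = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) indexed.append((char) c);
        producer.join();
        System.out.println(indexed);
    }
}
```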

The key is lucene's DocFieldProcessorPerThread#processDocument() - see
http://src.iws.cs.ovgu.de/source/xref/lucene-3.0.3/src/java/org/apache/lucene/index/DocFieldProcessorPerThread.java#237
Here we can see that each field is processed one after the other, and
processing basically involves obtaining the String|Reader|TokenStream
associated with the field and indexing it (see
http://src.iws.cs.ovgu.de/source/xref/lucene-3.0.3/src/java/org/apache/lucene/index/DocInverterPerField.java#61).

So if we manage to get the "full" field (aka fulltext) processed
first, we do not need to store the fullText at all, just the refs and
defs lists for the token streams processed later. And as already said,
if the jar analyzer writes out the xrefs immediately, we do not need to
store the xrefs either. Both memory hoggers vanish.

Doing so is IMHO pretty easy:
1) put the body of JarAnalyzer:analyze in a thread, which fills the
   fullText, refs and defs stores and writes out the xrefs
2) if fullText gets processed before refs and defs, synchronization is
   needed for fullText only. Here one may use either a
   PipedReader/Writer or, IMHO better, a LinkedBlockingDeque
3) for refs and defs one may use a List<List<String>> to avoid too much
   growing (i.e. many little chunks instead of a single big one)
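The three steps above can be sketched roughly as follows. This is a toy
model under stated assumptions: the chunk strings, the EOF sentinel and
the stand-in "indexer" loop are invented for illustration, only the
LinkedBlockingDeque hand-off and the List<List<String>> chunking match
the proposal.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of steps 1-3: the analyzer thread pushes fullText chunks into a
// bounded LinkedBlockingDeque consumed first by the indexer (step 2),
// while refs are collected as many small lists instead of one big,
// ever-growing array (step 3).
public class ChunkedHandoff {
    static final String EOF = "\u0000EOF";        // sentinel ending the stream

    public static void main(String[] args) throws Exception {
        BlockingDeque<String> fullText = new LinkedBlockingDeque<>(16);
        List<List<String>> refs = new ArrayList<>();

        Thread analyzer = new Thread(() -> {      // step 1: analyze in a thread
            try {
                fullText.put("int main() {");
                fullText.put("}");
                refs.add(Arrays.asList("main")); // one small chunk per block
                fullText.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        analyzer.start();

        // indexer side: consume fullText first, then refs/defs
        StringBuilder doc = new StringBuilder();
        for (String chunk; !(chunk = fullText.take()).equals(EOF); )
            doc.append(chunk).append('\n');
        analyzer.join();                          // refs are safe to read now
        System.out.print(doc);
        System.out.println("refs chunks: " + refs.size());
    }
}
```

The bounded deque also gives natural back-pressure: the analyzer blocks
when the indexer falls behind, so memory stays capped.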

The tricky part is: Lucene expects the fields to always come in the same
order. At some point they chose to ensure that automatically by
re-sorting the fields by their name before they start to process them.
So either we need to ship Lucene with a slightly modified
DocFieldProcessorPerThread, or we rename the field to 'Full', which possibly
needs adjustments wrt. searching.

Finally, I guess this would be doable for single files as well and would
also avoid problems with really big files (e.g. the last bug filed wrt.
an XML file)...

What do you think?

Regards,
jel.
-- 
Otto-von-Guericke University     http://www.cs.uni-magdeburg.de/
Department of Computer Science   Geb. 29 R 027, Universitaetsplatz 2
39106 Magdeburg, Germany         Tel: +49 391 67 12768
_______________________________________________
opengrok-dev mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opengrok-dev
