[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486292 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------

Some details on how I measure RAM usage: both the baseline (current Lucene trunk) and my patch have two general classes of RAM usage.

The first class, "document processing RAM", is RAM used while processing a single doc. This RAM is re-used for each document (in the trunk it's GC'd and new RAM is allocated; in my patch I explicitly re-use these objects), and how large it gets is driven by how big each document is.

The second class, "indexed documents RAM", is the RAM used up by previously indexed documents. This RAM grows with each added document, and how large it gets is driven by the number and size of docs indexed since the last flush.

So when I say the writer is allowed to use 32 MB of RAM, I'm only measuring the "indexed documents RAM". With trunk I do this by calling ramSizeInBytes(); with my patch I do the analogous thing by measuring how many RAM buffers are held up storing previously indexed documents.

I then define "RAM efficiency" (docs/MB) as how many docs we can hold in "indexed documents RAM" per MB of RAM, at the point that we flush to disk. I think this is an important metric because it drives how large your initial (level 0) segments are: the larger these segments are, the less merging you generally need to do for a given number of docs in the index.

I also measure overall RAM used in the JVM (using MemoryMXBean.getHeapMemoryUsage().getUsed()) just prior to each flush except the last, to also capture the "document processing RAM", object overhead, etc.
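The two measurements above can be sketched in plain Java. This is a minimal illustration, not code from the patch: ramEfficiency and the 20,000-doc figure are hypothetical, while the MemoryMXBean call is the standard JDK API the comment names.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class RamEfficiency {
    // "RAM efficiency" as defined above: docs held in "indexed
    // documents RAM" per MB of RAM, at the point we flush to disk.
    static double ramEfficiency(long docsAtFlush, long indexedDocsRamBytes) {
        double mb = indexedDocsRamBytes / (1024.0 * 1024.0);
        return docsAtFlush / mb;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 20,000 docs buffered within a 32 MB budget.
        long ramBytes = 32L * 1024 * 1024;
        System.out.printf("RAM efficiency: %.1f docs/MB%n",
                ramEfficiency(20_000, ramBytes)); // 625.0 docs/MB

        // Overall JVM heap in use, sampled just prior to a flush; this
        // also captures "document processing RAM" and object overhead.
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        long heapUsed = bean.getHeapMemoryUsage().getUsed();
        System.out.println("heap used (bytes): " + heapUsed);
    }
}
```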
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
>
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, e.g. how segments are
> merged.
>
> The basic ideas are:
>
> * Write stored fields and term vectors directly to disk (don't
>   use up RAM for these).
>
> * Gather posting lists & term infos in RAM, but periodically do
>   in-RAM merges. Once RAM is full, flush buffers to disk (and
>   merge them later when it's time to make a real segment).
>
> * Recycle objects/buffers to reduce time/stress in GC.
>
> * Other various optimizations.
>
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
>
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

--
This message is automatically generated by JIRA.
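The flush-by-RAM policy the description introduces can be sketched as follows. This is a simplified illustration, not Lucene's implementation: RamFlushPolicy, addDocument, and the byte estimates are all hypothetical; only the trigger condition (flush when buffered RAM reaches the configured budget, rather than after a fixed doc count) mirrors what setRAMBufferSize does.

```java
import java.util.ArrayList;
import java.util.List;

// A minimal sketch (not Lucene code) of flushing by RAM usage instead
// of by a fixed number of buffered documents.
public class RamFlushPolicy {
    private final long ramBufferBytes;   // analogous to setRAMBufferSize
    private long bytesUsed;              // RAM held by buffered docs
    private int bufferedDocs;
    private final List<Integer> flushSizes = new ArrayList<>(); // docs per flush

    public RamFlushPolicy(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    // Hypothetical: we only track each doc's estimated RAM cost.
    public void addDocument(long estimatedBytes) {
        bytesUsed += estimatedBytes;
        bufferedDocs++;
        if (bytesUsed >= ramBufferBytes) {
            flush();
        }
    }

    private void flush() {
        flushSizes.add(bufferedDocs); // a real writer would write a segment here
        bufferedDocs = 0;
        bytesUsed = 0;
    }

    public List<Integer> flushSizes() { return flushSizes; }
    public int bufferedDocs() { return bufferedDocs; }
}
```

With this policy, small documents naturally pack more docs into each flushed segment and large documents fewer, which is what makes docs/MB ("RAM efficiency") the relevant metric rather than a fixed document count.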