Hi Michael,

I know you've got your hands full, but was wondering if you could either post your benchmark code, or better yet, hook it into the benchmarker contrib (it is quite easy).

Let me know if I can help,
Grant

On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:


[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------

OK I ran tests comparing analyzer performance.

It's the same test framework as above, using the ~5,500 byte Europarl
docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
vectors, and CFS=false, indexing 200,000 documents.

The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
GC cost by not allocating a Term or String for every token in every
document.
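The GC-avoidance idea behind such an analyzer can be sketched in plain Java. This is not the actual SimpleSpaceAnalyzer (its source isn't shown here); it's a toy whitespace tokenizer, with all names illustrative, that reuses one char[] buffer across tokens instead of allocating a new String or Term per token:

```java
// Toy sketch, not Lucene code: a whitespace tokenizer that avoids
// per-token String allocation by writing each token into a reused buffer.
public class ReusableWhitespaceTokenizer {
    private final char[] text;
    private int pos = 0;
    public final char[] termBuffer = new char[255]; // reused for every token
    public int termLength = 0;

    public ReusableWhitespaceTokenizer(String input) {
        this.text = input.toCharArray();
    }

    /** Advances to the next token; returns false at end of input. */
    public boolean next() {
        while (pos < text.length && Character.isWhitespace(text[pos])) pos++;
        if (pos >= text.length) return false;
        termLength = 0;
        while (pos < text.length && !Character.isWhitespace(text[pos])) {
            if (termLength < termBuffer.length) termBuffer[termLength++] = text[pos];
            pos++;
        }
        return true;
    }
}
```

The caller reads the token directly from termBuffer/termLength, so nothing is allocated per token on the hot path.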

Each run is best time of 2 runs:

  ANALYZER            PATCH (sec) TRUNK (sec)  SPEEDUP
  SimpleSpaceAnalyzer  79.0       326.5        4.1 X
  StandardAnalyzer    449.0       674.1        1.5 X
  WhitespaceAnalyzer  104.0       338.9        3.3 X
  SimpleAnalyzer      104.7       328.0        3.1 X

StandardAnalyzer is definitely rather time consuming!


improve how IndexWriter uses RAM to buffer added documents
----------------------------------------------------------

                Key: LUCENE-843
                URL: https://issues.apache.org/jira/browse/LUCENE-843
            Project: Lucene - Java
         Issue Type: Improvement
         Components: Index
   Affects Versions: 2.2
           Reporter: Michael McCandless
           Assignee: Michael McCandless
           Priority: Minor
Attachments: index.presharedstores.cfs.zip, index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, LUCENE-843.take9.patch


I'm working on a new class (MultiDocumentWriter) that writes more than
one document directly into a single Lucene segment, more efficiently
than the current approach.
This only affects the creation of an initial segment from added
documents. I haven't changed anything after that, eg how segments are
merged.
The basic ideas are:
  * Write stored fields and term vectors directly to disk (don't
    use up RAM for these).
  * Gather posting lists & term infos in RAM, but periodically do
    in-RAM merges.  Once RAM is full, flush buffers to disk (and
    merge them later when it's time to make a real segment).
  * Recycle objects/buffers to reduce time/stress in GC.
  * Other various optimizations.
Some of these changes are similar to how KinoSearch builds a segment.
But, I haven't made any changes to Lucene's file format nor added
requirements for a global fields schema.
So far the only externally visible change is a new method
"setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
deprecated), so that it flushes according to RAM usage and not a fixed
number of documents added.
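The flush-by-RAM-usage behavior can be sketched with a toy stand-in class (not Lucene's IndexWriter; class, method, and the per-doc RAM estimate are all illustrative assumptions based on the description above):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch: flush when buffered RAM crosses a threshold, rather than
// after a fixed number of added documents (the old setMaxBufferedDocs).
public class RamBufferedWriter {
    private long ramBufferSize = 32L * 1024 * 1024; // default 32 MB
    private long bytesUsed = 0;
    private final List<String> buffered = new ArrayList<>();
    public int flushCount = 0;

    public void setRAMBufferSize(long bytes) { this.ramBufferSize = bytes; }

    public void addDocument(String doc) {
        buffered.add(doc);
        bytesUsed += 2L * doc.length(); // rough per-doc RAM estimate (chars are 2 bytes)
        if (bytesUsed >= ramBufferSize) {
            flush();
        }
    }

    private void flush() {
        // A real writer would write a segment to disk here; we just reset.
        buffered.clear();
        bytesUsed = 0;
        flushCount++;
    }
}
```

With this policy, small documents accumulate longer before a flush and large documents trigger one sooner, which is the point of sizing by RAM instead of doc count.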

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/



