Quick question, Mike:

You talk about a RAM buffer from 1 MB to 96 MB, but then you also report the
amount of RAM used @ flush time (e.g. Avg RAM used (MB) @ flush:  old    34.5;
new     3.4 [   90.1% less]).

I don't follow 100% of what you are doing in LUCENE-843, so could you please
explain what these two different amounts of RAM are?
Is the first (1-96 MB) the RAM you use for in-memory merging of segments?
And what is the RAM used @ flush?  More precisely, why does that amount of
RAM exceed the RAM buffer?

Thanks,
Otis



----- Original Message ----
From: Michael McCandless (JIRA) <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, April 5, 2007 9:22:32 AM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to 
buffer added documents


    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]

Michael McCandless commented on LUCENE-843:
-------------------------------------------


OK, I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.

I used the "normal"-sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and set
autoCommit=false.

Here're the results:

NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)
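
For concreteness, here is a rough sketch of that kind of setup against the
Lucene 2.2-era API.  The field names, the choice of which fields are stored,
and the "create" flag are my guesses, not the actual benchmark code:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class IndexBenchSketch {

    public static void main(String[] args) throws Exception {
      // autoCommit=false: nothing becomes visible until the writer is
      // closed, i.e. a single commit at the very end
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/path/to/index"),
          false,                      // autoCommit
          new StandardAnalyzer(),
          true);                      // create a new index
      writer.setMergeFactor(10);

      for (int i = 0; i < 200000; i++) {
        Document doc = new Document();

        // ~5,500 bytes of plain text, indexed with term vectors
        // (positions + offsets); whether the body itself was stored
        // is a guess
        doc.add(new Field("body", makeBodyText(i),
            Field.Store.NO, Field.Index.TOKENIZED,
            Field.TermVector.WITH_POSITIONS_OFFSETS));

        // two small stored fields
        doc.add(new Field("docid", Integer.toString(i),
            Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("date", "20070405",
            Field.Store.YES, Field.Index.UN_TOKENIZED));

        writer.addDocument(doc);
      }

      writer.close();   // the one and only commit
    }

    // placeholder for the ~5,500-byte plain-text docs used in the test
    private static String makeBodyText(int i) {
      return "...";
    }
  }
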


1 MB

  old
    200000 docs in 862.2 secs
    index size = 1.7G

  new
    200000 docs in 297.1 secs
    index size = 1.7G

  Total Docs/sec:             old   232.0; new   673.2 [  190.2% faster]
  Docs/MB @ flush:            old    47.2; new   278.4 [  489.6% more]
  Avg RAM used (MB) @ flush:  old    34.5; new     3.4 [   90.1% less]



2 MB

  old
    200000 docs in 828.7 secs
    index size = 1.7G

  new
    200000 docs in 279.0 secs
    index size = 1.7G

  Total Docs/sec:             old   241.3; new   716.8 [  197.0% faster]
  Docs/MB @ flush:            old    47.0; new   322.4 [  586.7% more]
  Avg RAM used (MB) @ flush:  old    37.9; new     4.5 [   88.0% less]



4 MB

  old
    200000 docs in 840.5 secs
    index size = 1.7G

  new
    200000 docs in 260.8 secs
    index size = 1.7G

  Total Docs/sec:             old   237.9; new   767.0 [  222.3% faster]
  Docs/MB @ flush:            old    46.8; new   363.1 [  675.4% more]
  Avg RAM used (MB) @ flush:  old    33.9; new     6.5 [   80.9% less]



8 MB

  old
    200000 docs in 678.8 secs
    index size = 1.7G

  new
    200000 docs in 248.8 secs
    index size = 1.7G

  Total Docs/sec:             old   294.6; new   803.7 [  172.8% faster]
  Docs/MB @ flush:            old    46.8; new   392.4 [  739.1% more]
  Avg RAM used (MB) @ flush:  old    60.3; new    10.7 [   82.2% less]



16 MB

  old
    200000 docs in 660.6 secs
    index size = 1.7G

  new
    200000 docs in 247.3 secs
    index size = 1.7G

  Total Docs/sec:             old   302.8; new   808.7 [  167.1% faster]
  Docs/MB @ flush:            old    46.7; new   415.4 [  788.8% more]
  Avg RAM used (MB) @ flush:  old    47.1; new    19.2 [   59.3% less]



24 MB

  old
    200000 docs in 658.1 secs
    index size = 1.7G

  new
    200000 docs in 243.0 secs
    index size = 1.7G

  Total Docs/sec:             old   303.9; new   823.0 [  170.8% faster]
  Docs/MB @ flush:            old    46.7; new   430.9 [  822.2% more]
  Avg RAM used (MB) @ flush:  old    70.0; new    27.5 [   60.8% less]



32 MB

  old
    200000 docs in 714.2 secs
    index size = 1.7G

  new
    200000 docs in 239.2 secs
    index size = 1.7G

  Total Docs/sec:             old   280.0; new   836.0 [  198.5% faster]
  Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
  Avg RAM used (MB) @ flush:  old    92.5; new    36.7 [   60.3% less]



48 MB

  old
    200000 docs in 640.3 secs
    index size = 1.7G

  new
    200000 docs in 236.0 secs
    index size = 1.7G

  Total Docs/sec:             old   312.4; new   847.5 [  171.3% faster]
  Docs/MB @ flush:            old    46.7; new   438.5 [  838.8% more]
  Avg RAM used (MB) @ flush:  old   138.9; new    52.8 [   62.0% less]



64 MB

  old
    200000 docs in 649.3 secs
    index size = 1.7G

  new
    200000 docs in 238.3 secs
    index size = 1.7G

  Total Docs/sec:             old   308.0; new   839.3 [  172.5% faster]
  Docs/MB @ flush:            old    46.7; new   441.3 [  844.7% more]
  Avg RAM used (MB) @ flush:  old   302.6; new    72.7 [   76.0% less]



80 MB

  old
    200000 docs in 670.2 secs
    index size = 1.7G

  new
    200000 docs in 227.2 secs
    index size = 1.7G

  Total Docs/sec:             old   298.4; new   880.5 [  195.0% faster]
  Docs/MB @ flush:            old    46.7; new   446.2 [  855.2% more]
  Avg RAM used (MB) @ flush:  old   231.7; new    94.3 [   59.3% less]



96 MB

  old
    200000 docs in 683.4 secs
    index size = 1.7G

  new
    200000 docs in 226.8 secs
    index size = 1.7G

  Total Docs/sec:             old   292.7; new   882.0 [  201.4% faster]
  Docs/MB @ flush:            old    46.7; new   448.0 [  859.1% more]
  Avg RAM used (MB) @ flush:  old   274.5; new   112.7 [   59.0% less]
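
(As a reading aid: the bracketed deltas are just ratios of the new and old
numbers, and docs/sec is 200000 divided by the elapsed seconds.  A small
sketch of that arithmetic, reconstructed from the 1 MB row above:)

  public class DeltaSketch {
    public static void main(String[] args) {
      int numDocs = 200000;
      double oldSecs = 862.2, newSecs = 297.1;

      double oldDocsPerSec = numDocs / oldSecs;   // ~232.0
      double newDocsPerSec = numDocs / newSecs;   // ~673.2
      // "% faster" = how much the new docs/sec exceeds the old docs/sec
      System.out.printf("faster: %.1f%%%n",
          (newDocsPerSec / oldDocsPerSec - 1) * 100);   // ~190.2

      double oldRam = 34.5, newRam = 3.4;
      // "% less" = how much less RAM new uses at flush time than old
      System.out.printf("less RAM: %.1f%%%n",
          (1 - newRam / oldRam) * 100);                 // ~90.1
    }
  }
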


Some observations:

  * Remember that the test is already biased against "new": with the
    patch you get an optimized index at the end, but with "old" you
    don't.

  * Sweet spot for old (trunk) seems to be 48 MB: that is the peak
    docs/sec @ 312.4.

  * New (with patch) seems to just get faster the more memory you give
    it, though only gradually.  The peak was at 96 MB (the largest I
    ran), so there is no clear sweet spot (or maybe I need to give it
    more memory, but above 96 MB the trunk run was starting to swap on
    my test env).

  * New gets better and better RAM efficiency the more RAM you give it.
    This makes sense: it can compress the terms dict more effectively
    when more docs are merged in RAM before having to flush to disk.  I
    would also expect this curve to be somewhat content-dependent.


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>         Assigned To: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments are
> merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added a
> requirement for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated), so that it flushes according to RAM usage and not a fixed
> number of documents added.
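
To make that externally visible change concrete, here is a minimal sketch of
switching from setMaxBufferedDocs to the patch's setRAMBufferSize.  The method
name is the one given in the description above; the argument and its units are
an assumption on my part (the API that eventually shipped in IndexWriter is
setRAMBufferSizeMB(double)):

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class RamBufferSketch {
    public static void main(String[] args) throws Exception {
      IndexWriter writer = new IndexWriter(
          FSDirectory.getDirectory("/path/to/index"),
          new StandardAnalyzer(),
          true);                                 // create a new index

      // Old behavior: flush a new segment after a fixed number of docs.
      // writer.setMaxBufferedDocs(1000);        // deprecated by the patch

      // New behavior per the patch: flush whenever the buffered postings
      // use more than the configured amount of RAM, regardless of how
      // many docs have been added.  Units assumed to be bytes here.
      writer.setRAMBufferSize(16 * 1024 * 1024);

      // ... addDocument() calls ...
      writer.close();
    }
  }
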

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




