Wow, impressive numbers, congrats!
----- Original Message ----
From: Michael McCandless (JIRA) <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, 5 April, 2007 3:22:32 PM
Subject: [jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486942 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------
OK I ran old (trunk) vs new (this patch) with increasing RAM buffer
sizes up to 96 MB.
I used the "normal" sized docs (~5,500 bytes plain text), left stored
fields and term vectors (positions + offsets) on, and
autoCommit=false.
Here are the results:
NUM THREADS = 1
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)
1 MB
old
200000 docs in 862.2 secs
index size = 1.7G
new
200000 docs in 297.1 secs
index size = 1.7G
Total Docs/sec: old 232.0; new 673.2 [ 190.2% faster]
Docs/MB @ flush: old 47.2; new 278.4 [ 489.6% more]
Avg RAM used (MB) @ flush: old 34.5; new 3.4 [ 90.1% less]
2 MB
old
200000 docs in 828.7 secs
index size = 1.7G
new
200000 docs in 279.0 secs
index size = 1.7G
Total Docs/sec: old 241.3; new 716.8 [ 197.0% faster]
Docs/MB @ flush: old 47.0; new 322.4 [ 586.7% more]
Avg RAM used (MB) @ flush: old 37.9; new 4.5 [ 88.0% less]
4 MB
old
200000 docs in 840.5 secs
index size = 1.7G
new
200000 docs in 260.8 secs
index size = 1.7G
Total Docs/sec: old 237.9; new 767.0 [ 222.3% faster]
Docs/MB @ flush: old 46.8; new 363.1 [ 675.4% more]
Avg RAM used (MB) @ flush: old 33.9; new 6.5 [ 80.9% less]
8 MB
old
200000 docs in 678.8 secs
index size = 1.7G
new
200000 docs in 248.8 secs
index size = 1.7G
Total Docs/sec: old 294.6; new 803.7 [ 172.8% faster]
Docs/MB @ flush: old 46.8; new 392.4 [ 739.1% more]
Avg RAM used (MB) @ flush: old 60.3; new 10.7 [ 82.2% less]
16 MB
old
200000 docs in 660.6 secs
index size = 1.7G
new
200000 docs in 247.3 secs
index size = 1.7G
Total Docs/sec: old 302.8; new 808.7 [ 167.1% faster]
Docs/MB @ flush: old 46.7; new 415.4 [ 788.8% more]
Avg RAM used (MB) @ flush: old 47.1; new 19.2 [ 59.3% less]
24 MB
old
200000 docs in 658.1 secs
index size = 1.7G
new
200000 docs in 243.0 secs
index size = 1.7G
Total Docs/sec: old 303.9; new 823.0 [ 170.8% faster]
Docs/MB @ flush: old 46.7; new 430.9 [ 822.2% more]
Avg RAM used (MB) @ flush: old 70.0; new 27.5 [ 60.8% less]
32 MB
old
200000 docs in 714.2 secs
index size = 1.7G
new
200000 docs in 239.2 secs
index size = 1.7G
Total Docs/sec: old 280.0; new 836.0 [ 198.5% faster]
Docs/MB @ flush: old 46.7; new 432.2 [ 825.2% more]
Avg RAM used (MB) @ flush: old 92.5; new 36.7 [ 60.3% less]
48 MB
old
200000 docs in 640.3 secs
index size = 1.7G
new
200000 docs in 236.0 secs
index size = 1.7G
Total Docs/sec: old 312.4; new 847.5 [ 171.3% faster]
Docs/MB @ flush: old 46.7; new 438.5 [ 838.8% more]
Avg RAM used (MB) @ flush: old 138.9; new 52.8 [ 62.0% less]
64 MB
old
200000 docs in 649.3 secs
index size = 1.7G
new
200000 docs in 238.3 secs
index size = 1.7G
Total Docs/sec: old 308.0; new 839.3 [ 172.5% faster]
Docs/MB @ flush: old 46.7; new 441.3 [ 844.7% more]
Avg RAM used (MB) @ flush: old 302.6; new 72.7 [ 76.0% less]
80 MB
old
200000 docs in 670.2 secs
index size = 1.7G
new
200000 docs in 227.2 secs
index size = 1.7G
Total Docs/sec: old 298.4; new 880.5 [ 195.0% faster]
Docs/MB @ flush: old 46.7; new 446.2 [ 855.2% more]
Avg RAM used (MB) @ flush: old 231.7; new 94.3 [ 59.3% less]
96 MB
old
200000 docs in 683.4 secs
index size = 1.7G
new
200000 docs in 226.8 secs
index size = 1.7G
Total Docs/sec: old 292.7; new 882.0 [ 201.4% faster]
Docs/MB @ flush: old 46.7; new 448.0 [ 859.1% more]
Avg RAM used (MB) @ flush: old 274.5; new 112.7 [ 59.0% less]
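For anyone checking the summary lines, here is a minimal sketch of how they appear to be derived from the raw figures. The class and method names are made up for illustration and are not part of the patch or the benchmark harness. Fed the 1 MB block, it reproduces 232.0 and 673.2 docs/sec, 190.2% faster, and 90.1% less RAM.

```java
import java.util.Locale;

/** Hypothetical helper: recomputes the summary lines of one benchmark block. */
final class BenchmarkSummary {

    /** Throughput in documents per second. */
    static double docsPerSec(long docs, double seconds) {
        return docs / seconds;
    }

    /** How much larger newValue is than oldValue, in percent. */
    static double percentMore(double oldValue, double newValue) {
        return (newValue / oldValue - 1.0) * 100.0;
    }

    /** How much smaller newValue is than oldValue, in percent. */
    static double percentLess(double oldValue, double newValue) {
        return (1.0 - newValue / oldValue) * 100.0;
    }

    public static void main(String[] args) {
        // Raw figures copied from the 1 MB block: 200000 docs, 862.2 vs 297.1 secs.
        double oldRate = docsPerSec(200_000, 862.2);
        double newRate = docsPerSec(200_000, 297.1);
        System.out.printf(Locale.ROOT,
                "Total Docs/sec: old %.1f; new %.1f [ %.1f%% faster]%n",
                oldRate, newRate, percentMore(oldRate, newRate));
        // Avg RAM used at flush in the same block: 34.5 MB vs 3.4 MB.
        System.out.printf(Locale.ROOT,
                "Avg RAM used (MB) @ flush: [ %.1f%% less]%n",
                percentLess(34.5, 3.4));
    }
}
```

Note that "N% faster" here means the ratio minus one, so 190.2% faster is roughly 2.9x the old throughput. Some of the posted percentages were presumably computed from unrounded values, so recomputing them from the rounded figures can differ in the last digit.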
Some observations:
* Remember the test is already biased against "new" because with the
patch you get an optimized index in the end but with "old" you
don't.
* Sweet spot for old (trunk) seems to be 48 MB: that is the peak
docs/sec @ 312.4.
* New (with patch) seems to just get faster the more memory you give
it, though gradually. The peak was 96 MB (the largest I ran). So
no sweet spot (or maybe I need to give more memory, but, above 96
MB the trunk was starting to swap on my test env).
* New gets better and better RAM efficiency the more RAM you give it.
This makes sense: the more docs are merged in RAM before having to
flush to disk, the better it is able to compress the terms dict. I
would also expect this curve to be somewhat content dependent.
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges. Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
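To make the flush-by-RAM idea concrete, here is a rough sketch of a buffer that flushes when the estimated size of the pending documents reaches a byte budget, rather than after a fixed document count. Everything in it is hypothetical (the class name, the UTF-8 size estimate, the no-op flush); it is not Lucene code and not the MultiDocumentWriter from the patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of flushing by RAM usage instead of doc count. */
final class RamBoundedBuffer {
    private final long ramBudgetBytes;
    private final List<String> pending = new ArrayList<>();
    private long bytesUsed = 0;
    private int flushCount = 0;

    RamBoundedBuffer(long ramBudgetBytes) {
        if (ramBudgetBytes <= 0) {
            throw new IllegalArgumentException("budget must be positive");
        }
        this.ramBudgetBytes = ramBudgetBytes;
    }

    /** Buffers one document and flushes once the byte budget is reached. */
    void addDocument(String text) {
        pending.add(text);
        // Crude size estimate (an assumption): the UTF-8 length of the text.
        bytesUsed += text.getBytes(StandardCharsets.UTF_8).length;
        if (bytesUsed >= ramBudgetBytes) {
            flush();
        }
    }

    /** A real writer would write a segment here; the sketch only resets state. */
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        pending.clear();
        bytesUsed = 0;
        flushCount++;
    }

    int flushCount() { return flushCount; }

    int pendingDocs() { return pending.size(); }

    /** Builds an ASCII string of the given length to stand in for a document. */
    static String filler(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('x');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        RamBoundedBuffer buffer = new RamBoundedBuffer(1024);
        for (int i = 0; i < 10; i++) {
            buffer.addDocument(filler(100)); // 1000 bytes in total, under budget
        }
        System.out.println("flushes after small docs: " + buffer.flushCount());
        buffer.addDocument(filler(600));     // pushes the estimate past 1024 bytes
        System.out.println("flushes after large doc: " + buffer.flushCount());
    }
}
```

With a count-based trigger, ten tiny docs and ten huge docs flush at the same point. With a byte budget, the number of docs per flush adapts to their size, which is the behavior the description attributes to the new setting.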
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]