[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12492668 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------
Results with the above patch:

RAM = 32 MB
NUM THREADS = 1
MERGE FACTOR = 10

2000000 DOCS @ ~550 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        2000000 docs in 782.8 secs
        index size = 436M
      new
        2000000 docs in 93.4 secs
        index size = 430M

      Total Docs/sec:            old 2554.8; new 21421.1 [738.5% faster]
      Docs/MB @ flush:           old 128.0; new 4058.0 [3069.6% more]
      Avg RAM used (MB) @ flush: old 140.2; new 38.0 [72.9% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        2000000 docs in 780.2 secs
        index size = 436M
      new
        2000000 docs in 90.6 secs
        index size = 427M

      Total Docs/sec:            old 2563.3; new 22086.8 [761.7% faster]
      Docs/MB @ flush:           old 128.0; new 4118.4 [3116.7% more]
      Avg RAM used (MB) @ flush: old 144.6; new 36.4 [74.8% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        2000000 docs in 1227.6 secs
        index size = 2.1G
      new
        2000000 docs in 559.8 secs
        index size = 2.1G

      Total Docs/sec:            old 1629.2; new 3572.5 [119.3% faster]
      Docs/MB @ flush:           old 93.1; new 4058.0 [4259.1% more]
      Avg RAM used (MB) @ flush: old 193.4; new 38.5 [80.1% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        2000000 docs in 1229.2 secs
        index size = 2.1G
      new
        2000000 docs in 241.0 secs
        index size = 2.1G

      Total Docs/sec:            old 1627.0; new 8300.0 [410.1% faster]
      Docs/MB @ flush:           old 93.1; new 4118.4 [4323.9% more]
      Avg RAM used (MB) @ flush: old 150.5; new 38.4 [74.5% less]
200000 DOCS @ ~5,500 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        200000 docs in 352.2 secs
        index size = 406M
      new
        200000 docs in 86.4 secs
        index size = 403M

      Total Docs/sec:            old 567.9; new 2313.7 [307.4% faster]
      Docs/MB @ flush:           old 83.5; new 420.0 [402.7% more]
      Avg RAM used (MB) @ flush: old 76.8; new 38.1 [50.4% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        200000 docs in 399.2 secs
        index size = 406M
      new
        200000 docs in 89.6 secs
        index size = 400M

      Total Docs/sec:            old 501.0; new 2231.0 [345.3% faster]
      Docs/MB @ flush:           old 83.5; new 422.6 [405.8% more]
      Avg RAM used (MB) @ flush: old 76.7; new 36.2 [52.7% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        200000 docs in 594.2 secs
        index size = 1.7G
      new
        200000 docs in 229.0 secs
        index size = 1.7G

      Total Docs/sec:            old 336.6; new 873.3 [159.5% faster]
      Docs/MB @ flush:           old 47.9; new 420.0 [776.9% more]
      Avg RAM used (MB) @ flush: old 157.8; new 38.0 [75.9% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        200000 docs in 605.1 secs
        index size = 1.7G
      new
        200000 docs in 181.3 secs
        index size = 1.7G

      Total Docs/sec:            old 330.5; new 1103.1 [233.7% faster]
      Docs/MB @ flush:           old 47.9; new 422.6 [782.2% more]
      Avg RAM used (MB) @ flush: old 132.0; new 37.1 [71.9% less]
20000 DOCS @ ~55,000 bytes plain text

  No term vectors nor stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        20000 docs in 180.8 secs
        index size = 350M
      new
        20000 docs in 79.1 secs
        index size = 349M

      Total Docs/sec:            old 110.6; new 252.8 [128.5% faster]
      Docs/MB @ flush:           old 25.0; new 46.8 [87.0% more]
      Avg RAM used (MB) @ flush: old 112.2; new 44.3 [60.5% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        20000 docs in 180.1 secs
        index size = 350M
      new
        20000 docs in 75.9 secs
        index size = 347M

      Total Docs/sec:            old 111.0; new 263.5 [137.3% faster]
      Docs/MB @ flush:           old 25.0; new 47.5 [89.7% more]
      Avg RAM used (MB) @ flush: old 111.1; new 42.5 [61.7% less]

  With term vectors (positions + offsets) and 2 small stored fields

    AUTOCOMMIT = true (commit whenever RAM is full)

      old
        20000 docs in 323.1 secs
        index size = 1.4G
      new
        20000 docs in 183.9 secs
        index size = 1.4G

      Total Docs/sec:            old 61.9; new 108.7 [75.7% faster]
      Docs/MB @ flush:           old 10.4; new 46.8 [348.3% more]
      Avg RAM used (MB) @ flush: old 74.2; new 44.9 [39.5% less]

    AUTOCOMMIT = false (commit only once at the end)

      old
        20000 docs in 323.5 secs
        index size = 1.4G
      new
        20000 docs in 135.6 secs
        index size = 1.4G

      Total Docs/sec:            old 61.8; new 147.5 [138.5% faster]
      Docs/MB @ flush:           old 10.4; new 47.5 [354.8% more]
      Avg RAM used (MB) @ flush: old 74.3; new 42.9 [42.2% less]
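
For anyone trying to reproduce numbers in this ballpark, here is a minimal sketch
of an indexing loop matching the settings above (merge factor 10, 32 MB RAM
buffer, 2M small docs, no term vectors or stored fields). It is not the actual
benchmark driver used for these runs; the setRAMBufferSize call is the method
proposed by this patch and its exact signature is assumed, the AUTOCOMMIT
variants are not shown, and body() is only a stand-in for however the plain
text is sourced.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class IndexBench {
      public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/tmp/bench-index"),
            new StandardAnalyzer(),
            true);                                  // create a fresh index

        writer.setMergeFactor(10);                  // MERGE FACTOR = 10
        // Proposed by this patch: flush by RAM usage instead of doc count.
        // Method name/signature assumed from the description; not in stock 2.2.
        writer.setRAMBufferSize(32 * 1024 * 1024);  // RAM = 32 MB

        for (int i = 0; i < 2000000; i++) {
          Document doc = new Document();
          // ~550 bytes of plain text; no stored fields, no term vectors.
          // For the "with term vectors" runs, Field.Store.YES and
          // Field.TermVector.WITH_POSITIONS_OFFSETS would be used instead.
          doc.add(new Field("body", body(i),
                            Field.Store.NO,
                            Field.Index.TOKENIZED,
                            Field.TermVector.NO));
          writer.addDocument(doc);
        }
        writer.close();
      }

      private static String body(int i) {
        return "placeholder plain text for document " + i;
      }
    }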
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assigned To: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges. Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But I haven't made any changes to Lucene's file format, nor added any
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.
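
The description above is compact, so here is a small conceptual sketch of the
buffering policy it describes: stored fields and term vectors go straight to
disk as each document arrives, postings and term infos accumulate in RAM, and a
flush is triggered by estimated RAM usage rather than by a document count. This
illustrates the idea only and is not the code in the attached patches; the class
name and the helper methods (writeStoredFieldsAndVectors, bufferPostings,
flushBufferedPostings, bytesUsed) are hypothetical.

    import java.io.IOException;
    import org.apache.lucene.document.Document;

    // Conceptual sketch only -- not the implementation in LUCENE-843.patch.
    class BufferedSegmentWriter {

      private final long ramBufferSize;   // RAM budget in bytes, e.g. 32 MB
      private long bytesUsed;             // estimated RAM held by buffered postings

      BufferedSegmentWriter(long ramBufferSize) {
        this.ramBufferSize = ramBufferSize;
      }

      void addDocument(Document doc) throws IOException {
        writeStoredFieldsAndVectors(doc);   // streamed to disk; no RAM retained
        bytesUsed += bufferPostings(doc);   // postings & term infos stay in RAM
        if (bytesUsed >= ramBufferSize) {
          flushBufferedPostings();          // in-RAM buffers -> on-disk run,
          bytesUsed = 0;                    // buffers recycled, not reallocated
        }
      }

      // Hypothetical helpers standing in for the real per-document work.
      private void writeStoredFieldsAndVectors(Document doc) throws IOException {}
      private long bufferPostings(Document doc) { return 0; }
      private void flushBufferedPostings() throws IOException {}
    }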