[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502793 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------
I ran a benchmark that indexes with more than 1 thread, in order to
compare the concurrency of trunk and the patch. The test is otherwise
the same as above, and runs on a 4-core Mac Pro (OS X) box with a
4-drive RAID 0 I/O system.
Here are the raw results:
DOCS = ~5,500 bytes plain text
RAM = 32 MB
MERGE FACTOR = 10
With term vectors (positions + offsets) and 2 small stored fields
AUTOCOMMIT = false (commit only once at the end)
NUM THREADS = 1
new
200000 docs in 172.3 secs
index size = 1.7G
old
200000 docs in 539.5 secs
index size = 1.7G
Total Docs/sec: old 370.7; new 1161.0 [ 213.2% faster]
Docs/MB @ flush: old 47.9; new 334.6 [ 598.7% more]
Avg RAM used (MB) @ flush: old 131.9; new 33.1 [ 74.9% less]
NUM THREADS = 2
new
200001 docs in 130.8 secs
index size = 1.7G
old
200001 docs in 452.8 secs
index size = 1.7G
Total Docs/sec: old 441.7; new 1529.3 [ 246.2% faster]
Docs/MB @ flush: old 47.9; new 301.5 [ 529.7% more]
Avg RAM used (MB) @ flush: old 226.1; new 35.2 [ 84.4% less]
NUM THREADS = 3
new
200002 docs in 105.4 secs
index size = 1.7G
old
200002 docs in 428.4 secs
index size = 1.7G
Total Docs/sec: old 466.8; new 1897.9 [ 306.6% faster]
Docs/MB @ flush: old 47.9; new 277.8 [ 480.2% more]
Avg RAM used (MB) @ flush: old 289.8; new 37.0 [ 87.2% less]
NUM THREADS = 4
new
200003 docs in 104.8 secs
index size = 1.7G
old
200003 docs in 440.4 secs
index size = 1.7G
Total Docs/sec: old 454.1; new 1908.5 [ 320.3% faster]
Docs/MB @ flush: old 47.9; new 259.9 [ 442.9% more]
Avg RAM used (MB) @ flush: old 293.7; new 37.1 [ 87.3% less]
NUM THREADS = 5
new
200004 docs in 99.5 secs
index size = 1.7G
old
200004 docs in 425.0 secs
index size = 1.7G
Total Docs/sec: old 470.6; new 2010.5 [ 327.2% faster]
Docs/MB @ flush: old 47.9; new 245.3 [ 412.6% more]
Avg RAM used (MB) @ flush: old 390.9; new 38.3 [ 90.2% less]
NUM THREADS = 6
new
200005 docs in 106.3 secs
index size = 1.7G
old
200005 docs in 427.1 secs
index size = 1.7G
Total Docs/sec: old 468.2; new 1882.3 [ 302.0% faster]
Docs/MB @ flush: old 47.8; new 248.5 [ 419.3% more]
Avg RAM used (MB) @ flush: old 340.9; new 38.7 [ 88.6% less]
NUM THREADS = 7
new
200006 docs in 106.1 secs
index size = 1.7G
old
200006 docs in 435.2 secs
index size = 1.7G
Total Docs/sec: old 459.6; new 1885.3 [ 310.2% faster]
Docs/MB @ flush: old 47.8; new 248.7 [ 420.0% more]
Avg RAM used (MB) @ flush: old 408.6; new 39.1 [ 90.4% less]
NUM THREADS = 8
new
200007 docs in 109.0 secs
index size = 1.7G
old
200007 docs in 469.2 secs
index size = 1.7G
Total Docs/sec: old 426.3; new 1835.2 [ 330.5% faster]
Docs/MB @ flush: old 47.8; new 251.3 [ 425.5% more]
Avg RAM used (MB) @ flush: old 448.9; new 39.0 [ 91.3% less]
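For anyone who wants to re-derive the bracketed percentages above, here
is a minimal sketch of the arithmetic. The class and method names are
made up for illustration and are not part of the benchmark code. Because
the inputs printed above are already rounded, a recomputed value can
differ from the reported one in the last digit.

```java
import java.util.Locale;

/** Recomputes the derived columns of the results from the raw figures. */
final class BenchMetrics {
    /** Throughput in documents per second. */
    static double docsPerSec(long docs, double secs) {
        return docs / secs;
    }

    /** "X% faster" / "X% more": relative increase of newValue over oldValue. */
    static double percentMore(double oldValue, double newValue) {
        return (newValue / oldValue - 1.0) * 100.0;
    }

    /** "X% less": relative reduction of newValue below oldValue. */
    static double percentLess(double oldValue, double newValue) {
        return (1.0 - newValue / oldValue) * 100.0;
    }

    public static void main(String[] args) {
        // Raw figures taken from the NUM THREADS = 1 run.
        double oldRate = docsPerSec(200000, 539.5);
        double newRate = docsPerSec(200000, 172.3);
        System.out.printf(Locale.ROOT,
                "Total Docs/sec: old %.1f; new %.1f [%.1f%% faster]%n",
                oldRate, newRate, percentMore(oldRate, newRate));
        System.out.printf(Locale.ROOT,
                "Avg RAM used (MB) @ flush: [%.1f%% less]%n",
                percentLess(131.9, 33.1));
    }
}
```

Note that "213.2% faster" means the new throughput is about 3.13x the
old one, not 2.13x.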
Some quick comments:
* Both trunk & the patch get faster when more than 1 thread does the
indexing. This is expected since the machine has 4 cores.
* The biggest speedup is from 1->2 threads, but there are still good
gains from 2->5 threads.
* Best seems to be 5 threads (2010.5 docs/sec with the patch, 470.6
with trunk); beyond that, throughput levels off or drops slightly.
* The patch allows better concurrency: its advantage over trunk
generally grows as threads are added (213.2% faster with 1 thread,
330.5% faster with 8, though not strictly monotonically). I think
this makes sense because we flush less often with the patch, and
flushing is time consuming and single threaded.
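To make the multi-threaded setup concrete, here is a rough sketch of
the kind of harness such a test implies: several worker threads feed
documents to one shared writer and the wall-clock time is measured.
This is not the benchmark code used above. DocSink is a hypothetical
stand-in for the shared writer's add-document call, and all names here
are assumptions made for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class ThreadedIndexBench {
    /** Hypothetical stand-in for adding one document to a shared writer. */
    interface DocSink {
        void add(int docId) throws Exception;
    }

    /** Feeds totalDocs documents to sink from numThreads workers; returns elapsed seconds. */
    static double run(int numThreads, int totalDocs, DocSink sink) {
        AtomicInteger next = new AtomicInteger();     // next document id to hand out
        AtomicInteger failures = new AtomicInteger(); // documents whose add() threw
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        long start = System.nanoTime();
        for (int t = 0; t < numThreads; t++) {
            pool.execute(() -> {
                int id;
                // Each worker claims ids until the shared counter is exhausted.
                while ((id = next.getAndIncrement()) < totalDocs) {
                    try {
                        sink.add(id);
                    } catch (Exception e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.HOURS)) {
                throw new IllegalStateException("indexing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        if (failures.get() > 0) {
            throw new IllegalStateException(failures.get() + " documents failed");
        }
        return (System.nanoTime() - start) / 1e9;
    }

    public static void main(String[] args) {
        AtomicInteger added = new AtomicInteger();
        for (int threads = 1; threads <= 8; threads++) {
            added.set(0);
            // The sink here only counts; a real test would add to the index instead.
            double secs = run(threads, 200_000, id -> added.incrementAndGet());
            System.out.printf("threads=%d docs=%d secs=%.3f%n", threads, added.get(), secs);
        }
    }
}
```

With a sink that serializes part of its work (as a single-threaded
flush would), the elapsed time stops improving once that serialized
part dominates, which is the effect the last comment above points at.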
> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
> Key: LUCENE-843
> URL: https://issues.apache.org/jira/browse/LUCENE-843
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Affects Versions: 2.2
> Reporter: Michael McCandless
> Assignee: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-843.patch, LUCENE-843.take2.patch,
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch,
> LUCENE-843.take6.patch, LUCENE-843.take7.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents. I haven't changed anything after that, eg how segments are
> merged.
> The basic ideas are:
> * Write stored fields and term vectors directly to disk (don't
> use up RAM for these).
> * Gather posting lists & term infos in RAM, but periodically do
> in-RAM merges. Once RAM is full, flush buffers to disk (and
> merge them later when it's time to make a real segment).
> * Recycle objects/buffers to reduce time/stress in GC.
> * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (setMaxBufferedDocs is
> deprecated), so that it flushes according to RAM usage rather than
> after a fixed number of documents has been added.
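
The difference between the two flush triggers described in the quoted
text can be illustrated with a toy buffer. This is only a sketch, not
Lucene code: FlushPolicyDemo and its byte accounting are hypothetical,
and it assumes each document's in-RAM size is known up front, which the
real patch does not need to assume.

```java
/** Toy document buffer contrasting the two flush triggers; not Lucene code. */
final class FlushPolicyDemo {
    private final long ramBudgetBytes; // flush once buffered bytes reach this; 0 disables
    private final int maxBufferedDocs; // flush once buffered docs reach this; 0 disables
    private long bufferedBytes;
    private int bufferedDocs;
    private long peakBytes;
    private int flushCount;

    FlushPolicyDemo(long ramBudgetBytes, int maxBufferedDocs) {
        this.ramBudgetBytes = ramBudgetBytes;
        this.maxBufferedDocs = maxBufferedDocs;
    }

    /** Buffers one document of the given (assumed known) in-RAM size. */
    void add(long docBytes) {
        bufferedBytes += docBytes;
        bufferedDocs++;
        peakBytes = Math.max(peakBytes, bufferedBytes);
        boolean ramFull = ramBudgetBytes > 0 && bufferedBytes >= ramBudgetBytes;
        boolean docsFull = maxBufferedDocs > 0 && bufferedDocs >= maxBufferedDocs;
        if (ramFull || docsFull) {
            flush();
        }
    }

    /** Stands in for writing the buffered data out as a segment. */
    void flush() {
        bufferedBytes = 0;
        bufferedDocs = 0;
        flushCount++;
    }

    long peakBytes() { return peakBytes; }

    int flushCount() { return flushCount; }

    public static void main(String[] args) {
        FlushPolicyDemo byDocs = new FlushPolicyDemo(0, 10);
        FlushPolicyDemo byRam = new FlushPolicyDemo(1000, 0);
        for (int i = 0; i < 100; i++) {
            long size = (i % 2 == 0) ? 50 : 500; // alternate small and large docs
            byDocs.add(size);
            byRam.add(size);
        }
        System.out.println("by docs: flushes=" + byDocs.flushCount()
                + " peak=" + byDocs.peakBytes());
        System.out.println("by RAM:  flushes=" + byRam.flushCount()
                + " peak=" + byRam.peakBytes());
    }
}
```

With a document-count trigger, peak memory depends on how large the
documents happen to be. With a RAM trigger, peak memory stays near the
configured budget regardless of document size.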