Hi Michael,
I know you've got your hands full, but I was wondering if you could
either post your benchmark code, or better yet, hook it into the
benchmarker contrib (it is quite easy).
Let me know if I can help,
Grant
On Jun 21, 2007, at 10:01 AM, Michael McCandless (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506907 ]
Michael McCandless commented on LUCENE-843:
-------------------------------------------
OK I ran tests comparing analyzer performance.
It's the same test framework as above, using the ~5,500 byte Europarl
docs with autoCommit=true, 32 MB RAM buffer, no stored fields nor
vectors, and CFS=false, indexing 200,000 documents.
The SimpleSpaceAnalyzer is my own whitespace analyzer that minimizes
GC cost by not allocating a Term or String for every token in every
document.
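To illustrate the idea (this is a hypothetical sketch, not the actual SimpleSpaceAnalyzer from the test), a whitespace tokenizer can scan a shared char[] buffer and report token boundaries through a callback, so no Term or String object is allocated per token:

```java
// Hypothetical sketch of an allocation-free whitespace tokenizer.
// Tokens are reported as (start, length) offsets into the caller's
// buffer via a callback, so nothing is allocated per token.
public class SimpleSpaceTokenizer {

    /** Callback invoked with the start offset and length of each token. */
    public interface TokenSink {
        void token(char[] buf, int start, int length);
    }

    /** Scans buf[0..len) for whitespace-separated tokens; returns the count. */
    public static int tokenize(char[] buf, int len, TokenSink sink) {
        int count = 0;
        int start = -1;  // -1 means we are between tokens
        for (int i = 0; i < len; i++) {
            boolean ws = Character.isWhitespace(buf[i]);
            if (!ws && start < 0) {
                start = i;                          // token begins
            } else if (ws && start >= 0) {
                sink.token(buf, start, i - start);  // token ends
                count++;
                start = -1;
            }
        }
        if (start >= 0) {                           // trailing token
            sink.token(buf, start, len - start);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] doc = "hello   world foo".toCharArray();
        System.out.println(tokenize(doc, doc.length, (buf, s, l) -> {}));
    }
}
```

The caller decides when (or whether) to materialize a String from the offsets, so documents that are tokenized and discarded never touch the garbage collector.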
Each run is best time of 2 runs:
ANALYZER               PATCH (sec)   TRUNK (sec)   SPEEDUP
SimpleSpaceAnalyzer     79.0         326.5         4.1 X
StandardAnalyzer       449.0         674.1         1.5 X
WhitespaceAnalyzer     104.0         338.9         3.3 X
SimpleAnalyzer         104.7         328.0         3.1 X
StandardAnalyzer is definitely rather time-consuming!
improve how IndexWriter uses RAM to buffer added documents
----------------------------------------------------------
Key: LUCENE-843
URL: https://issues.apache.org/jira/browse/LUCENE-843
Project: Lucene - Java
Issue Type: Improvement
Components: Index
Affects Versions: 2.2
Reporter: Michael McCandless
Assignee: Michael McCandless
Priority: Minor
Attachments: index.presharedstores.cfs.zip,
index.presharedstores.nocfs.zip, LUCENE-843.patch,
LUCENE-843.take2.patch, LUCENE-843.take3.patch,
LUCENE-843.take4.patch, LUCENE-843.take5.patch,
LUCENE-843.take6.patch, LUCENE-843.take7.patch,
LUCENE-843.take8.patch, LUCENE-843.take9.patch
I'm working on a new class (MultiDocumentWriter) that writes more
than one document directly into a single Lucene segment, more
efficiently than the current approach.
This only affects the creation of an initial segment from added
documents. I haven't changed anything after that, e.g. how segments
are merged.
The basic ideas are:
* Write stored fields and term vectors directly to disk (don't
use up RAM for these).
* Gather posting lists & term infos in RAM, but periodically do
in-RAM merges. Once RAM is full, flush buffers to disk (and
merge them later when it's time to make a real segment).
* Recycle objects/buffers to reduce time/stress in GC.
* Other various optimizations.
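The flush-by-RAM idea in the second bullet can be sketched roughly as follows. This is an illustrative toy, not the patch's actual code: the class name DocBuffer and the byte estimate are made up, and a real flush would merge the in-RAM posting lists and write a segment rather than just clearing a list.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of flush-by-RAM buffering: accumulate documents in memory
// and flush whenever the estimated RAM usage crosses a fixed budget.
public class DocBuffer {
    private final long ramBudgetBytes;
    private long ramUsedBytes = 0;
    private final List<String> pending = new ArrayList<>();
    private int flushCount = 0;

    public DocBuffer(long ramBudgetBytes) {
        this.ramBudgetBytes = ramBudgetBytes;
    }

    public void addDocument(String doc) {
        pending.add(doc);
        ramUsedBytes += 2L * doc.length();  // crude estimate: 2 bytes/char
        if (ramUsedBytes >= ramBudgetBytes) {
            flush();
        }
    }

    private void flush() {
        // The real patch would merge in-RAM posting lists here and write
        // them out as segment files; this sketch just drops the buffer.
        pending.clear();
        ramUsedBytes = 0;
        flushCount++;
    }

    public int flushCount() {
        return flushCount;
    }
}
```

The point of triggering on bytes rather than document count is that flushes then track actual memory pressure, which matters when document sizes vary widely.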
Some of these changes are similar to how KinoSearch builds a segment.
But, I haven't made any changes to Lucene's file format nor added
requirements for a global fields schema.
So far the only externally visible change is a new method
"setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
deprecated) so that it flushes according to RAM usage and not a fixed
number of documents added.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/