[ https://issues.apache.org/jira/browse/LUCENE-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1125:
---------------------------------------

    Attachment: LUCENE-1125.patch

Attached patch.

I ran a test where I index the first 6M docs from Wikipedia, preprocessed to 100 bytes each, using this alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
work.dir=/lucene/work
doc.stored = true
doc.term.vector = true
doc.term.vector.positions = true
doc.term.vector.offsets = true
ram.flush.mb = 64
compound = false
autocommit = false
docs.file=/Volumes/External/lucene/wikifull100.txt
doc.add.log.step=10000
directory=FSDirectory

ResetSystemErase

{ "BuildIndex"
  CreateIndex
  [ { "AddDocs" AddDoc > : 1500000 ]: 4
  CloseIndex
}

RepSumByPrefRound BuildIndex
{code}

With this fix, it takes 158.5 seconds. Without it, it takes 621.8 seconds = 3.9X slower!

The fix is very low risk. All tests pass.

Michael, I think we should spin 2.3 RC2 to include this fix? Sorry to only find it so late in the game :(

> Excessive Arrays.fill(0) in DocumentsWriter drastically slows down small docs (3.9X slowdown!)
> ----------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1125
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1125
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Index
>    Affects Versions: 2.3
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 2.3
>
>         Attachments: LUCENE-1125.patch
>
>
> I've been doing some "final" performance testing of 2.3RC1 and
> uncovered a fairly serious bug that adds a large fixed CPU cost when
> documents have any term vector enabled fields.
> The bug does not affect correctness, just performance.
> Basically, for every document, we were calling Arrays.fill(0) on a
> large (32 KB) byte array when in fact we only needed to zero a small
> part of it. This only happens if term vectors are turned on, and is
> especially devastating for small documents.
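
To make the description above concrete, here is a minimal sketch of the general pattern: clearing a whole 32 KB buffer per document versus clearing only the bytes that document actually touched. This is not the code from LUCENE-1125.patch (the patch is not reproduced in this message) and not DocumentsWriter itself; the class name, method names, buffer layout, and the 100-bytes-used figure are all assumptions chosen to mirror the numbers in the issue.

```java
import java.util.Arrays;

// Illustrative only: every name here is hypothetical, not Lucene's.
class ZeroingSketch {
    // Matches the "large (32 KB) byte array" mentioned in the issue.
    static final int BUFFER_SIZE = 32 * 1024;

    /** Costly variant: clears the entire buffer no matter how little was written. */
    public static void resetFull(byte[] buffer) {
        Arrays.fill(buffer, (byte) 0);
    }

    /** Cheaper variant: clears only the prefix [0, used) that the last document wrote. */
    public static void resetUsed(byte[] buffer, int used) {
        Arrays.fill(buffer, 0, used, (byte) 0);
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[BUFFER_SIZE];
        int docs = 200_000;     // assumed doc count, kept small so the demo runs quickly
        int bytesPerDoc = 100;  // assumed bytes written per small document

        long t0 = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            Arrays.fill(buffer, 0, bytesPerDoc, (byte) 1); // simulate writing one doc
            resetFull(buffer);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < docs; i++) {
            Arrays.fill(buffer, 0, bytesPerDoc, (byte) 1); // simulate writing one doc
            resetUsed(buffer, bytesPerDoc);
        }
        long t2 = System.nanoTime();

        System.out.printf("full reset:    %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("partial reset: %d ms%n", (t2 - t1) / 1_000_000);
    }
}
```

The partial reset is only safe if the caller tracks exactly how much of the buffer was dirtied; the per-document work then scales with document size rather than with the buffer size, which is why the fixed cost hurts small documents most.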