[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1301:
---------------------------------------

    Attachment: LUCENE-1301.patch

New rev of the patch attached. I've fixed all nocommits, and all tests pass. I believe this version is ready to commit! I'll wait a few more days before committing...

I ran some indexing throughput tests, indexing Wikipedia docs from a line file using StandardAnalyzer. Each result is the best of 4 runs. Here's the alg:

{code}
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
doc.maker=org.apache.lucene.benchmark.byTask.feeds.LineDocMaker
docs.file=/Volumes/External/lucene/wiki.txt
doc.stored = true
doc.term.vector = true
doc.add.log.step=2000
directory=FSDirectory
autocommit=false
compound=false
work.dir=/lucene/work
ram.flush.mb=64

{ "Rounds"
  ResetSystemErase
  { "BuildIndex"
    - CreateIndex
    { "AddDocs" AddDoc > : 200000
    - CloseIndex
  }
  NewRound
} : 4

RepSumByPrefRound BuildIndex
{code}

This gives the following results with term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     1      1     200000     900.4     222.12   410,938,688 1,029,046,272
trunk BuildIndex     1      1     200000     969.0     206.39   400,372,256 1,029,046,272
2.3   BuildIndex     2      1     200002     905.4     220.89   391,630,016 1,029,046,272
{code}

And without term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     3      1     200000   1,297.5     154.15   399,966,592 1,029,046,272
trunk BuildIndex     1      1     200000   1,372.5     145.72   390,581,376 1,029,046,272
2.3   BuildIndex     1      1     200002   1,308.5     152.85   389,224,640 1,029,046,272
{code}

So, the bad news is that the refactoring has made things a bit (~5-7%) slower than current trunk. The good news is that trunk was already 6-7% faster than 2.3, so the two nearly cancel out.

If I repeat these tests using tiny docs (~100 bytes per body) instead, indexing the first 10 million docs, the slowdown is worse (~13-15% vs trunk, ~11-13% vs 2.3)... I think that's because the additional method calls introduced by the refactoring become a bigger part of the total time. With term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     3      1   10000000  38,320.1     260.96   313,980,832 1,029,046,272
trunk BuildIndex     2      1   10000000  45,194.1     221.27   414,987,072 1,029,046,272
2.3   BuildIndex     1      1   10000002  42,861.4     233.31   182,957,440 1,029,046,272
{code}

Without term vectors & stored fields:

{code}
      Operation  round runCnt recsPerRun     rec/s elapsedSec    avgUsedMem   avgTotalMem
patch BuildIndex     1      1   10000000  60,778.4     164.53   341,611,456 1,029,046,272
trunk BuildIndex     2      1   10000000  68,387.8     146.23   405,388,960 1,029,046,272
2.3   BuildIndex     0      1   10000002  68,052.7     146.95   330,334,912 1,029,046,272
{code}

I think these small slowdowns are worth the improvement in code clarity.
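For anyone who wants to reproduce runs like the above, here is a minimal sketch of driving an .alg file through contrib/benchmark's byTask framework. The driver class and the .alg path are placeholders of mine, and contrib/benchmark plus its dependencies are assumed to be on the classpath:

{code}
// Minimal sketch (not part of the patch): run a benchmark algorithm file
// through contrib/benchmark. The .alg path below is a placeholder.
import java.io.FileReader;

import org.apache.lucene.benchmark.byTask.Benchmark;

public class RunIndexingAlg {
  public static void main(String[] args) throws Exception {
    // Parse the algorithm file and execute it, printing the report at the end.
    Benchmark benchmark = new Benchmark(new FileReader("conf/wiki-indexing.alg"));
    benchmark.execute();
  }
}
{code}

Running contrib/benchmark's ant run-task target with the same alg file should do the equivalent from the command line.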
> Refactor DocumentsWriter
> ------------------------
>
>                 Key: LUCENE-1301
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1301
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1301.patch, LUCENE-1301.patch,
> LUCENE-1301.take2.patch, LUCENE-1301.take3.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
>
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
>
> And it's very much still a work in progress -- there are intermittent
> thread-safety issues, I need to add test cases and test/iterate on
> performance, there are many "nocommits", etc. This is a snapshot of my
> current state...
>
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing. E.g., DocConsumer
> consumes the whole document. DocFieldConsumer consumes separate
> fields, one at a time. InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer. TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
>
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing. Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>
> * NormsWriter holds norms in memory and then flushes them to _X.nrm.
> * FreqProxTermsWriter holds postings data in memory and then flushes
>   to _X.frq/prx.
> * StoredFieldsWriter flushes immediately to _X.fdx/fdt.
> * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd.
>
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
>
> In this first step everything is package-private, and the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk. Over time we can open this up.
>
> There are no changes to the index file format.
>
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>
> * Improved concurrency with mixed large/small docs: previously the
>   thread state would be tied up when docs finished indexing
>   out of order. Now it's not: instead I use a separate class to hold
>   any pending state to flush to the doc stores, and immediately free
>   up the thread state to index other docs.
> * Buffered norms in memory now remain sparse until flushed to the
>   _X.nrm file. Previously we would "fill holes" in the in-memory norms
>   as we went, which could easily use way too much memory. Really this
>   isn't a solution to the problem of sparse norms (LUCENE-830); it
>   just prevents that issue from causing a memory blowup during
>   indexing; memory use will still blow up during searching.
>
> I expect performance (indexing throughput) will be worse with this
> change. I'll profile & iterate to minimize this, but I think we can
> accept some loss. I also plan to measure the benefit of manually
> recycling RawPostingList instances from our own pool vs. letting GC
> recycle them.
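To make the consumer levels described above a bit more concrete, here is a rough, illustrative sketch of the shape these abstractions take. The class names are suffixed with "Sketch" and the method signatures are simplified stand-ins of mine, not the actual package-private abstract classes from the patch:

{code}
// Illustrative sketch only -- not the actual signatures from the patch.
import java.io.IOException;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.document.Fieldable;

/** Consumes a whole document: the top of the indexing chain. */
abstract class DocConsumerSketch {
  abstract void processDocument() throws IOException;
  /** Write any buffered state into the files of the segment being flushed. */
  abstract void flush() throws IOException;
  /** Throw away all buffered state, e.g. after an aborting exception. */
  abstract void abort();
}

/** Consumes the document's fields, one at a time. */
abstract class DocFieldConsumerSketch {
  abstract void processField(Fieldable field) throws IOException;
}

/** Consumes the tokens produced by running a field through its analyzer. */
abstract class InvertedDocConsumerSketch {
  abstract void addToken(Token token) throws IOException;
}

/** Writes its own bytes into the in-memory, byte-slice posting list for a term. */
abstract class TermsHashConsumerSketch {
  abstract void writeTermBytes(char[] termText, byte[] bytes) throws IOException;
}
{code}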
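And as a minimal sketch of the second change (buffered norms staying sparse until flush), assuming a made-up SparseNormsSketch class rather than the patch's actual NormsWriter: norms are buffered only for docs that actually had the field, and the default norm fills the holes only when the bytes destined for _X.nrm are written:

{code}
// Minimal sketch of buffering norms sparsely and filling holes only at flush
// time. Illustrative only; the patch's real NormsWriter will differ.
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Map;
import java.util.TreeMap;

class SparseNormsSketch {
  private final Map<Integer,Byte> buffered = new TreeMap<Integer,Byte>();
  private final byte defaultNorm;

  SparseNormsSketch(byte defaultNorm) {
    this.defaultNorm = defaultNorm;
  }

  /** Record the norm for a document that actually had this field. */
  void setNorm(int docID, byte norm) {
    buffered.put(docID, norm);
  }

  /** Only at flush time are the "holes" (docs without the field) filled in. */
  void flush(OutputStream out, int numDocs) throws IOException {
    for (int docID = 0; docID < numDocs; docID++) {
      Byte norm = buffered.get(docID);
      out.write(norm == null ? defaultNorm : norm.byteValue());
    }
  }

  public static void main(String[] args) throws IOException {
    SparseNormsSketch norms = new SparseNormsSketch((byte) 0x7f);
    norms.setNorm(0, (byte) 0x78);
    norms.setNorm(5, (byte) 0x6b);     // docs 1-4 stay un-buffered until flush
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    norms.flush(out, 8);               // holes filled with the default norm here
    System.out.println(out.size() + " norm bytes written");
  }
}
{code}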