[ https://issues.apache.org/jira/browse/LUCENE-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-1301:
---------------------------------------

    Attachment: LUCENE-1301.take3.patch

Attached new rev of the patch.

bq. I'm seeing NullPointerExceptions in TestStressIndexing2 though,

I believe this patch fixes that. All tests should now pass.

> Refactor DocumentsWriter
> ------------------------
>
>                 Key: LUCENE-1301
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1301
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3, 2.3.1, 2.3.2, 2.4
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.4
>
>         Attachments: LUCENE-1301.patch, LUCENE-1301.take2.patch, LUCENE-1301.take3.patch
>
>
> I've been working on refactoring DocumentsWriter to make it more
> modular, so that adding new indexing functionality (like column-stride
> stored fields, LUCENE-1231) is just a matter of adding a plugin into
> the indexing chain.
>
> This is an initial step towards flexible indexing (but there is still
> a lot more to do!).
>
> And it's very much still a work in progress -- there are intermittent
> thread safety issues, I need to add test cases and test/iterate on
> performance, there are many "nocommits", etc. This is a snapshot of my
> current state...
>
> The approach introduces "consumers" (abstract classes defining the
> interface) at different levels during indexing. E.g., DocConsumer
> consumes the whole document. DocFieldConsumer consumes separate
> fields, one at a time. InvertedDocConsumer consumes tokens produced
> by running each field through the analyzer. TermsHashConsumer writes
> its own bytes into in-memory posting lists stored in byte slices,
> indexed by term, etc.
>
> DocumentsWriter*.java is then much simpler: it only interacts with a
> DocConsumer and has no idea what that consumer is doing. Under that
> DocConsumer there is a whole "indexing chain" that does the real work:
>
> * NormsWriter holds norms in memory and then flushes them to _X.nrm.
> * FreqProxTermsWriter holds postings data in memory and then flushes
> to _X.frq/prx.
> * StoredFieldsWriter flushes immediately to _X.fdx/fdt.
> * TermVectorsTermsWriter flushes immediately to _X.tvx/tvf/tvd.
>
> DocumentsWriter still manages things like flushing a segment, closing
> doc stores, buffering & applying deletes, freeing memory, aborting
> when necessary, etc.
>
> In this first step, everything is package-private, and the indexing
> chain is hardwired (instantiated in DocumentsWriter) to the chain
> currently matching Lucene trunk. Over time we can open this up.
>
> There are no changes to the index file format.
>
> For the most part this is just a [large] refactoring, except for these
> two small actual changes:
>
> * Improved concurrency with mixed large/small docs: previously the
> thread state would be tied up when docs finished indexing
> out-of-order. Now, it's not: instead I use a separate class to
> hold any pending state to flush to the doc stores, and immediately
> free up the thread state to index other docs.
>
> * Buffered norms in memory now remain sparse, until flushed to the
> _X.nrm file. Previously we would "fill holes" in norms in memory,
> as we go, which could easily use way too much memory. Really this
> isn't a solution to the problem of sparse norms (LUCENE-830); it
> just delays that issue from causing memory blowup during indexing;
> memory use will still blow up during searching.
>
> I expect performance (indexing throughput) will be worse with this
> change. I'll profile & iterate to minimize this, but I think we can
> accept some loss. I also plan to measure the benefit of manually
> recycling RawPostingList instances from our own pool, vs letting GC
> recycle them.
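
For readers who want a concrete picture of the layered "consumers" described in the issue, here is a minimal Java sketch. It is not code from the patch: every class and method name below (DocConsumerSketch, FieldConsumerSketch, TokenConsumerSketch, WriterSketch, and so on) is a hypothetical simplification, analysis is faked with a whitespace split, and "flushing" just appends strings to a list instead of writing index files. It only illustrates the shape of the idea: the writer talks to a single document-level consumer, and the chain beneath it is hardwired in one place.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// All names here are hypothetical simplifications, not the classes in the patch.

/** Consumes a whole document (here: a list of {fieldName, text} pairs). */
abstract class DocConsumerSketch {
    abstract void processDocument(int docId, List<String[]> fields);
    abstract void flush(List<String> segmentOutput);
}

/** Consumes separate fields, one at a time. */
abstract class FieldConsumerSketch {
    abstract void processField(int docId, String name, String text);
    abstract void flush(List<String> segmentOutput);
}

/** Consumes the tokens produced by analyzing a field. */
abstract class TokenConsumerSketch {
    abstract void addToken(int docId, String field, String token);
    abstract void flush(List<String> segmentOutput);
}

/** Fans each document out to the per-field consumers. */
class FieldFanOut extends DocConsumerSketch {
    private final List<FieldConsumerSketch> consumers;
    FieldFanOut(List<FieldConsumerSketch> consumers) { this.consumers = consumers; }

    @Override void processDocument(int docId, List<String[]> fields) {
        for (String[] f : fields)
            for (FieldConsumerSketch c : consumers) c.processField(docId, f[0], f[1]);
    }
    @Override void flush(List<String> out) {
        for (FieldConsumerSketch c : consumers) c.flush(out);
    }
}

/** Stand-in for analysis: splits on whitespace and forwards each token. */
class InverterSketch extends FieldConsumerSketch {
    private final TokenConsumerSketch next;
    InverterSketch(TokenConsumerSketch next) { this.next = next; }

    @Override void processField(int docId, String name, String text) {
        for (String tok : text.split("\\s+"))
            if (!tok.isEmpty()) next.addToken(docId, name, tok);
    }
    @Override void flush(List<String> out) { next.flush(out); }
}

/** Buffers postings in memory, sorted by term, and emits them only on flush. */
class PostingsBufferSketch extends TokenConsumerSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    @Override void addToken(int docId, String field, String token) {
        postings.computeIfAbsent(field + ":" + token, k -> new ArrayList<>()).add(docId);
    }
    @Override void flush(List<String> out) {
        for (Map.Entry<String, List<Integer>> e : postings.entrySet())
            out.add(e.getKey() + " -> " + e.getValue());
        postings.clear(); // buffered state is released once the segment is flushed
    }
}

/** The writer sees only a DocConsumerSketch, never the chain beneath it. */
class WriterSketch {
    private final DocConsumerSketch consumer;
    private int nextDocId = 0;
    WriterSketch(DocConsumerSketch consumer) { this.consumer = consumer; }

    void addDocument(List<String[]> fields) { consumer.processDocument(nextDocId++, fields); }

    List<String> flush() {
        List<String> out = new ArrayList<>();
        consumer.flush(out);
        return out;
    }

    /** The chain is hardwired in one place, mirroring the first step described above. */
    static WriterSketch withDefaultChain() {
        List<FieldConsumerSketch> perField = new ArrayList<>();
        perField.add(new InverterSketch(new PostingsBufferSketch()));
        return new WriterSketch(new FieldFanOut(perField));
    }
}
```

In this sketch, adding new per-field behaviour means adding another FieldConsumerSketch to the list built in withDefaultChain(); WriterSketch itself does not change, which is the modularity the refactoring is aiming for.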