[jira] Commented: (LUCENE-843) improve how IndexWriter uses RAM to buffer added documents

Michael Busch (JIRA) Thu, 21 Jun 2007 10:09:53 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506961 ]


Michael Busch commented on LUCENE-843:
--------------------------------------

> OK I ran tests comparing analyzer performance.

Thanks for the numbers, Mike. Yes, the gain is smaller with
StandardAnalyzer, but 1.5X faster is still very good!


I have a question about the extensibility of your code. For flexible
indexing we want to be able to implement different posting formats in
the future, and we might even want to allow our users to implement
their own posting formats.

When I implemented multi-level skipping I tried to keep this in mind. 
Therefore I put most of the functionality in the two abstract classes
MultiLevelSkipListReader/Writer. Subclasses implement the actual format
of the skip data. I think with this design it should be quite easy to
implement different formats in the future while limiting the code
complexity.

With the old DocumentWriter I think this would also be quite simple to
do, because DocumentWriter is so simple: we could add a class like
PostingListWriter whose subclasses define the actual format.
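
To make the idea concrete, here is a rough sketch of the kind of split I
have in mind. It is only an illustration: PostingListWriter and both
subclasses are hypothetical names, nothing like this exists in Lucene
today, and the two encodings are placeholders rather than Lucene's real
posting format. The abstract class keeps the format-independent
bookkeeping, and each subclass only decides how one posting is encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: none of these classes exist in Lucene.

/** Format-independent part: doc ordering check and delta computation. */
abstract class PostingListWriter {
    protected final DataOutputStream out;
    private int lastDoc = -1;
    private int count = 0;

    PostingListWriter(DataOutputStream out) {
        this.out = out;
    }

    /** Adds one (doc, freq) posting; doc IDs must be strictly increasing. */
    final void addPosting(int doc, int freq) throws IOException {
        if (doc <= lastDoc) {
            throw new IllegalArgumentException("doc IDs must increase: " + doc);
        }
        // The first posting is stored as a delta from 0.
        int docDelta = (count == 0) ? doc : doc - lastDoc;
        writePosting(docDelta, freq);
        lastDoc = doc;
        count++;
    }

    int count() {
        return count;
    }

    /** Subclasses define the actual encoding of a single posting. */
    protected abstract void writePosting(int docDelta, int freq) throws IOException;
}

/** Placeholder format 1: two fixed-width 4-byte ints per posting. */
final class FixedWidthPostingListWriter extends PostingListWriter {
    FixedWidthPostingListWriter(DataOutputStream out) {
        super(out);
    }

    @Override
    protected void writePosting(int docDelta, int freq) throws IOException {
        out.writeInt(docDelta);
        out.writeInt(freq);
    }
}

/** Placeholder format 2: variable-length ints, 7 payload bits per byte. */
final class VIntPostingListWriter extends PostingListWriter {
    VIntPostingListWriter(DataOutputStream out) {
        super(out);
    }

    @Override
    protected void writePosting(int docDelta, int freq) throws IOException {
        writeVInt(docDelta);
        writeVInt(freq);
    }

    private void writeVInt(int value) throws IOException {
        // Emit low 7 bits first; the high bit marks "more bytes follow".
        while ((value & ~0x7F) != 0) {
            out.writeByte((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.writeByte(value);
    }
}

class PostingListWriterDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream fixed = new ByteArrayOutputStream();
        ByteArrayOutputStream vint = new ByteArrayOutputStream();
        PostingListWriter[] writers = {
            new FixedWidthPostingListWriter(new DataOutputStream(fixed)),
            new VIntPostingListWriter(new DataOutputStream(vint))
        };
        // The calling code is identical for both formats.
        for (PostingListWriter w : writers) {
            w.addPosting(3, 1);
            w.addPosting(10, 2);
        }
        System.out.println("fixed: " + fixed.size() + " bytes"); // fixed: 16 bytes
        System.out.println("vint: " + vint.size() + " bytes");   // vint: 4 bytes
    }
}
```

The point is that the caller only ever sees addPosting(), so swapping the
format does not touch the indexing code. Whether such a split can sit on
top of shared byte arrays without losing the speedup is exactly what I
am unsure about.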

Do you think your code is easily extensible in this regard? I'm
wondering because of all the optimizations you're doing, such as
sharing byte arrays. But I'm certainly not familiar enough with your
code yet, so I'm only guessing here.


> improve how IndexWriter uses RAM to buffer added documents
> ----------------------------------------------------------
>
>                 Key: LUCENE-843
>                 URL: https://issues.apache.org/jira/browse/LUCENE-843
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.2
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>         Attachments: index.presharedstores.cfs.zip, 
> index.presharedstores.nocfs.zip, LUCENE-843.patch, LUCENE-843.take2.patch, 
> LUCENE-843.take3.patch, LUCENE-843.take4.patch, LUCENE-843.take5.patch, 
> LUCENE-843.take6.patch, LUCENE-843.take7.patch, LUCENE-843.take8.patch, 
> LUCENE-843.take9.patch
>
>
> I'm working on a new class (MultiDocumentWriter) that writes more than
> one document directly into a single Lucene segment, more efficiently
> than the current approach.
> This only affects the creation of an initial segment from added
> documents.  I haven't changed anything after that, e.g. how segments
> are merged.
> The basic ideas are:
>   * Write stored fields and term vectors directly to disk (don't
>     use up RAM for these).
>   * Gather posting lists & term infos in RAM, but periodically do
>     in-RAM merges.  Once RAM is full, flush buffers to disk (and
>     merge them later when it's time to make a real segment).
>   * Recycle objects/buffers to reduce time/stress in GC.
>   * Other various optimizations.
> Some of these changes are similar to how KinoSearch builds a segment.
> But, I haven't made any changes to Lucene's file format nor added
> requirements for a global fields schema.
> So far the only externally visible change is a new method
> "setRAMBufferSize" in IndexWriter (and setMaxBufferedDocs is
> deprecated) so that it flushes according to RAM usage and not a fixed
> number of documents added.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


