On Jan 24, 2008 5:47 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
> Yonik Seeley wrote:
>
> > On Jan 23, 2008 6:34 AM, Michael McCandless
> > <[EMAIL PROTECTED]> wrote:
> >> writer.freezeDocIDs();
> >> try {
> >> get docIDs from somewhere & call writer.deleteByDocID
> >> } finally {
> >> writer.unfreezeDocIDs();
> >> }
> >
> > Interesting idea, but would require the IndexWriter to flush the
> > buffered docs so an IndexReader could be created from them. (or would
> > require the existence of an UnflushedDocumentsIndexReader)
>
> True.
>
> Actually, an UnflushedDocumentsIndexReader would not be hard!
>
> DocumentsWriter already has an IndexInput (ByteSliceReader) that can
> read the postings for a single term from the RAM buffer (this is used
> when flushing the segment). I think it'd be straightforward to get
> TermEnum/TermDocs/TermPositions iterators on the buffered docs.
> Norms are already stored as byte arrays in memory. FieldInfos is
> already available. The stored fields & term vectors are already
> flushed to the directory so they could be read normally.
>
> Hmm, buffered delete terms are tricky. I guess freezeDocIDs would
> have to flush deleted terms (and queries, if we add that) before
> making a reader accessible,
If we buffer queries, that would seem to take care of 99% of the
use cases that need an IndexReader, right? A custom query could get
ids from an index however it wanted.
> though, the cost is shared because the
> readers need to be opened anyway (so the app can find docIDs).
>
> So maybe this approach becomes this:
>
> // Returns a "point in time" frozen view of index...
> IndexReader reader = writer.getReader();
> try {
> <get docIDs from reader, delete by docID>
> } finally {
> writer.releaseReader();
> }
>
> ?
>
> We may even be able to implement this w/o actually freezing the
> writer, i.e., still allowing add/updateDocument calls to proceed.
> Merging could certainly still proceed. This way you could at any
> time ask a writer for a "point in time" reader, independent of what
> else you are doing with the writer. This would require, on flushing,
> that writer goes and swaps in a "real" segment reader, limited to a
> specified docID, for any point in time readers that are open.
Wow... sounds complex.
> >> If we went that route, we'd need to expose methods in IndexWriter to
> >> let you get reader(s), and, to then delete by docID.
> >
> > Right... I had envisioned a callback that was called after a new
> > segment was created/flushed that passed IndexReader[]. In an
> > environment of mixed deletes and adds, it would avoid slowing down the
> > indexing part by limiting where the deletes happen.
>
> This would certainly be less work :) I guess the question is how
> severely are we limiting the application by requiring that you can
> only do deletes when IW decides to flush, or, by forcing the
> application to flush when it wants to do deletes.
Seems like more work, rather than limiting... "when" really isn't as
important as long as it's before a new external IndexReader is opened
for searching.
> > It does put a little more burden on the user, but a slightly harder
> > (but more powerful / more efficient) API is preferable since easier
> > APIs can always be built on top (but not vice-versa).
>
> True, though emulating the easier API on top of the "you get to
> delete only when IW flushes" means you are forcing a flush, right?
I was thinking via buffering (the same way term deletes are handled now).
You keep track of maxDoc() at the time of the delete and defer it until later.
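
A minimal sketch of that buffering idea, in plain Java with no Lucene calls. Everything named here (DeferredDeletes, DocSelector, delete, apply) is hypothetical and not existing IndexWriter API. It also assumes docIDs are not renumbered by a merge between the delete call and the flush.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical sketch: buffer deletes and apply them when a flush happens. */
public class DeferredDeletes {

    /** Hypothetical stand-in for "whatever picks the docs to delete". */
    public interface DocSelector {
        boolean matches(int docId);
    }

    /** One buffered delete plus maxDoc() as seen when it was requested. */
    private static final class Pending {
        final DocSelector selector;
        final int maxDocAtDelete;

        Pending(DocSelector selector, int maxDocAtDelete) {
            this.selector = selector;
            this.maxDocAtDelete = maxDocAtDelete;
        }
    }

    private final List<Pending> pending = new ArrayList<Pending>();

    /** Buffer the delete; nothing is touched until apply() runs. */
    public void delete(DocSelector selector, int maxDocAtDelete) {
        pending.add(new Pending(selector, maxDocAtDelete));
    }

    public int pendingCount() {
        return pending.size();
    }

    /**
     * Called at flush time. A buffered delete only affects docs that
     * existed when it was requested (docId < maxDocAtDelete), so docs
     * added afterwards survive even if the selector matches them.
     */
    public BitSet apply(int maxDocAtFlush) {
        BitSet deleted = new BitSet(maxDocAtFlush);
        for (Pending p : pending) {
            int limit = Math.min(p.maxDocAtDelete, maxDocAtFlush);
            for (int doc = 0; doc < limit; doc++) {
                if (p.selector.matches(doc)) {
                    deleted.set(doc);
                }
            }
        }
        pending.clear();
        return deleted;
    }
}
```

The point is that the application can call delete() whenever it likes without forcing a flush; the recorded maxDoc keeps the semantics the same as if the delete had been applied immediately.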
-Yonik