Re: Realtime Search

Jason Rutherglen Thu, 29 Jan 2009 17:17:00 -0800

> We'd also need to ensure when a merge kicks off, the SegmentReaders
used by the merging are not newly reopened but also "borrowed" from
the already open IR.

The IW merge code currently opens the SegmentReader with a 4096
buffer size (different from the 1024 default). How will this case be
handled?

> reopen would then flush any added docs to new segments

IR.reopen would call IW.flush?
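
To make the question concrete, here is a toy model of the semantics
described in the quoted text: adds and deletes are buffered, and only
reopen makes them visible in a new reader. This is a sketch under my own
assumptions, not Lucene code. ToyRealtimeIndex, Snapshot, and their
methods are hypothetical names, and java.util.BitSet stands in for
whatever structure holds deletedDocs.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model only: none of these classes or methods are Lucene APIs. */
final class ToyRealtimeIndex {
    /** Immutable point-in-time view, standing in for an IndexReader. */
    static final class Snapshot {
        private final List<String> docs;
        private final BitSet deleted;

        Snapshot(List<String> docs, BitSet deleted) {
            // Defensive copies: later changes to the index never leak into this view.
            this.docs = new ArrayList<>(docs);
            this.deleted = (BitSet) deleted.clone();
        }

        int numDocs() {
            return docs.size() - deleted.cardinality();
        }

        boolean contains(String doc) {
            for (int i = 0; i < docs.size(); i++) {
                if (!deleted.get(i) && docs.get(i).equals(doc)) {
                    return true;
                }
            }
            return false;
        }
    }

    private final List<String> flushedDocs = new ArrayList<>();
    private final BitSet deleted = new BitSet();
    private final List<String> bufferedAdds = new ArrayList<>();
    private final Set<String> bufferedDeletes = new HashSet<>();

    void add(String doc) {
        bufferedAdds.add(doc); // not visible until reopen()
    }

    void delete(String doc) {
        bufferedDeletes.add(doc); // not visible until reopen()
    }

    /** Flushes buffered adds, marks buffered deletes in RAM, returns a new view. */
    Snapshot reopen() {
        flushedDocs.addAll(bufferedAdds);
        bufferedAdds.clear();
        // Simplification: a buffered delete hits every matching doc flushed so far,
        // regardless of whether the add came before or after the delete.
        for (int i = 0; i < flushedDocs.size(); i++) {
            if (bufferedDeletes.contains(flushedDocs.get(i))) {
                deleted.set(i);
            }
        }
        bufferedDeletes.clear();
        return new Snapshot(flushedDocs, deleted);
    }
}
```

In this model a reader obtained before a delete keeps seeing the
document until the caller reopens, which is the "transactional" reader
behavior under discussion.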

> When IW.commit is called, it also then asks each SegmentReader to
commit. Ie, IR.commit would not be used.

Why is this? SegmentReader.commitChanges would be called instead?

> Then when reopen is called, we must internally reopen that clone()
such that its deleted docs are carried over to the newly reopened
reader and newly flushed docs from IW are visible as new
SegmentReaders.

Suppose deletes are made to the external reader (meaning the one obtained
by IW.getReader), then more deletes are made via IW.deleteDocuments, and
then reopen is called. What happens in this case? Will we need to merge
the del docs from the internal clone into the newly reopened reader?
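
If a merge is needed, the simplest version I can picture is a union of
the two delete sets for a segment. The sketch below assumes both sides
track deletes per doc id as bits; it uses java.util.BitSet rather than
Lucene's own structure, and DeleteMerge.mergeDeletes is a hypothetical
helper, not an existing method.

```java
import java.util.BitSet;

/** Toy helper only: not part of Lucene. */
final class DeleteMerge {
    /**
     * Returns a new bit set marking every doc deleted on either side.
     * Neither input is modified.
     */
    static BitSet mergeDeletes(BitSet externalReaderDeletes, BitSet internalCloneDeletes) {
        BitSet merged = (BitSet) externalReaderDeletes.clone();
        merged.or(internalCloneDeletes);
        return merged;
    }
}
```

A union only works while both readers see the same segment with the
same doc ids, so it would not cover deletes against a segment that has
since been merged away.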

> the IR becomes transactional as well -- deletes are not visible
immediately until reopen is called

Interesting. I'd rather somehow merge the IW and external reader's
deletes, otherwise it seems like we're radically changing how IR
works. Perhaps the IW keeps a copy of the external IR that has the
write lock (thinking of IR.clone where the write lock is passed onto
the latest clone). This way IW.getReader is about the same as
reopen/clone (because it will call reopen on presumably the latest
IR).
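
A rough sketch of the lock hand-off I have in mind, as a toy model and
not the actual IR.clone implementation. ToyLockedReader and
cloneWithLock are hypothetical names, and the "write lock" is reduced to
a boolean flag for illustration.

```java
import java.util.BitSet;

/** Toy model only: the write lock is a flag that moves to the newest clone. */
final class ToyLockedReader {
    private final BitSet deleted;
    private boolean holdsWriteLock;

    ToyLockedReader() {
        this(new BitSet(), true);
    }

    private ToyLockedReader(BitSet deleted, boolean holdsWriteLock) {
        this.deleted = deleted;
        this.holdsWriteLock = holdsWriteLock;
    }

    void deleteDocument(int docId) {
        if (!holdsWriteLock) {
            throw new IllegalStateException("write lock was passed to a newer clone");
        }
        deleted.set(docId);
    }

    boolean isDeleted(int docId) {
        return deleted.get(docId);
    }

    boolean holdsWriteLock() {
        return holdsWriteLock;
    }

    /** Copies the deletes into a new reader and hands it the lock, if this reader has it. */
    ToyLockedReader cloneWithLock() {
        ToyLockedReader next = new ToyLockedReader((BitSet) deleted.clone(), holdsWriteLock);
        holdsWriteLock = false; // the older reader becomes read-only
        return next;
    }
}
```

With this shape the writer would always delete through the latest clone,
so there is one set of deletes to carry forward on each reopen.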




On Sat, Jan 24, 2009 at 4:29 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> Jason Rutherglen wrote:
>
>  > "But I think for realtime we don't want to be using IW's deletion at
>> all.  We should do all deletes via the IndexReader.  In fact if IW has
>> handed out a reader (via getReader()) and that reader (or a reopened
>> derivative) remains open we may have to block deletions via IW.  Not
>> sure..."
>>
>> Can't IW use the IR to do its deletions?  Currently deletions in IW are
>> implemented in DocumentsWriter.applyDeletes by loading a segment with
>> SegmentReader.get() and making the deletions which causes term index load
>> overhead per flush.  If IW has an internal IR then the deletion process can
>> use it (not SegmentReader.get) and there should not be a conflict anymore
>> between the IR and IW deletion processes.
>>
>
> Today, IW quickly opens each SegmentReader, applies deletes, then
> commits & closes it, because we have considered it too costly to leave
> these readers open.
>
> But if you've opened a persistent IR via the IndexWriter anyway, we
> should use the SegmentReaders from that IR instead.
>
> It seems like the joint IR+IW would allow you to do adds, deletes,
> setNorms, all of which are not visible in the exposed IR until
> IR.reopen is called.  reopen would then flush any added docs to new
> segments, materialize any buffered deletes into the BitVectors (or
> future transactional sorted int tree thingy), likewise for norms, and
> then return a new IR.
>
> Ie, the IR becomes transactional as well -- deletes are not visible
> immediately until reopen is called (unlike today when you delete via
> IR).  I think this means, internally when IW wants to make changes to
> the shared IR, it should make a clone() and do the changes privately
> to that instance.  Then when reopen is called, we must internally
> reopen that clone() such that its deleted docs are carried over to the
> newly reopened reader and newly flushed docs from IW are visible as
> new SegmentReaders.
>
> And on reopen, the deletes should not be flushed to the Directory --
> they only need to be "moved" into each SegmentReader's deletedDocs.
> We'd also need to ensure when a merge kicks off, the SegmentReaders
> used by the merging are not newly reopened but also "borrowed" from
> the already open IR.  This could actually mean that some deleted docs
> get merged away before the deletions ever get flushed to the Directory.
>
>  > "we may have to block deletions via IW"
>>
>> Hopefully they can be buffered.
>>
>> Where else does the write lock need to be coordinated between IR and IW?
>>
>> > "somehow IW & IR have to "split" the write lock else we may
>> need to merge deletions somehow."
>>
>> This is a part I'd like to settle on before the start of implementation.  It
>> looks like in IW deletes are buffered as terms or queries until flushed.  I
>> don't think there needs to be a lock until the flush is performed?
>>
>> For the merge changes to the index, the deletionpolicy can be used to
>> ensure a reader still has access to the segments it needs from the main
>> directory.
>>
>
> The write lock is held to prevent multiple writers from buffering and
> then writing changes to the index.  Since we will have this joint
> IR/IW share state, as long as we properly synchronize/share things
> between IR/IW, it's fine if they both "share" the write lock.
>
> It seems like IR.reopen suddenly means "have IW materialize all
> pending stuff and give me a new reader", where stuff is adds &
> deletes.  Adds must materialize via the directory.  Deletes can
> materialize entirely in RAM.  Likewise for norms.
>
> When IW.commit is called, it also then asks each SegmentReader to
> commit.  Ie, IR.commit would not be used.
>
>  > "We have to test performance to measure the net add -> search latency.
>> For many apps this approach may be plenty fast.  If your IO system is
>> an SSD it could be extremely fast.  Swapping in RAMDir
>> just makes it faster w/o changing the basic approach."
>>
>> It is true that this is the best way to start and in fact may be good enough
>> for many users.  It could help new users to expose a reader from IW so the
>> delineation between them is removed and Lucene becomes easier to use.
>>
>> At the very least this system allows concurrently updateable IR and IW due
>> to sharing the write lock, something that is currently incorrect in
>> Lucene.
>>
>
> I wouldn't call it "incorrect".  It was an explicit design tradeoff to
> make the division between IR & IW, and done for many good reasons.  We
> are now talking about relaxing that and it clearly raises a number of
> "challenging" issues...
>
>  > "Besides the transaction log (for crash recovery), which should fit
>> "above" Lucene nicely, what else is needed for realtime beyond the
>> single-transaction support Lucene already provides?"
>>
>> What we have described above (exposing IR via IW) will be sufficient and
>> realtime will live above it.
>>
>
> OK, good.
>
> In this model, the combined IR+IW is still jointly transactional, in
> that the IW's commit() method still behaves as it does today.  It's just
> that the IR that's linked to the IW is allowed to "see" changes, shared
> only in RAM, that a freshly opened IR on the index would not see until
> commit has been called.
>
>
> Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-dev-h...@lucene.apache.org
>
>
