Jason Rutherglen wrote:

Mike,

The other issue that will come up, which I have already addressed, is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently, in Ocean real-time search, the individual readers are searched using a MultiSearcher in order to search in parallel and reuse the field caches. How will field caching work with the IndexWriter approach? It seems like it would need a dynamically growing field cache array? That is a bit tricky. By doing in-memory merging in Ocean, the field caches last longer and do not require growing arrays.

First off, I think the combination of LUCENE-1231 and LUCENE-831, which should give us a FieldCache that is "distributed" down to each SegmentReader and much faster to initialize, will make incrementally updating the FieldCache much more efficient (i.e., on calling IndexReader.reopen, only the new segments should need to populate their FieldCache).
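
In other words, once those land, something like this should become cheap (just a sketch; getSubReaders is a stand-in for whatever API we end up exposing for the per-segment readers):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Sketch: warm the FieldCache after a reopen.  Only segments that are new
// since the last reopen should actually pay the cost of populating.
static IndexReader reopenAndWarm(IndexReader current, String intField)
    throws IOException {
  IndexReader newReader = current.reopen();
  if (newReader != current) {
    current.close();
  }
  for (IndexReader sub : getSubReaders(newReader)) {
    // With the cache held at the SegmentReader level, unchanged segments
    // find their arrays already populated and return immediately.
    FieldCache.DEFAULT.getInts(sub, intField);
  }
  return newReader;
}

// Stand-in only: not a Lucene API today.
static IndexReader[] getSubReaders(IndexReader reader) {
  throw new UnsupportedOperationException("not exposed by Lucene yet");
}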

Hopefully these land before real-time search, because then I have more API flexibility to expose column-stride fields on the in-RAM documents. There is still some trickiness, because an "ordinary" IndexWriter would never hold the column-stride fields in RAM; they'd be flushed to the Directory immediately, per document, just like stored fields and term vectors are today. So, maybe, the first RAMReader you get from the IndexWriter would load these fields back in, and IndexWriter would then keep appending to them as documents are added (maybe using exponentially growing arrays as the underlying store, or perhaps separate array fragments, to avoid synchronization when reading from them), such that subsequent reopens simply resync their max docID.
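
To make the "separate array fragments" idea concrete, here is the kind of structure I have in mind (a pure sketch, not code from any patch; it glosses over the memory-visibility handshake, which the reopen itself would provide, since a reader only ever looks below the maxDoc it saw at reopen):

// Sketch of an append-only int store built from fixed-size fragments.
// The writer appends one value per added document; a reader captures
// maxDoc() at (re)open time and only ever reads below that, so it never
// races with the writer and needs no locking on the read path.
class IntFragments {
  private static final int FRAGMENT_SIZE = 4096;

  // Fragments that are already filled never change, so readers can use
  // them freely; only the spine array is ever replaced, never a fragment.
  private int[][] fragments = new int[16][];
  private int size;                       // total number of values written

  // Called by the writer thread (under IndexWriter's own lock).
  synchronized void add(int value) {
    final int frag = size / FRAGMENT_SIZE;
    if (frag == fragments.length) {
      // Grow the spine; existing fragments are shared, not copied.
      int[][] newFragments = new int[fragments.length * 2][];
      System.arraycopy(fragments, 0, newFragments, 0, fragments.length);
      fragments = newFragments;
    }
    if (fragments[frag] == null) {
      fragments[frag] = new int[FRAGMENT_SIZE];
    }
    fragments[frag][size % FRAGMENT_SIZE] = value;
    size++;
  }

  // Called by readers with docID < the maxDoc they saw at reopen.
  int get(int docID) {
    return fragments[docID / FRAGMENT_SIZE][docID % FRAGMENT_SIZE];
  }
}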

How do you plan to handle rapidly deleting the docs of the
disk segments?  Can the SegmentReader clone patch be used for
this?

I was thinking we'd flush new .del files every time a reopen is called, but that could very well be costly. Instead, we can keep the deletes pending in the SegmentReaders we're holding open, and then go back to flushing on IndexWriter's normal schedule. Reopen then only has to "materialize" any buffered deletes by Term & Query, unless we decide to move that materialization up into the actual delete call, since we will have SegmentReaders open anyway. I think I'm leaning towards that approach... best to pay the cost as you go, instead of an aggregated cost on reopen?
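
To sketch what "materialize at the delete call" would look like against an already-open SegmentReader (PendingDeletes is made up, and the buffered bits would still have to be merged into the segment's .del file on IndexWriter's normal flush):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Sketch: per-segment deletes kept in RAM instead of flushing a new .del
// file on every reopen.  The bits are applied at delete time ("pay as you
// go") and only written to disk on IndexWriter's normal flush schedule.
class PendingDeletes {
  private final IndexReader segmentReader;   // one open SegmentReader
  private final BitSet deleted;              // buffered, unflushed deletes

  PendingDeletes(IndexReader segmentReader) {
    this.segmentReader = segmentReader;
    this.deleted = new BitSet(segmentReader.maxDoc());
  }

  // Materialize a delete-by-Term against this segment right away.
  void deleteByTerm(Term term) throws IOException {
    TermDocs td = segmentReader.termDocs(term);
    try {
      while (td.next()) {
        deleted.set(td.doc());
      }
    } finally {
      td.close();
    }
  }

  // Consulted by searches against the reopened reader.
  boolean isDeleted(int docID) {
    return deleted.get(docID) || segmentReader.isDeleted(docID);
  }
}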

Mike
