Jason Rutherglen wrote:

Mike,

The other issue that will come up, which I have already addressed, is the field caches. The underlying smaller IndexReaders will need to be exposed because of the field caching. Currently, in Ocean real-time search, the individual readers are searched using a MultiSearcher in order to search in parallel and reuse the field caches. How will field caching work with the IndexWriter approach? It seems like it would need a dynamically growing field cache array? That is a bit tricky. By doing in-memory merging in Ocean, the field caches last longer and do not require growing arrays.

First off, I think the combination of LUCENE-1231 and LUCENE-831, which should give us a FieldCache that is "distributed" down to each SegmentReader and much faster to initialize, will make incrementally updating the FieldCache much more efficient (i.e., on calling IndexReader.reopen, only the new segments should need to populate their FieldCache).
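
In other words, once those land, something like this should become cheap (just a sketch; getSubReaders is a stand-in for whatever API we end up exposing for the per-segment readers):

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Sketch: warm the FieldCache after a reopen.  Only segments that are new
// since the last reopen should actually pay the cost of populating.
static IndexReader reopenAndWarm(IndexReader current, String intField)
    throws IOException {
  IndexReader newReader = current.reopen();
  if (newReader != current) {
    current.close();
  }
  for (IndexReader sub : getSubReaders(newReader)) {
    // With the cache held at the SegmentReader level, unchanged segments
    // find their arrays already populated and return immediately.
    FieldCache.DEFAULT.getInts(sub, intField);
  }
  return newReader;
}

// Stand-in only: not a Lucene API today.
static IndexReader[] getSubReaders(IndexReader reader) {
  throw new UnsupportedOperationException("not exposed by Lucene yet");
}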

Hopefully these land before real-time search, because then I have more API flexibility to expose column-stride fields on the in-RAM documents. There is still some trickiness, because an "ordinary" IndexWriter would never hold the column-stride fields in RAM; they'd be flushed to the Directory immediately, per document, just like stored fields and term vectors are today. So, maybe, the first RAMReader you get from the IndexWriter would load these fields back in, and IndexWriter would then keep appending to them as documents are added (maybe using exponentially growing arrays as the underlying store, or perhaps separate array fragments, to avoid synchronization when reading from them), such that subsequent reopens simply resync their max docID.
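
To make the "separate array fragments" idea concrete, here is the kind of structure I have in mind (a pure sketch, not code from any patch; it glosses over the memory-visibility handshake, which the reopen itself would provide, since a reader only ever looks below the maxDoc it saw at reopen):

// Sketch of an append-only int store built from fixed-size fragments.
// The writer appends one value per added document; a reader captures
// maxDoc() at (re)open time and only ever reads below that, so it never
// races with the writer and needs no locking on the read path.
class IntFragments {
  private static final int FRAGMENT_SIZE = 4096;

  // Fragments that are already filled never change, so readers can use
  // them freely; only the spine array is ever replaced, never a fragment.
  private int[][] fragments = new int[16][];
  private int size;                       // total number of values written

  // Called by the writer thread (under IndexWriter's own lock).
  synchronized void add(int value) {
    final int frag = size / FRAGMENT_SIZE;
    if (frag == fragments.length) {
      // Grow the spine; existing fragments are shared, not copied.
      int[][] newFragments = new int[fragments.length * 2][];
      System.arraycopy(fragments, 0, newFragments, 0, fragments.length);
      fragments = newFragments;
    }
    if (fragments[frag] == null) {
      fragments[frag] = new int[FRAGMENT_SIZE];
    }
    fragments[frag][size % FRAGMENT_SIZE] = value;
    size++;
  }

  // Called by readers with docID < the maxDoc they saw at reopen.
  int get(int docID) {
    return fragments[docID / FRAGMENT_SIZE][docID % FRAGMENT_SIZE];
  }
}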

How do you plan to handle rapidly deleting the docs of the
disk segments?  Can the SegmentReader clone patch be used for
this?

I was thinking we'd flush new .del files every time a reopen is called, but that could very well be costly. Instead, we can keep the deletes pending in the SegmentReaders we're holding open, and then go back to flushing on IndexWriter's normal schedule. Reopen then only has to "materialize" any buffered deletes by Term & Query, unless we decide to move that materialization up into the actual delete call, since we will have SegmentReaders open anyway. I think I'm leaning towards that approach... best to pay the cost as you go, instead of an aggregated cost on reopen?
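
To sketch what "materialize at the delete call" would look like against an already-open SegmentReader (PendingDeletes is made up, and the buffered bits would still have to be merged into the segment's .del file on IndexWriter's normal flush):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Sketch: per-segment deletes kept in RAM instead of flushing a new .del
// file on every reopen.  The bits are applied at delete time ("pay as you
// go") and only written to disk on IndexWriter's normal flush schedule.
class PendingDeletes {
  private final IndexReader segmentReader;   // one open SegmentReader
  private final BitSet deleted;              // buffered, unflushed deletes

  PendingDeletes(IndexReader segmentReader) {
    this.segmentReader = segmentReader;
    this.deleted = new BitSet(segmentReader.maxDoc());
  }

  // Materialize a delete-by-Term against this segment right away.
  void deleteByTerm(Term term) throws IOException {
    TermDocs td = segmentReader.termDocs(term);
    try {
      while (td.next()) {
        deleted.set(td.doc());
      }
    } finally {
      td.close();
    }
  }

  // Consulted by searches against the reopened reader.
  boolean isDeleted(int docID) {
    return deleted.get(docID) || segmentReader.isDeleted(docID);
  }
}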

Mike
