Hi Mike,

How do column stride fields work for StringIndex field caching?  I
have been working on the tag index, which may be more suitable for
field caching and makes range queries faster.  It would be good to
integrate into core Lucene as well, and it may be a better fit for
many situations.  Perhaps column stride fields and the tag index can
be merged?  What is the progress on column stride fields?
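
For context, the way StringIndex is used today goes through the
whole-reader cache, which is part of why reopen is expensive
("category" is just an example field):

    // Built once per top-level reader, so a reopen pays the full cost:
    FieldCache.StringIndex si =
        FieldCache.DEFAULT.getStringIndex(reader, "category");
    int ord = si.order[docID];       // per-document ordinal
    String value = si.lookup[ord];   // ordinal -> term text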

> Reopen then must only "materialize" any
> buffered deletes by Term & Query, unless we decide to move up that
> materialization into the actual delete call, since we will have
> SegmentReaders open anyway.  I think I'm leaning towards that approach...
> best to pay the cost as you go, instead of aggregated cost on reopen?

I don't follow this part.  There is an IndexReader exposed from
IndexWriter, and I think the individual SegmentReaders should be
exposed as well.  I don't see any reason not to, and there are many
cases where it has been frustrating that SegmentReader is package
protected.  From what you described, I am not sure how the
deletedDocs BitVector is handled.
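
For example, here is roughly what I keep wanting to write;
getReader() and getSubReaders() are both hypothetical, since nothing
like them exists today:

    IndexReader reader = writer.getReader();          // hypothetical
    IndexReader[] segments = reader.getSubReaders();  // hypothetical
    for (int i = 0; i < segments.length; i++) {
      for (int doc = 0; doc < segments[i].maxDoc(); doc++) {
        if (segments[i].isDeleted(doc)) {
          // answered from the pending in-RAM deletes, or only from
          // the last flushed .del file?
        }
      }
    }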

On Fri, Sep 19, 2008 at 8:30 AM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
> Jason Rutherglen wrote:
>
>> Mike,
>>
>> The other issue that will come up, and that I have addressed, is
>> the field caches.  The underlying smaller IndexReaders will need to
>> be exposed because of the field caching.  Currently in ocean
>> realtime search, the individual readers are searched using a
>> MultiSearcher in order to search in parallel and reuse the field
>> caches.  How will field caching work with the IndexWriter approach?
>> It seems like it would need a dynamically growing field cache
>> array, which is a bit tricky.  By doing in-memory merging in ocean,
>> the field caches last longer and do not require growing arrays.
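
To clarify what I meant there, ocean does roughly the following
(simplified; getSegmentReaders() just stands in for however the
individual readers are obtained):

    IndexReader[] readers = getSegmentReaders();
    Searchable[] searchables = new Searchable[readers.length];
    for (int i = 0; i < readers.length; i++) {
      // one IndexSearcher per reader, so FieldCache entries stay keyed
      // to that reader and are reused while other segments come and go
      searchables[i] = new IndexSearcher(readers[i]);
    }
    Searcher searcher = new ParallelMultiSearcher(searchables);
    TopDocs hits = searcher.search(
        new TermQuery(new Term("body", "lucene")), null, 10);
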
>
> First off, I think the combination of LUCENE-1231 and LUCENE-831, which
> should result in a FieldCache that is "distributed" down to each SegmentReader
> and much faster to initialize, should make incrementally updating the
> FieldCache much more efficient (ie, on calling IndexReader.reopen, it should
> only be the new segments that need to populate their FieldCache).
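
If I follow, per-segment caching just means keying the cache off each
SegmentReader instead of the top-level reader, e.g. ("price" is an
example field):

    // FieldCache is keyed by reader instance, so asking each
    // sub-reader means unchanged segments hit their existing entries
    // after a reopen and only the new segments pay the load cost:
    for (int i = 0; i < segments.length; i++) {
      int[] values = FieldCache.DEFAULT.getInts(segments[i], "price");
    }
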
>
> Hopefully these land before real-time search, because then I have more API
> flexibility to expose column-stride fields on the in-RAM documents.  There
> is still some trickiness, because an "ordinary" IndexWriter would never hold
> the column-stride fields in RAM.  They'd be flushed to the Directory,
> immediately per document, just like stored fields and term vectors are
> today.  So, maybe, the first RAMReader you get from the IndexWriter would
> load back in these fields, triggering IndexWriter to add to them as
> documents are added (maybe using exponentially growing arrays as the
> underlying store, or, perhaps separate array fragments, to prevent
> synchronization when reading from them), such that subsequent reopens simply
> resync their max docID.
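
The array-fragments idea sounds right to me.  A minimal sketch of
what I picture, assuming reopen publishes maxDoc with a safe memory
barrier so readers never look past it:

    class PagedIntValues {
      private static final int PAGE_SIZE = 1024;
      private int[][] pages = new int[8][];
      private int count;

      // writer side (IndexWriter thread)
      synchronized void add(int value) {
        int page = count / PAGE_SIZE;
        if (page == pages.length) {
          int[][] grown = new int[pages.length * 2][];
          System.arraycopy(pages, 0, grown, 0, pages.length);
          pages = grown;               // existing pages never move
        }
        if (pages[page] == null) {
          pages[page] = new int[PAGE_SIZE];
        }
        pages[page][count % PAGE_SIZE] = value;
        count++;
      }

      // reader side: no locking, as long as the caller keeps docID
      // below the maxDoc it saw at reopen
      int get(int docID) {
        return pages[docID / PAGE_SIZE][docID % PAGE_SIZE];
      }
    }
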
>
>> How do you plan to handle rapidly deleting the docs of the disk
>> segments?  Can the SegmentReader clone patch be used for this?
>
> I was thinking we'd flush new .del files every time a reopen is called, but
> that could very well be costly.  Instead, we can keep the deletes pending in
> the SegmentReaders we're holding open, and then go back to flushing on
> IndexWriter's normal schedule.  Reopen then must only "materialize" any
> buffered deletes by Term & Query, unless we decide to move up that
> materialization into the actual delete call, since we will have
> SegmentReaders open anyway.  I think I'm leaning towards that approach...
> best to pay the cost as you go, instead of aggregated cost on reopen?
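
Paying as you go makes sense to me.  A sketch of materializing at
delete-call time, with java.util.BitSet standing in for each
SegmentReader's pending in-RAM deletedDocs (the .del files would
still flush on the normal schedule):

    void deleteDocuments(Term term, IndexReader[] segments,
                         BitSet[] pendingDeletes) throws IOException {
      for (int i = 0; i < segments.length; i++) {
        TermDocs td = segments[i].termDocs(term);
        try {
          while (td.next()) {
            pendingDeletes[i].set(td.doc());  // in RAM only
          }
        } finally {
          td.close();
        }
      }
    }
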
>
> Mike
>
