On Thu, Apr 2, 2009 at 4:43 PM, Jason Rutherglen
<jason.rutherg...@gmail.com> wrote:
>> What does Bobo use the cached bitsets for?
>
> Bobo is a faceting engine that uses custom field caches and sometimes cached
> bitsets, rather than relying exclusively on bitsets to calculate facets.  It
> is useful where many facets (50+) need to be calculated and bitset caching,
> loading and intersection would be too costly.  Instead it iterates over
> in-memory custom field caches while hit collecting.  Because we're also doing
> realtime search, making the loading more efficient via in-memory field
> cache merging is interesting.

OK.
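If I'm following, instead of loading and intersecting one cached bitset per
facet value, you just walk an in-memory field-cache array as hits come in --
something like this rough sketch (not Bobo's actual code; the field name is
invented and I'm just using the stock HitCollector/FieldCache APIs to
illustrate):

    // Rough sketch (not Bobo's actual code): count facet values by walking an
    // in-memory field cache during hit collection, instead of loading and
    // intersecting one cached bitset per facet value.
    final FieldCache.StringIndex idx =
        FieldCache.DEFAULT.getStringIndex(reader, "category");  // field name invented
    final int[] counts = new int[idx.lookup.length];
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        counts[idx.order[doc]]++;   // one array lookup per hit
      }
    });
    // counts[i] is now the hit count for facet value idx.lookup[i]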

Does it operate at the segment level?  Seems like that'd give you good
enough realtime performance (though merging in RAM will definitely be
faster).

> True, we do the in-memory merging with deleted docs; norms would be good as
> well.

Yes, and eventually column stride fields.

> As a first step, how should we expose the segments a segment
> originated from?

I'm not sure; it's quite messy.  Each segment must track which segment
it was merged into, and must hold a copy of its deletes as of
the time it was merged.  And each segment must know which other
segments it was merged with.
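Just to make the bookkeeping concrete, each source segment would have to carry
something like this (purely hypothetical, nothing like it exists in
IndexWriter today):

    // Purely hypothetical per-segment bookkeeping, just to show what "track
    // where my docs went" implies -- no such structure exists in Lucene today.
    class MergeProvenance {
      String mergedIntoSegment;      // name of the segment this one was merged into
      List<String> mergedWith;       // the other segments in that same merge
      BitSet deletesAtMergeTime;     // copy of this segment's deletes when it was merged
      int newDocBase;                // (probably) where its docs start in the merged segment
    }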

Is this really a serious problem in your realtime search?  Eg, from
John's numbers using payloads to read in the docID -> UID mapping,
it seems like you could make a Query that, when given a reader, would
go and do "Approach 2" to perform the deletes (if indeed you need to
delete thousands of docs with each update).  What sort of docs/sec
rates do you need to handle?
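Concretely I'm picturing something like this (rough sketch; the field/term
names are invented, the toLong() helper is left out, and John's actual code
surely differs) -- load the per-segment docID -> UID array from a single
posting's payloads:

    // Rough sketch, invented field/term names: every doc indexes one posting
    // for Term("_ID", "_UID") whose payload is the app-level UID, so reading
    // the payloads in docID order yields the docID -> UID map for a segment.
    long[] uidByDoc = new long[reader.maxDoc()];
    TermPositions tp = reader.termPositions(new Term("_ID", "_UID"));
    byte[] buf = new byte[8];
    while (tp.next()) {
      tp.nextPosition();                       // must advance before reading the payload
      byte[] payload = tp.getPayload(buf, 0);
      uidByDoc[tp.doc()] = toLong(payload);    // toLong() left out for brevity
    }
    tp.close();

Once you have that array, marking the docs whose UID is in the buffered batch
of deletes is just a scan (see the sketch further down).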

> I would like to get this implemented for 2.9 as a building
> block that perhaps we can write other things on.

I think that's optimistic.  It's still at the
hairy-can't-see-a-clean-way-to-do-it phase.  Plus I'd like to
understand that all other options have been exhausted first.

Especially once we have column stride fields and they are merged in
RAM, you'll be handed a reader pre-warmed and you can then jump
through those arrays to find docs to delete.
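Ie the delete step becomes a plain array scan, something like this (sketch,
assuming the per-doc UIDs are exposed as a long[] per segment and your
buffered deletes are a Set of UIDs):

    // Sketch: with per-doc UIDs already in RAM, applying a batch of deletes is
    // a straight scan over the array -- no per-UID TermDocs lookups.
    void applyDeletes(IndexReader reader, long[] uidByDoc, Set<Long> deletedUIDs)
        throws IOException {
      for (int doc = 0; doc < uidByDoc.length; doc++) {
        if (!reader.isDeleted(doc) && deletedUIDs.contains(uidByDoc[doc])) {
          reader.deleteDocument(doc);
        }
      }
    }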

> Column stride fields still
> require some encoding, and merging field caches in RAM would presumably be
> faster?

Yes, potentially much faster.  There's no sense in writing through to
disk until commit is called.

>> Ie we only have to renumber from gen X to X+1, then from X+1 to X+2 (where
>> each "generation" is a renumbering event).
>
> Couldn't each SegmentReader keep a docMap and the names of the segments it
> originated from?  However, the name alone is not a unique key, since the
> deleted docs can change.  It seems like we need a unique id for each
> SegmentReader, where the id is also assigned to cloned readers (which normally
> have the same segment name as the original SR).  The id could be a stamp
> (perhaps only given to readonly readers?).  That way the
> SegmentReader.getMergedFrom method would not need to return the actual
> readers, but a docMap and the parent readers' ids?  It would be assumed the
> user is holding the readers somewhere?  Perhaps all this could be
> achieved with a callback in IW, and all this logic could be kept somewhat
> internal to Lucene?

The docMap is a costly way to store it, since it consumes 32 bits per
doc (vs 1 bit per doc for a copy of the deleted docs).

But then the docMap gives you random access on the mapping.
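Roughly, the per-segment docMap amounts to this (sketch): ~4 bytes/doc vs 1
bit/doc for a copy of the deletes, but O(1) old -> new lookup:

    // Sketch of what the per-segment docMap amounts to: old docID -> new docID
    // once deleted docs are squeezed out, -1 for the deleted ones.
    int[] buildDocMap(BitSet deletes, int maxDoc) {
      int[] docMap = new int[maxDoc];
      int newDoc = 0;
      for (int oldDoc = 0; oldDoc < maxDoc; oldDoc++) {
        docMap[oldDoc] = deletes.get(oldDoc) ? -1 : newDoc++;
      }
      return docMap;
    }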

What if, prior to merging or committing merged deletes, there were a
callback to force the app to materialize any privately buffered
deletes?  And after that the app is not allowed to use those readers for
further deletes?  Still kinda messy.
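Something like this, purely hypothetical (IndexWriter has no such hook today),
just to make the idea concrete:

    // Purely hypothetical hook -- IndexWriter has no such callback today.
    // Before merged deletes are committed, the app gets one last chance to
    // flush deletes it has buffered against the readers being replaced; after
    // this it must not delete through those readers again.
    public interface MaterializeDeletesCallback {
      void materializeBufferedDeletes(IndexReader[] readersBeingMerged)
          throws IOException;
    }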

I think I need to understand better why delete by Query isn't viable
in your situation...

Mike

