Re: Flex & Docs/AndPositionsEnum

Michael McCandless Wed, 10 Feb 2010 03:58:33 -0800

On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey <mar...@rectangular.com> wrote:


>> Interesting... and segment merging just does its own private
>> concatenation/mapping-around-deletes of the doc/positions?
>
> I think the answer is yes, but I'm not sure I understand the
> question completely since I'm not sure why you'd ask that in this
> context.

Segment merging is one place that "legitimately" needs to append
docs/positions enum of multiple sub readers... but obviously it can
just do this itself (and it must, since it renumbers the docIDs).

>> what's a "flat positions space"?
>
> It's something Google once used.  Instead of positions starting with
> 0 at each document, they just keep going.
>
>  doc 1:  "Three Blind Mice"           - positions 0, 1, 2
>  doc 2:  "Peter Peter Pumpkin Eater"  - positions 3, 4, 5, 6
>
>> And we don't return "objects or aggregates" with Multi*Enum now...
>
> Yeah, this is different.  In KS right now, we use a generic
> PostingList, which conveys different information depending on what
> class of Posting it contains.

OK

>> In flex right now the codec is unaware that it's being "consumed" by
>> a Multi*Enum.
>
> Right, but in KinoSearch's case PostingList had to be aware of that
> because the Posting object could be consumed at either the segment
> level or the index level -- so it needed a setDocBase(offset) method
> which adjusted the doc num in the Posting.  It was messy.
>
> The change I made was to eliminate PolyPostingList and
> PolyPostingListReader, which made it possible to remove the
> setDocBase() method from SegPostingList.

But why didn't you have the Multi*Enums layer add the offset (so that
the codec need not know who's consuming it)?  Performance?
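
To make the question concrete, here is a minimal sketch of a multi layer
that adds the doc base itself, so the per-segment enum only ever deals in
segment-local doc IDs. Every name in it (SegmentDocsEnum,
ArraySegmentDocsEnum, MultiDocsEnumSketch, MultiEnumDemo) is hypothetical
and made up for illustration; it is not the actual flex API, just the
shape of the idea.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical per-segment enum: yields segment-local doc IDs only. */
interface SegmentDocsEnum {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int nextDoc();
}

/** Hypothetical codec-side enum backed by a sorted array of local doc IDs. */
final class ArraySegmentDocsEnum implements SegmentDocsEnum {
    private final int[] docs;
    private int upto = 0;

    ArraySegmentDocsEnum(int[] docs) { this.docs = docs; }

    @Override
    public int nextDoc() {
        return upto < docs.length ? docs[upto++] : NO_MORE_DOCS;
    }
}

/** Hypothetical multi layer: concatenates sub-enums and adds each doc base. */
final class MultiDocsEnumSketch implements SegmentDocsEnum {
    private final SegmentDocsEnum[] subs;
    private final int[] docBases;
    private int current = 0;

    MultiDocsEnumSketch(SegmentDocsEnum[] subs, int[] maxDocs) {
        this.subs = subs;
        this.docBases = new int[subs.length];
        int base = 0;
        for (int i = 0; i < subs.length; i++) {
            docBases[i] = base;   // sum of maxDoc over all earlier segments
            base += maxDocs[i];
        }
    }

    @Override
    public int nextDoc() {
        while (current < subs.length) {
            int doc = subs[current].nextDoc();
            if (doc != NO_MORE_DOCS) {
                return docBases[current] + doc;  // offset applied here, not in the codec
            }
            current++;  // this segment is exhausted, move to the next one
        }
        return NO_MORE_DOCS;
    }
}

final class MultiEnumDemo {
    /** Two made-up segments: maxDoc 5 with docs {0, 3}, maxDoc 4 with docs {1, 2}. */
    static List<Integer> collect() {
        SegmentDocsEnum[] subs = {
            new ArraySegmentDocsEnum(new int[] {0, 3}),
            new ArraySegmentDocsEnum(new int[] {1, 2})
        };
        SegmentDocsEnum multi = new MultiDocsEnumSketch(subs, new int[] {5, 4});
        List<Integer> out = new ArrayList<>();
        for (int d = multi.nextDoc(); d != SegmentDocsEnum.NO_MORE_DOCS; d = multi.nextDoc()) {
            out.add(d);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect());  // prints [0, 3, 6, 7]
    }
}
```

The cost is one extra add per doc in the wrapper; ArraySegmentDocsEnum
never sees a doc base and needs no setDocBase()-style method.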

>> It still returns primitives.  If instead we returned an int[] for
>> positions (hmm -- may be a good reason to make positions be an
>> Attribute, Uwe), I think it would still be OK?
>
> In the flat positions space example, it would be necessary to add an
> offset to each of the positions in that array.  Each segment would
> have a "positions max" analogous to maxDoc(); these would be summed
> to obtain the positions offset the same way we add up maxDoc() now
> to obtain the doc id offset.

OK, but [so far] we don't have that problem with the flex APIs -- the
codec is not aware that there's a multi enum layer consuming it.
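
For reference, the offset arithmetic Marvin describes can be sketched as
below. This assumes each segment exposes a "positions max" that, like
maxDoc(), is one greater than the largest position used in the segment;
the class and method names (FlatPositionsSketch, positionBases,
toIndexLevel) are hypothetical and exist in neither Lucene nor KS/Lucy.

```java
import java.util.Arrays;

final class FlatPositionsSketch {
    /** Running sum of per-segment "positions max", analogous to summing maxDoc(). */
    static int[] positionBases(int[] positionsMax) {
        int[] bases = new int[positionsMax.length];
        int base = 0;
        for (int i = 0; i < positionsMax.length; i++) {
            bases[i] = base;
            base += positionsMax[i];
        }
        return bases;
    }

    /** Shift segment-local flat positions into the index-level position space. */
    static int[] toIndexLevel(int[] segmentPositions, int base) {
        int[] shifted = new int[segmentPositions.length];
        for (int i = 0; i < segmentPositions.length; i++) {
            shifted[i] = segmentPositions[i] + base;
        }
        return shifted;
    }

    public static void main(String[] args) {
        // Assumed layout: the first segment holds the two docs from the example
        // (positions 0..6), so its "positions max" is 7. The other values are made up.
        int[] bases = positionBases(new int[] {7, 4, 2});
        System.out.println(Arrays.toString(bases));                                  // [0, 7, 11]
        System.out.println(Arrays.toString(toIndexLevel(new int[] {0, 2}, bases[1]))); // [7, 9]
    }
}
```

If positions came back as an int[], whichever layer aggregates the
segments would have to run something like toIndexLevel() over every
array it hands out.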

> That example may not be a deal breaker for you, but I'm not willing
> to guarantee that Lucy will always return primitives from these
> enums, now and forever, one per method call.

But it'd be a major API change down the road to change this, for
Lucy/KS?  I.e., this example seems not to apply to Lucene, and even for
KS/Lucy seems contrived -- neither Lucene nor KS/Lucy would/could up
and make such a major API change to the enums, once "committed".

Also, this is why we're adding Attribute* to all the postings enums,
with flex -- any codec & consumer can use their own private
attributes.  The attrs pass through Multi*Enum.

>> Still torn... I think it's convenience vs performance.
>
> But convenience for the posting format plugin developer matters too,
> right?

Right, but the existence of Multi*Enums isn't affecting the codec dev
(so far, I think).

> Are you confident that a generic aggregator can support all possible
> codecs, or will plugin developers be forced to ensure that
> aggregation works because you've guaranteed to users like Renaud
> that it will?

Well... pretty confident.  So far, at least?  We have an existence
proof :) The codec API really should not (and should not have to)
bake in details of who's consuming it.

Mike
