On Tue, Feb 9, 2010 at 4:44 PM, Marvin Humphrey <mar...@rectangular.com> wrote:
>> Interesting... and segment merging just does its own private
>> concatenation/mapping-around-deletes of the doc/positions?
>
> I think the answer is yes, but I'm not sure I understand the
> question completely since I'm not sure why you'd ask that in this
> context.

Segment merging is one place that "legitimately" needs to append the
docs/positions enums of multiple sub readers... but obviously it can
just do this itself (and it must, since it renumbers the docIDs).

>> what's a "flat positions space"?
>
> It's something Google once used. Instead of positions starting with
> 0 at each document, they just keep going.
>
> doc 1: "Three Blind Mice" - positions 0, 1, 2
> doc 2: "Peter Peter Pumpkin Eater" - positions 3, 4, 5, 6
>
>> And we don't return "objects or aggregates" with Multi*Enum now...
>
> Yeah, this is different. In KS right now, we use a generic
> PostingList, which conveys different information depending on what
> class of Posting it contains.

OK

>> In flex right now the codec is unaware that it's being "consumed" by
>> a Multi*Enum.
>
> Right, but in KinoSearch's case PostingList had to be aware of that
> because the Posting object could be consumed at either the segment
> level or the index level -- so it needed a setDocBase(offset) method
> which adjusted the doc num in the Posting. It was messy.
>
> The change I made was to eliminate PolyPostingList and
> PolyPostingListReader, which made it possible to remove the
> setDocBase() method from SegPostingList.

But why didn't you have the Multi*Enums layer add the offset (so that
the codec need not know who's consuming it)? Performance?

>> It still returns primitives. If instead we returned an int[] for
>> positions (hmm -- may be a good reason to make positions be an
>> Attribute, Uwe), I think it would still be OK?
>
> In the flat positions space example, it would be necessary to add an
> offset to each of the positions in that array. Each segment would
> have a "positions max" analogous to maxDoc(); these would be summed
> to obtain the positions offset the same way we add up maxDoc() now
> to obtain the doc id offset.

OK, but [so far] we don't have that problem with the flex APIs -- the
codec is not aware that there's a multi enum layer consuming it.

> That example may not be a deal breaker for you, but I'm not willing
> to guarantee that Lucy will always return primitives from these
> enums, now and forever, one per method call.

But wouldn't it be a major API change down the road to change this,
for Lucy/KS? I.e., this example seems not to apply to Lucene, and even
for KS/Lucy it seems contrived -- neither Lucene nor KS/Lucy would/could
up and make such a major API change to the enums, once "committed".

Also, this is why we're adding Attribute* to all the postings enums,
with flex -- any codec & consumer can use their own private
attributes. The attrs pass through Multi*Enum.

>> Still torn... I think it's convenience vs performance.
>
> But convenience for the posting format plugin developer matters too,
> right?

Right, but the existence of Multi*Enums isn't affecting the codec dev
(so far, I think).

> Are you confident that a generic aggregator can support all possible
> codecs, or will plugin developers be forced to ensure that
> aggregation works because you've guaranteed to users like Renaud
> that it will?

Well... pretty confident. So far, at least? We have an existence
proof :)

The codec API really should not (and should not have to) bake in
details of who's consuming it.

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
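
For illustration only, here is a minimal sketch of the layering argued
for above: the per-segment enum returns segment-local doc IDs and knows
nothing about its consumer, while a wrapper sums each segment's maxDoc()
to get a doc base and adds it on the way out. All names below
(MultiDocsSketch, SegmentDocs, ArraySegmentDocs, MultiDocs, collect) are
hypothetical and are not the flex or KinoSearch APIs; deletes and
positions are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these types are real Lucene or KinoSearch classes. */
final class MultiDocsSketch {

    /** Stand-in for a per-segment docs enum: yields segment-local doc IDs. */
    interface SegmentDocs {
        int NO_MORE_DOCS = Integer.MAX_VALUE; // sentinel chosen for this sketch

        int nextDoc(); // next segment-local doc ID, or NO_MORE_DOCS

        int maxDoc();  // size of this segment's doc ID space
    }

    /** Array-backed segment; it has no idea whether a wrapper is consuming it. */
    static final class ArraySegmentDocs implements SegmentDocs {
        private final int maxDoc;
        private final int[] docs;
        private int upto = 0;

        ArraySegmentDocs(int maxDoc, int... docs) {
            this.maxDoc = maxDoc;
            this.docs = docs;
        }

        public int nextDoc() {
            return upto < docs.length ? docs[upto++] : NO_MORE_DOCS;
        }

        public int maxDoc() {
            return maxDoc;
        }
    }

    /** Concatenates segments; the doc base is added here, not in the segment enum. */
    static final class MultiDocs {
        private final SegmentDocs[] subs;
        private final int[] docBase;
        private int current = 0;

        MultiDocs(SegmentDocs... subs) {
            this.subs = subs;
            this.docBase = new int[subs.length];
            int base = 0;
            for (int i = 0; i < subs.length; i++) {
                docBase[i] = base;        // sum of maxDoc() over earlier segments
                base += subs[i].maxDoc();
            }
        }

        int nextDoc() {
            while (current < subs.length) {
                int doc = subs[current].nextDoc();
                if (doc != SegmentDocs.NO_MORE_DOCS) {
                    return docBase[current] + doc; // rebase to index-wide doc ID
                }
                current++; // this segment is exhausted, move to the next one
            }
            return SegmentDocs.NO_MORE_DOCS;
        }
    }

    /** Drains a MultiDocs into a list of index-wide doc IDs. */
    static List<Integer> collect(MultiDocs multi) {
        List<Integer> out = new ArrayList<>();
        for (int doc = multi.nextDoc(); doc != SegmentDocs.NO_MORE_DOCS; doc = multi.nextDoc()) {
            out.add(doc);
        }
        return out;
    }

    public static void main(String[] args) {
        MultiDocs multi = new MultiDocs(
            new ArraySegmentDocs(3, 0, 2),   // segment 0: maxDoc 3, docs 0 and 2
            new ArraySegmentDocs(4, 1, 3));  // segment 1: maxDoc 4, docs 1 and 3
        System.out.println(collect(multi));  // prints [0, 2, 4, 6]
    }
}
```

A flat positions space would presumably need the same treatment with a
per-segment "positions max" in place of maxDoc(), which is the case
where returning an int[] instead of one primitive per call makes the
wrapper's job heavier.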