Mike,
I had another look at SegmentTermDocs.skipTo() and at
SegmentTermPositions, and I think I'm beginning to get
your point.
Could the bulk copy still be done per block of skipInterval docs?
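
To make the question concrete, here is a toy sketch in plain Java. It is
not Lucene code and has not been tried against the real classes. It
assumes there are no deleted docs, that the in & out formats are congruent
and use the same skipInterval, and that the input segment's skip entries
can supply the prox offset at every skipInterval-th doc. The names
ProxBlockCopySketch, copyTermProx and inputSkipOffsets are made up for
illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Toy model only: none of these names exist in Lucene. */
final class ProxBlockCopySketch {

    /**
     * Copies the prox bytes of one term in blocks of skipInterval docs.
     *
     * @param inputProx        all prox bytes of the input segment
     * @param termStart        input offset where this term's prox data begins
     * @param termEnd          input offset where the next term's prox data begins
     * @param inputSkipOffsets input prox offsets at each skip point of this term
     *                         (assumed to be available from the input skip data)
     * @param proxOutput       the merged prox stream being written
     * @return the proxOutput offsets seen at each skip point, i.e. the values
     *         a skip list writer would need to record
     */
    static List<Long> copyTermProx(byte[] inputProx, long termStart, long termEnd,
                                   long[] inputSkipOffsets,
                                   ByteArrayOutputStream proxOutput) {
        List<Long> outputSkipOffsets = new ArrayList<>();
        long from = termStart;
        for (long skipOffset : inputSkipOffsets) {
            // Bulk-copy the positions of one block of skipInterval docs.
            proxOutput.write(inputProx, (int) from, (int) (skipOffset - from));
            // The output offset at the block boundary is now known.
            outputSkipOffsets.add((long) proxOutput.size());
            from = skipOffset;
        }
        // Copy the tail: the docs after the last skip point.
        proxOutput.write(inputProx, (int) from, (int) (termEnd - from));
        return outputSkipOffsets;
    }
}
```

If those assumptions held, the proxOutput offset would be known after each
block copy, which is the value that is now recorded every skipInterval docs.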
Regards,
Paul Elschot
On Monday 22 September 2008 19:24:38, Michael McCandless wrote:
> OK, on closer inspection, I don't think this optimization will work,
> unless I'm missing something... But it was a good idea, so keep 'em
> coming!
>
> The TermInfo only stores proxPointer for each term, not per document
> in the postings. This means the optimization could only apply if
> there are no deleted docs in the posting, and the in & out formats
> are congruent. Then we would move writing to proxOutput out of the
> while loop in appendPostings to do a bulk copy of all bytes in the
> proxStream for that one term & segment.
>
> But, there's a problem with that: we can't compute the skip pointer
> as we write. The DefaultSkipListWriter looks at the proxOutput
> pointer every skipInterval docs written and records the offset. If
> we bulk-copy the prox bytes at the end we have no idea what the
> offset is every skipInterval docs.
>
> Mike
>
> Paul Elschot wrote:
> > On Friday 19 September 2008 17:05:29, Michael McCandless wrote:
> >> Not quite, because how positions are encoded depends on whether
> >> any payload appeared in that segment.
> >>
> >> However, if 1) the input is a SegmentReader (since in general we
> >> can merge any IndexReader), and 2) its format is "congruent" with
> >> the format we are writing (i.e. both do or both don't use the payloads
> >> format), which ought to be true the vast majority of the time,
> >> then I think we could simply copy bytes. Since the next TermInfo
> >> tells us the proxPointer where it begins, we know exactly how many
> >> bytes to copy. I think this'd be a nice optimization!
> >
> > I tried to find a way to do this, but I'm stuck at the point where
> > the proxPointer is needed from a TermInfo.
> > I got this far (uncompiled code, smi is the SegmentMergeInfo
> > that is currently being merged):
> >
> > if (smi.reader instanceof SegmentReader) {
> >   // the instanceof check above makes this cast safe:
> >   SegmentReader inputReader = (SegmentReader) smi.reader;
> >   boolean readerStorePayloads =
> >     inputReader.fieldInfos.fieldInfo(smi.term.field).storePayloads;
> >   if (storePayloads == readerStorePayloads) {
> >     // take the difference of the two prox pointers:
> >     long positionsLength = inputReader.tis. ... - ...;
> >     // do a direct byte copy from inputReader to proxOutput:
> >     ... ;
> >   }
> > }
> >
> > but I could not find out how to get from the TermInfosReader
> > at inputReader.tis to the next prox pointer.
> >
> > SegmentMerger never needs to index the positions by using a
> > proxPointer itself, as it accesses all positions serially. This
> > leaves me without an example of how to use proxPointer from a
> > TermInfo.
> >
> > Any tips on how to continue?
> >
> > Regards,
> > Paul Elschot
> >
> >> Mike
> >>
> >> Paul Elschot wrote:
> >>> I'm looking at the for loop in SegmentMerger.java at line 666,
> >>> which completely interprets the input positions/payloads for
> >>> an input term at a document.
> >>>
> >>> The positions/payloads don't change when they are merged, is that
> >>> correct? I'm wondering whether this loop could be replaced by a
> >>> direct copy from the input postings to proxOutput.
> >>>
> >>> Regards,
> >>> Paul Elschot