I guess this also ties in with 'getPositionIncrementGap', which is relevant
to fields with multiple occurrences.
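
To make the connection concrete, here is a toy model in plain Java (not Lucene code) of what I mean: each occurrence is tokenized on its own, and a gap is added to the position counter between occurrences. The names PositionGapSketch, Token and invert are made up for illustration, the whitespace tokenizer is a stand-in for a real analyzer, and applying the gap only once some token has been emitted is my assumption about the behavior, not something taken from the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: none of these names come from Lucene.
final class PositionGapSketch {

    /** One inverted token: its term, absolute position, and source occurrence. */
    static final class Token {
        final String term;
        final int position;
        final int occurrence;

        Token(String term, int position, int occurrence) {
            this.term = term;
            this.position = position;
            this.occurrence = occurrence;
        }

        @Override
        public String toString() {
            return term + "@" + position + " (occurrence " + occurrence + ")";
        }
    }

    /**
     * Tokenizes each occurrence separately (whitespace split, an assumption)
     * and assigns positions, adding positionIncrementGap between occurrences.
     */
    static List<Token> invert(List<String> occurrences, int positionIncrementGap) {
        List<Token> tokens = new ArrayList<>();
        int lastPosition = -1; // so the first token lands at position 0
        for (int occ = 0; occ < occurrences.size(); occ++) {
            // Assumed rule: the gap applies only once some token exists.
            if (!tokens.isEmpty()) {
                lastPosition += positionIncrementGap;
            }
            String value = occurrences.get(occ).trim();
            if (value.isEmpty()) {
                continue; // nothing to tokenize in this occurrence
            }
            for (String term : value.split("\\s+")) {
                lastPosition++;
                // The occurrence index is the kind of per-occurrence fact
                // a custom filter could encode into a payload.
                tokens.add(new Token(term, lastPosition, occ));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : invert(Arrays.asList("quick fox", "lazy dog"), 100)) {
            System.out.println(t);
        }
    }
}
```

In this model, two occurrences "quick fox" and "lazy dog" with a gap of 100 land at positions 0, 1, 102 and 103. If the occurrences were concatenated before inversion, the occurrence index would no longer be available to the filter, which is the concern in my original question.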

Peter

On 7/27/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> I have a question about the way fields are analyzed and inverted by the
> index writer. Currently, if a field has multiple occurrences in a document,
> each occurrence is analyzed separately (see DocumentsWriter.processField).
> Is it safe to assume that this behavior won't change in the future? The
> reason I ask is that my custom analyzer's 'tokenStream' method creates a
> custom filter which produces a payload based on the existence of each field
> occurrence. However, if DocumentsWriter were changed to combine all the
> occurrences before inversion, my scheme wouldn't work. Since payloads are
> created by filters/tokenizers, it helps to keep things flexible.
>
> Thanks,
> Peter
>
>
> On 7/12/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> >
> > On Jul 12, 2007, at 6:12 PM, Chris Hostetter wrote:
> >
> >
> > >
> > > Hmm... okay, so the issue is that in order to get the payload data, you
> > > have to have a TermPositions instance.
> > >
> > > Instead of adding getPayload methods to the Spans class (which, as Paul
> > > points out, can have nesting issues), perhaps more general solutions
> > > would be:
> > >
> > > a) a higher-level getPayload API that lets you get a payload
> > > arbitrarily for a doc/position (perhaps as part of the TermDocs
> > > API?) ... then for Spans you could use this new API with Spans.start()
> > > and Spans.end() (and all the positions in between)
> >
> > Not sure I follow this.  I don't see the fit w/ TermDocs.
> > >
> > > b) add a variation of the TermPositions class to allow people to
> > > iterate through the terms of a TermDoc in position order (TermPosition
> > > first iterates over the Terms and then over the positions) ... then
> > > you could seek(span.start()) to get the Payload data
> > >
> > > c) add methods to the Spans API to get the subspans (if any) ... this
> > > would be the Spans corollary to getTerms() and would always return
> > > TermSpans which would have TermPositions for getting payload data.
> >
> >
> > This could be a good alternative.
> >
> > When we first talked about payloads we wondered if we could just make
> > all Queries into SpanQueries by passing TermPositions instead of term
> > docs, but in the end decided not to do it because of performance
> > issues (some of which are lessened by lazy loading of TermPositions).
> >
> > The thing is, I think, that the Spans is already moving you along in
> > the term positions, so it just seems like a natural fit to have it
> > there, even if there is nesting.  It doesn't seem like it would be
> > that hard to then return back the nesting stuff b/c you are just
> > collating the results from the underlying SpanTermQuery.  Having said
> > that, I haven't looked into the actual code, so take that w/ a grain
> > of salt.
> >
> > I will try to do some more investigation, as others are welcome to
> > do.  Perhaps we should move this to dev?
> >
> > Cheers,
> > Grant
> >
> >
>
