[ https://issues.apache.org/jira/browse/LUCENE-4312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880635#comment-16880635 ]
Michael Gibney commented on LUCENE-4312:
----------------------------------------

True, both good points. But it's kind of a chicken-or-egg situation ... there would have been no point to address these implied challenges, so long as position length has not been recorded in the index (and is thus not available at query time). That doesn't mean there _aren't_ ways to address the challenges.

Regarding the "A B C" example, I addressed this in the LUCENE-7398 work by indexing next start position as a lookahead. As a proof of concept this was done with Payloads, but in principle I could see slight modifications (somewhere at the intersection of codecs and postings API) that would natively read next start position "early" and expose it as a lookahead. This would avoid the type of problematic call to {{PostingsEnum.nextPosition()}} that would (as you correctly point out) result in the need to buffer all information associated with _every_ position. I've described this approach in more detail [here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#index-lookahead-don-t-buffer-positions-if-you-don-t-have-to].

{quote}we can't advance positions on terms in the order we want anymore.
{quote}

Yes, I'd argue that's the toughest challenge. I addressed it indirectly by constructing CommonGrams-style shingles used specifically for pre-filtering conjunctions in the "approximation" phase of two-phase iteration (ensuring that common terms at subclause index 0 don't kill performance). This is described in more detail [here|https://michaelgibney.net/2018/09/lucene-graph-queries-2/#shingle-based-pre-filtering-of-conjunctionspans].

I'm not intending this to be about these particular solutions, and you might take issue with the solutions themselves. The more general point I guess is that indexed position length is fundamental, and is a prerequisite for the development of ways to address these challenges.
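
To make the "position length is fundamental" point concrete, here is a minimal toy sketch in plain Java. It is not Lucene code and not the LUCENE-7398 implementation: the {{Token}} class, {{matchesPhrase}}, and the sample "fast wi fi network" graph (with "wifi" as a two-position synonym for "wi fi") are all hypothetical names chosen for illustration. It only shows how a phrase check behaves when each posting's end is computed from a stored position length, versus when the length is assumed to be 1, as it effectively is when the indexer ignores PositionLengthAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Lucene classes.
class PositionLengthSketch {

    /** Hypothetical stand-in for one indexed posting of a term. */
    static final class Token {
        final String term;
        final int position;        // start position in the token graph
        final int positionLength;  // number of positions the token spans

        Token(String term, int position, int positionLength) {
            this.term = term;
            this.position = position;
            this.positionLength = positionLength;
        }
    }

    /** End position of a token; without a stored length every token is assumed to span 1. */
    private static int end(Token t, boolean useLength) {
        return t.position + (useLength ? t.positionLength : 1);
    }

    /** True if the phrase terms can be chained so each term starts where the previous one ends. */
    static boolean matchesPhrase(List<Token> tokens, List<String> phrase, boolean useLength) {
        Map<String, List<Token>> byTerm = new HashMap<>();
        for (Token t : tokens) {
            byTerm.computeIfAbsent(t.term, k -> new ArrayList<>()).add(t);
        }
        for (Token first : byTerm.getOrDefault(phrase.get(0), Collections.<Token>emptyList())) {
            if (matchesFrom(byTerm, phrase, 1, end(first, useLength), useLength)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesFrom(Map<String, List<Token>> byTerm, List<String> phrase,
                                       int index, int expectedStart, boolean useLength) {
        if (index == phrase.size()) {
            return true;
        }
        for (Token t : byTerm.getOrDefault(phrase.get(index), Collections.<Token>emptyList())) {
            if (t.position == expectedStart
                    && matchesFrom(byTerm, phrase, index + 1, end(t, useLength), useLength)) {
                return true;
            }
        }
        return false;
    }

    /** "fast wi fi network", with "wifi" added as a synonym spanning positions 1..3. */
    static List<Token> sample() {
        return Arrays.asList(
                new Token("fast", 0, 1),
                new Token("wifi", 1, 2),
                new Token("wi", 1, 1),
                new Token("fi", 2, 1),
                new Token("network", 3, 1));
    }

    public static void main(String[] args) {
        List<Token> graph = sample();
        List<String> good = Arrays.asList("wifi", "network");
        List<String> bad = Arrays.asList("wifi", "fi");
        System.out.println("with length,    'wifi network': " + matchesPhrase(graph, good, true));
        System.out.println("without length, 'wifi network': " + matchesPhrase(graph, good, false));
        System.out.println("with length,    'wifi fi':      " + matchesPhrase(graph, bad, true));
        System.out.println("without length, 'wifi fi':      " + matchesPhrase(graph, bad, false));
    }
}
```

In this toy graph, dropping the length both misses the legitimate phrase "wifi network" and accepts the spurious phrase "wifi fi". The sketch buffers every posting in memory, which is exactly what a real implementation would want to avoid; it says nothing about how a codec or the postings API should expose the value.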
> Index format to store position length per position
> --------------------------------------------------
>
> Key: LUCENE-4312
> URL: https://issues.apache.org/jira/browse/LUCENE-4312
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Affects Versions: 6.0
> Reporter: Gang Luo
> Priority: Minor
> Labels: Suggestion
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> Mike McCandless said: TokenStreams are actually graphs.
> Indexer ignores PositionLengthAttribute. Need to change the index format (and
> Codec APIs) to store an additional int position length per position.