[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Mike Sokolov (JIRA) Mon, 11 Jul 2011 13:05:24 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063499#comment-13063499
 ]


Mike Sokolov commented on LUCENE-2878:
--------------------------------------

OK, I think I brushed past some of your comments, Simon, in my hasty response; 
sorry.  Here's a little more thought, I hope:

bq. So bottom line here is that we need an api that is capable of collecting 
fine grained parts of the scorer tree. The only way I see doing this is 1. have 
a subscribe / register method and 2. do this subscription during scorer 
creation. Once we have this, we can implement very simple collect methods that 
only collect positions for the current match, like in a near query: while the 
current matching document is collected, all contributing TermScorers have their 
positioninterval ready for collection. The collect method can then be called 
from the consumer instead of in the loop; this way we only get the positions we 
need, since we know the document we are collecting.

I *think* it's necessary to have both a callback from within the scoring loop 
and a mechanism for iterating over the current state of the iterator.  For 
boolean queries, the positions will never be iterated in the scoring loop (all 
you care about is the frequencies; positions are ignored), so some new process, 
either the position collector (a highlighter, say) or a loop in the scorer that 
knows positions are being consumed (needsPositions==true), has to cause the 
iteration to be performed.  But for position-aware queries (like phrases), the 
scorer *will* iterate over positions, and in order to score properly, I think 
the Scorer has to drive the iteration?  I tried a few different approaches to 
this before deciding to just push the iteration into the Scorer, but none of 
them really worked properly.
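
To make the "Scorer drives the iteration and calls back" idea concrete, here is 
a minimal sketch. It is not the patch's code and uses no Lucene classes: 
PositionCallback and SketchPhraseScorer are hypothetical names invented for 
illustration, and the position lists are plain sorted int arrays rather than 
real postings.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback, invented for this sketch; not a Lucene interface. */
interface PositionCallback {
    void onMatch(int doc, int start, int end);
}

/** Toy exact-phrase scorer for two adjacent terms over sorted position arrays. */
final class SketchPhraseScorer {
    private final int[] first;
    private final int[] second;
    private final PositionCallback callback;

    SketchPhraseScorer(int[] first, int[] second, PositionCallback callback) {
        this.first = first;
        this.second = second;
        this.callback = callback;
    }

    /** Counts phrase occurrences in one document, reporting each as it is found. */
    int phraseFreq(int doc) {
        int freq = 0;
        int i = 0;
        int j = 0;
        while (i < first.length && j < second.length) {
            int wanted = first[i] + 1; // second term must directly follow the first
            if (second[j] < wanted) {
                j++;
            } else if (second[j] > wanted) {
                i++;
            } else {
                freq++;
                // The scorer drives the iteration; the consumer only listens.
                callback.onMatch(doc, first[i], second[j]);
                i++;
                j++;
            }
        }
        return freq;
    }
}

final class PhraseCallbackDemo {
    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        SketchPhraseScorer scorer = new SketchPhraseScorer(
            new int[] {1, 5, 9}, new int[] {2, 7, 10},
            (doc, start, end) -> seen.add(doc + ":" + start + "-" + end));
        System.out.println(scorer.phraseFreq(42) + " " + seen);
    }
}
```

The consumer (a highlighter, say) never touches the position cursors itself, so 
it cannot disturb the state the scorer needs for scoring.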

Let's say, for example, that a document is collected.  Then the position 
consumer comes in to find out what positions were matched; it may already be 
too late, because during scoring some of the positions may have been consumed 
(e.g. to score phrases)?  It's possible I may be suffering from some delusion, 
though :)  But if I'm right, then it means there has to be some sort of 
callback mechanism in place *during scoring*, or else we have to resign 
ourselves to scoring first, and then resetting and iterating positions in a 
second pass.
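
The "too late" concern can be illustrated with a toy forward-only cursor. This 
is only a sketch under assumptions: ForwardOnlyPositions and its reset() method 
are hypothetical and stand in for whatever positions enum the scorer holds; 
nothing here is a real Lucene API.

```java
/** Hypothetical forward-only position cursor; invented for illustration. */
final class ForwardOnlyPositions {
    static final int NO_MORE = Integer.MAX_VALUE;

    private final int[] positions;
    private int upto = 0;

    ForwardOnlyPositions(int... positions) {
        this.positions = positions.clone();
    }

    /** Returns the next position, or NO_MORE once the cursor is exhausted. */
    int nextPosition() {
        return upto < positions.length ? positions[upto++] : NO_MORE;
    }

    /** Rewinds the cursor; this is the "second pass" option. */
    void reset() {
        upto = 0;
    }
}

final class LateConsumerDemo {
    /** Reads the cursor to the end and returns how many positions were seen. */
    static int drain(ForwardOnlyPositions p) {
        int count = 0;
        while (p.nextPosition() != ForwardOnlyPositions.NO_MORE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        ForwardOnlyPositions p = new ForwardOnlyPositions(2, 5, 8);
        int duringScoring = drain(p); // scoring consumes the positions
        int lateConsumer = drain(p);  // a consumer arriving afterwards sees none
        p.reset();
        int secondPass = drain(p);    // only a reset makes them visible again
        System.out.println(duringScoring + " " + lateConsumer + " " + secondPass);
    }
}
```

Either the consumer is notified while the scorer is reading, or the cursor has 
to be rewound and read again, which is the trade-off described above.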

I actually think that if we follow through with the 
registration-during-construction idea, we can have the tests done in an 
efficient way during scoring (with final boolean properties of the scorers), 
and it can be OK to have them in the scoring loop.
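
A rough sketch of what registration during construction could look like, under 
stated assumptions: SketchScorerContext, subscribe(), sinkFor() and 
SketchTermScorer are all hypothetical names (only the needsPositions flag is 
borrowed from the issue description), and positions are plain int arrays. The 
point is that the decision is made once, when the scorer is built, so the test 
inside the scoring loop is against a final boolean.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical registry that consumers fill in before scorers are created. */
final class SketchScorerContext {
    private final Map<String, List<Integer>> sinks = new HashMap<>();

    void subscribe(String term, List<Integer> sink) {
        sinks.put(term, sink);
    }

    List<Integer> sinkFor(String term) {
        return sinks.get(term); // null when nobody subscribed to this term
    }
}

/** Toy term scorer that learns at construction whether positions are wanted. */
final class SketchTermScorer {
    private final int[] positions;
    private final List<Integer> sink;
    private final boolean needsPositions; // fixed once, at construction

    SketchTermScorer(String term, int[] positions, SketchScorerContext ctx) {
        this.positions = positions;
        this.sink = ctx.sinkFor(term);
        this.needsPositions = sink != null;
    }

    /** Computes the term frequency, forwarding positions only if subscribed. */
    int freq() {
        int freq = 0;
        for (int position : positions) {
            freq++;
            if (needsPositions) { // test against a final field in the loop
                sink.add(position);
            }
        }
        return freq;
    }
}

final class RegistrationDemo {
    public static void main(String[] args) {
        SketchScorerContext ctx = new SketchScorerContext();
        List<Integer> sink = new ArrayList<>();
        ctx.subscribe("lucene", sink); // subscription happens before creation
        SketchTermScorer subscribed = new SketchTermScorer("lucene", new int[] {3, 8}, ctx);
        SketchTermScorer plain = new SketchTermScorer("spans", new int[] {4}, ctx);
        System.out.println(subscribed.freq() + " " + plain.freq() + " " + sink);
    }
}
```

Scorers nobody subscribed to pay only for the boolean test, which is the 
efficiency argument made above.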

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: the ones which can 
> make use of positions (mainly spans) and payloads (spans), and those which 
> can't. Yet Span*Query doesn't really do scoring comparable to what other 
> queries do, and at the end of the day they duplicate a lot of code all over 
> Lucene. Span*Queries are also limited to other Span*Query instances, such 
> that you cannot use a TermQuery or a BooleanQuery with SpanNear or anything 
> like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems have bugged me for a 
> while now, so I started working on that using the bulkpostings API. I would 
> have done that first cut on trunk, but TermScorer is working on a BlockReader 
> that does not expose positions, while the one in this branch does. I started 
> adding a new Positions class which users can pull from a scorer; to prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorere#needsPayloads to create the corresponding enum on demand. 
> Yet currently only TermQuery / TermScorer implements this API, and the others 
> simply return null instead. 
> To show that the API really works and that our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise and now :) all works with 
> positions, while Payloads for bulkreading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
