[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Mike Sokolov (JIRA) Sun, 03 Jul 2011 11:47:45 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059256#comment-13059256
 ]


Mike Sokolov commented on LUCENE-2878:
--------------------------------------

bq. So PositionsIterators never preserve positions for a document you 
pulled the interval for. You can basically pull the iterator only once and keep 
it until your scorer is exhausted. Bottom line here is that you are depending on 
the DocsAndPositionsEnum your TermScorer is using. Once this is advanced your 
positions are advanced too. We could think of a separate Enum here that 
advances independently, hmm that could actually work too, let's keep that in 
mind.

So after working with this a bit more (and reading the paper), I see now that 
it's really not necessary to cache positions in the iterators.  So never mind 
all that!  In the end, for some uses like highlighting I think somebody needs 
to cache positions (I put it in a ScorePosDoc created by the PosCollector), but 
I agree that doesn't belong in the "lower level" iterator.
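
To make the caching point concrete, here is a minimal sketch of what a 
collector-level holder could look like. It is not the code from the attached 
patches: the class name ScorePosDocSketch, the Interval type, and every method 
name below are assumptions made up for illustration. The idea is only that the 
collector copies each (begin, end) pair out of the iterator while the scorer is 
still positioned on the document, so the low-level iterator itself never has to 
remember anything.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical per-document holder, filled by a collector while the scorer
 * is still positioned on the document. Names are illustrative only.
 */
final class ScorePosDocSketch {

    /** Immutable copy of one position interval (begin and end positions). */
    static final class Interval {
        final int begin;
        final int end;

        Interval(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    final int doc;
    final float score;
    private final List<Interval> positions = new ArrayList<>();

    ScorePosDocSketch(int doc, float score) {
        this.doc = doc;
        this.score = score;
    }

    /** Copies the values, since the underlying iterator may reuse its state. */
    void addPosition(int begin, int end) {
        positions.add(new Interval(begin, end));
    }

    /** Read-only view for later consumers such as a highlighter. */
    List<Interval> positions() {
        return Collections.unmodifiableList(positions);
    }
}
```

A highlighter could then walk positions() after the search has finished, 
without touching the scorer or its enum again.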

bq. Eventually I think we can leave spans as they are right now and concentrate 
on the API / functionality, making things fast under the hood can be done later 
but getting things right to be flexible is the most important part here.

As I'm learning more, I am beginning to see this is going to require sweeping 
updates.  Basically everywhere we currently create a DocsEnum, we might now 
want to create a DocsAndPositionsEnum, and then the options (needs 
positions/payloads) have to be threaded through all the surrounding APIs. I 
wonder if it wouldn't make sense to encapsulate those options 
(needsPositions/needsPayloads) in some kind of EnumConfig object.  That way, if 
some other information is stored in the index down the line and needs to be 
made available during scoring, the required change would be much less painful 
to implement.
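
As a rough sketch of the EnumConfig idea (nothing like this exists in the 
patch; the class shape, DOCS_ONLY, and the with*() methods are all 
hypothetical), the two flags could live in one small immutable object that 
gets passed through the APIs instead of separate booleans. A new option would 
then mean a new field here rather than another parameter on every signature.

```java
/**
 * Hypothetical options object describing what a caller needs from an enum.
 * All names here are illustrative assumptions, not an existing Lucene API.
 */
final class EnumConfig {

    /** Default: document ids (and freqs) only, no positions or payloads. */
    static final EnumConfig DOCS_ONLY = new EnumConfig(false, false);

    private final boolean needsPositions;
    private final boolean needsPayloads;

    private EnumConfig(boolean needsPositions, boolean needsPayloads) {
        this.needsPositions = needsPositions;
        this.needsPayloads = needsPayloads;
    }

    /** Returns a copy that also requests positions. */
    EnumConfig withPositions() {
        return new EnumConfig(true, needsPayloads);
    }

    /**
     * Returns a copy that also requests payloads. Assumption: payloads are
     * stored per position, so asking for them implies asking for positions.
     */
    EnumConfig withPayloads() {
        return new EnumConfig(true, true);
    }

    boolean needsPositions() {
        return needsPositions;
    }

    boolean needsPayloads() {
        return needsPayloads;
    }
}
```

A caller would build the request once, for example 
EnumConfig.DOCS_ONLY.withPositions(), and code that creates the enum would 
check needsPositions() to decide between a DocsEnum and a 
DocsAndPositionsEnum.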

I'm thinking for example (Robert M's idea), that it might be nice to have a 
positions->offsets map in the index (this would be better for highlighting than 
term vectors).  Maybe this would just be part of the payload, but maybe not?  And 
it seems possible there could be other things like that we don't know about yet?

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, PosHighlighter.patch, 
> PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries, the ones which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they are duplicating a lot of code all over Lucene. Span*Queries 
> are also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since scorers don't 
> expose any positional information. All those problems have bugged me for a 
> while now, so I started working on that using the bulkpostings API. I would 
> have done that first cut on trunk, but TermScorer is working on a 
> BlockReader that does not expose positions, while the one in this branch 
> does. I started adding a new Positions class which users can pull from a 
> scorer. To prevent unnecessary positions enums I added 
> ScorerContext#needsPositions and eventually Scorere#needsPayloads to create 
> the corresponding enum on demand. Yet currently only TermQuery / TermScorer 
> implements this API, and others simply return null instead. 
> To show that the API really works and our BulkPostings work fine too with 
> positions I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise and now :) works fully with 
> positions, while Payloads for bulk reading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)


        
