[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Simon Willnauer (JIRA) Wed, 29 Jun 2011 01:36:07 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13057083#comment-13057083
 ]


Simon Willnauer commented on LUCENE-2878:
-----------------------------------------

Hey Mike,
great to see interest here! :)

bq. boolean Collector.needsPositions() and needsPayloads()
+1 that makes lots of sense. 
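
To make the idea concrete, here is a minimal sketch of how such flags could 
drive what gets pulled from the postings. All names in it 
(PositionAwareCollector, PostingsLevel, PostingsLevelChooser) are made up for 
illustration; this is neither the patch nor the existing Collector API.

```java
// Hypothetical sketch only: none of these types exist in Lucene or in the patch.

/** A collector that can declare up front which per-position data it needs. */
interface PositionAwareCollector {
    void collect(int doc);

    /** Defaults to false so plain collectors never pay for positions. */
    default boolean needsPositions() { return false; }

    /** Defaults to false so plain collectors never pay for payloads. */
    default boolean needsPayloads() { return false; }
}

/** How much postings data has to be decoded for a given collector. */
enum PostingsLevel { DOCS_ONLY, POSITIONS, POSITIONS_AND_PAYLOADS }

final class PostingsLevelChooser {
    private PostingsLevelChooser() {}

    /** Picks the cheapest level that still satisfies the collector. */
    static PostingsLevel choose(PositionAwareCollector collector) {
        // Payloads hang off positions, so asking for payloads implies positions.
        if (collector.needsPayloads()) {
            return PostingsLevel.POSITIONS_AND_PAYLOADS;
        }
        if (collector.needsPositions()) {
            return PostingsLevel.POSITIONS;
        }
        return PostingsLevel.DOCS_ONLY;
    }
}
```

The point of the defaults is that existing collectors would keep working 
unchanged and only position consumers would opt in.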

Let me give you some insight into the current state of the patch. This whole 
thing is still a prototype and needs lots of cleanup all over the place. I 
moved it to trunk recently since I don't want to wait for bulkpostings to move 
forward. I think its current state has a number of perf impacts, but eventually 
I think it will be much better, more powerful and cleaner than spans. 

bq. And then I am accessing the scorer.positions() from Collector.collect(), 
which I think is a very natural use of this API? At least it was intuitive for 
me, and I am pretty new to all this.

This is one way of doing it for sure. The other way would be to wrap the top 
level scorer and do your work in there with a PositionScoringQueryWrapper or 
something like that, which would set up the ScorerContext for you. The main 
question is what you want to do with positions. For matching based on positions 
you have to use some scorer, I guess, since you need to check for every 
document whether it is within your position constraints, something like 
near(a AND b). If you want to boost based on your positions, I think you need 
to do a 2 phase collection: Phase 1 simply runs the query and collects n + X 
results, and Phase 2 re-ranks the results from Phase 1 by pulling the positions. 
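
A rough sketch of the 2 phase idea, stripped of anything Lucene specific. Hit, 
ProximityBoost and TwoPhaseRerank are hypothetical names, and the boost 
callback only stands in for "pull the positions for this doc and turn them 
into a factor". Multiplying the score by that factor is an assumption made for 
the example, not something the patch does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: these types are not part of Lucene or the patch.
final class TwoPhaseRerank {
    private TwoPhaseRerank() {}

    /** A hit from phase 1: document id plus its plain query score. */
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Stand-in for pulling positions of a doc and deriving a boost factor. */
    interface ProximityBoost {
        float boost(int doc);
    }

    /** Returns the top n hits after re-ranking the best n + extra candidates. */
    static List<Hit> rerank(List<Hit> hits, int n, int extra, ProximityBoost boost) {
        // Phase 1: rank by the plain score and keep n + extra candidates.
        List<Hit> phase1 = new ArrayList<>(hits);
        phase1.sort((a, b) -> Float.compare(b.score, a.score));
        int keep = Math.min(phase1.size(), n + extra);

        // Phase 2: only the surviving candidates pay for position access.
        List<Hit> phase2 = new ArrayList<>();
        for (Hit h : phase1.subList(0, keep)) {
            phase2.add(new Hit(h.doc, h.score * boost.boost(h.doc)));
        }
        phase2.sort((a, b) -> Float.compare(b.score, a.score));
        return new ArrayList<>(phase2.subList(0, Math.min(n, phase2.size())));
    }
}
```

A document that does not make it into the n + X candidates is never 
re-scored, so X trades result quality against the cost of reading positions.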

bq. I think that when it comes to traversing the tree of 
PositionsIntervalIterators, the API you propose above might have some issues
I agree this is very flaky right now, and I only tried to mimic the spans 
behavior here to show that this is as powerful as spans for now. But eventually 
we need a better API for this, so it's good you are jumping in with a use case!

bq. What would the status of the returned iterators be?
Currently, if you pull an iterator you depend on the state of your scorer. Let 
me give you an example with TermScorer: if you are on document X you can 
iterate the positions for this document. Whether you exhaust them or not, once 
the scorer is advanced your PositionIterator points to the positions of the 
document you advanced to. The same is true for all other Scorers that expose 
positions. Yet, some problems arise here with BooleanScorer (in contrast to 
BooleanScorer2) since it reads documents in blocks, which makes it very hard 
(nearly impossible) to get efficient positions for this scorer (it's used only 
for OR queries with fewer than 32 NOT clauses). 
So PositionIterators never preserve the positions of a document you pulled the 
interval for. You can basically pull the iterator only once and keep it until 
your scorer is exhausted. Bottom line here is that you depend on the 
DocsAndPositionsEnum your TermScorer is using. Once this is advanced, your 
positions are advanced too. We could think of a separate Enum here that 
advances independently. Hmm, that could actually work too, let's keep that in 
mind.
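
A toy model of that coupling, to show what "depending on the scorer state" 
means in practice. ToyTermScorer and PositionView are invented for this 
example and use plain arrays instead of a DocsAndPositionsEnum; they are not 
the classes in the patch.

```java
// Hypothetical sketch only: a toy stand-in for a scorer over one term.
final class ToyTermScorer {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    private final int[] docs;          // matching doc ids, ascending
    private final int[][] positions;   // positions[i] belong to docs[i]
    private int docIdx = -1;           // cursor shared with every view
    private int posIdx = 0;            // position cursor within current doc

    ToyTermScorer(int[] docs, int[][] positions) {
        this.docs = docs;
        this.positions = positions;
    }

    /** Moves to the next doc and resets the shared position cursor. */
    int nextDoc() {
        docIdx++;
        posIdx = 0;
        return docIdx < docs.length ? docs[docIdx] : NO_MORE_DOCS;
    }

    /** Pulled once; the view holds no state of its own. */
    PositionView positions() {
        return new PositionView();
    }

    /** Reads positions of whatever doc the scorer currently sits on. */
    final class PositionView {
        /** Returns the next position in the current doc, or -1 if none. */
        int next() {
            if (docIdx < 0 || docIdx >= docs.length) {
                return -1;
            }
            int[] current = positions[docIdx];
            return posIdx < current.length ? current[posIdx++] : -1;
        }
    }
}
```

With docs {3, 7} and positions {{1, 5}, {2}}, reading one position on doc 3 
and then calling nextDoc() makes the same view return 2 next: position 5 of 
doc 3 is gone for good. An independently advancing enum would have to carry 
its own doc and position cursors instead.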

bq.  (so scoring isn't impacted by some other consumer of position intervals)
There should really be only one consumer. Which use case do you have in mind 
where multiple consumers are using the iterator?

bq. PositionInterval PositionIntervalIterator.current()
What is the returned PI here again? In the TermScorer case that is trivial, but 
what would a BooleanScorer return here?

bq. (2) return from subs() and nextSubIntervals() some unmodifiable wrappers - 
maybe a superclass of PII that would only provide current() and subs(), but not 
allow advancing the iterator.

I think that could make sense, but let me explain why this is there right now. 
Currently a scorer has a defined PositionIterator, which could be a problem 
later. For instance, if I want the minimal position interval (ordered) of all 
boolean clauses for query X, but for query Y I want the same interval unordered 
(out of order), I need to replace the logic in the scorer somehow. So to make 
that more flexible I exposed all subs here so you can run your own algorithm. I 
would love to see better solutions, since I only hacked this up in a couple of 
days. 

Currently this patch provides an AND (ordered & unordered) and a BLOCK 
PositionIterator based on this paper: 
http://vigna.dsi.unimi.it/ftp/papers/EfficientAlgorithmsMinimalIntervalSemantics
The OR implementation is still missing, so if you want to jump on that issue 
and help, there is tons of room for improvement. 
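
For anyone who has not read the paper, here is a small sketch of what 
"minimal interval" means for the unordered AND case. It is a simplified linear 
scan over plain int arrays, not the algorithm from the paper and not the code 
in the patch. MinimalIntervals is a made-up name, and the sketch assumes every 
list is strictly increasing and that no position occurs in two lists.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: simplified, array-based, unordered AND.
final class MinimalIntervals {
    private MinimalIntervals() {}

    /**
     * Returns all minimal intervals {start, end} that contain at least one
     * position from every list. Assumes strictly increasing lists that do not
     * share positions with each other.
     */
    static List<int[]> unorderedAnd(int[][] lists) {
        List<int[]> out = new ArrayList<>();
        if (lists.length == 0) {
            return out;
        }
        for (int[] list : lists) {
            if (list.length == 0) {
                return out; // one term has no position, so nothing can match
            }
        }
        int[] idx = new int[lists.length]; // current cursor per list
        while (true) {
            // Find the list holding the smallest current position and the
            // largest current position over all lists.
            int min = 0;
            int end = Integer.MIN_VALUE;
            for (int i = 0; i < lists.length; i++) {
                int p = lists[i][idx[i]];
                if (p < lists[min][idx[min]]) {
                    min = i;
                }
                if (p > end) {
                    end = p;
                }
            }
            int start = lists[min][idx[min]];
            boolean hasNext = idx[min] + 1 < lists[min].length;
            if (hasNext && lists[min][idx[min] + 1] <= end) {
                // A later position of the same list still fits before end, so
                // [start, end] can be shrunk and is not minimal.
                idx[min]++;
                continue;
            }
            out.add(new int[] {start, end});
            if (!hasNext) {
                return out; // no later interval can cover this list
            }
            idx[min]++;
        }
    }
}
```

For the position lists {1, 2, 10} and {3, 11} this yields [2,3], [3,10] and 
[10,11]. [1,3] is skipped because it contains the smaller match [2,3]. An 
ordered variant would additionally require the positions to appear in clause 
order.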

Eventually I think we can leave spans as they are right now and concentrate on 
the API / functionality, making things fast under the hood can be done later 
but getting things right to be flexible is the most important part here. 

Mike, would you be willing to upload a patch for your hacked collector etc. so 
we can see what you have done?

bq. I hope you'll be able to pick it up again soon, Simon!
I would love to ASAP, but currently I have so much DocValues stuff to do that 
it might take a while until I get back to this.






 

> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch
>
>
> Currently we have two somewhat separate types of queries, the ones which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity since scorers don't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on that using the bulkpostings API. I would have 
> done that first cut on trunk, but TermScorer there works on a BlockReader 
> that does not expose positions, while the one in this branch does. I started 
> adding a new Positions class which users can pull from a scorer. To prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorer#needsPayloads to create the corresponding enum on demand. 
> Yet, currently only TermQuery / TermScorer implements this API and others 
> simply return null instead. 
> To show that the API really works and our BulkPostings work fine with 
> positions too, I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise, and it now :) works fully with 
> positions, while Payloads for bulkreading are kind of experimental in the 
> patch and only work with Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext), which I should probably do on trunk 
> first, but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
