[jira] [Commented] (LUCENE-2878) Allow Scorer to expose positions and payloads aka. nuke spans

Simon Willnauer (JIRA) Mon, 11 Jul 2011 07:34:24 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063362#comment-13063362 ]


Simon Willnauer commented on LUCENE-2878:
-----------------------------------------

{quote} We want to highlight positions that explain why the document matches 
the query. Not all terms that match the term queries will count - some of them 
should be "filtered out" by near-conditions; ie in a PhraseQuery, matching 
terms not in the phrase should not be highlighted. I think if I just register a 
callback with the sub-scorers (scoring terms), I would see all the terms, 
right? {quote}

This is why I think we should add a dedicated collector API (i.e. not part of 
Collector; maybe an interface?). The current API gives you a "view" for each 
match, meaning that once you advance the iterator you get the positions for the 
"current" positional match. I think the caller should also drive the collection 
of intermediate positions / intervals. The big challenge here is to collect the 
positions you are interested in efficiently. I agree that the if(foo==null) 
check is a problem as long as foo is not final, so maybe we should try to make 
those fields final and make the position collector part of the scorer setup 
(just a thought). We could do that using a ScorerContext, for instance.
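
To make the "final field, set up with the scorer" idea concrete, here is a 
minimal sketch. All names in it (PositionCollector, SketchScorerContext, 
SketchTermScorer, NO_OP) are hypothetical and are not taken from the patch or 
from Lucene; it only assumes that the collector is handed over once at scorer 
creation, so the collecting loop needs no null check.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback; not an existing Lucene interface. */
interface PositionCollector {
    void collect(int doc, int begin, int end);

    /** Shared no-op for callers that do not want positions. */
    PositionCollector NO_OP = (doc, begin, end) -> { };
}

/** Hypothetical stand-in for the setup object (a ScorerContext in the comment above). */
final class SketchScorerContext {
    final PositionCollector positionCollector;

    SketchScorerContext(PositionCollector positionCollector) {
        // Resolve "no collector" once, at setup time.
        this.positionCollector =
            positionCollector == null ? PositionCollector.NO_OP : positionCollector;
    }
}

/** Toy leaf scorer holding the positions of one term in one document. */
final class SketchTermScorer {
    private final PositionCollector collector; // final: fixed when the scorer is created
    private final int doc;
    private final int[] positions;

    SketchTermScorer(SketchScorerContext ctx, int doc, int[] positions) {
        this.collector = ctx.positionCollector;
        this.doc = doc;
        this.positions = positions;
    }

    /** Reports each position of the current document; no null check in the loop. */
    void collectPositions() {
        for (int pos : positions) {
            collector.collect(doc, pos, pos);
        }
    }
}

final class FinalCollectorDemo {
    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        SketchScorerContext ctx =
            new SketchScorerContext((doc, begin, end) -> seen.add(doc + ":" + begin));
        new SketchTermScorer(ctx, 7, new int[] {2, 5}).collectPositions();
        System.out.println(seen);
    }
}
```

Whether a no-op collector or a separate scorer implementation is the cheaper 
way to skip collection is an open question; the sketch only shows that the 
decision can be made once during setup.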

{quote}
To make further progress, I think we need to resolve the position API. The 
testMultipleDocumentsOr test case illustrates the problem with the approach I 
was trying: walking the PositionIterator tree when collecting documents. 
Something like the PositionCollector API could work, but I think we still need 
to solve the problem Mike M alluded to back at the beginning:
{quote} 
Agreed, we should work on the API. I looked at your patch and some changes 
appear to be unnecessary IMO. For example, the problems in 
testMultipleDocumentsOr are not actually problems if we sketch this out 
properly. As I said above, if the collector is part of the initialization we 
can simply pass it to the leaves or intermediate scorers and collect safely 
even if scorers are advanced, since the view should be stable during document 
collection, right? 
So the bottom line is that we need an API that is capable of collecting 
fine-grained parts of the scorer tree. The only way I see of doing this is to 
1. have a subscribe / register method and 2. do this subscription during scorer 
creation. Once we have this we can implement very simple collect methods that 
only collect positions for the current match. In a near query, for example, 
while the current matching document is collected, all contributing TermScorers 
have their position interval ready for collection. The collect method can then 
be called from the consumer instead of in the loop; this way we only get the 
positions we need, since we know the document we are collecting. 
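
A rough sketch of the two steps above (subscribe during scorer creation, then 
let the consumer call collect for the document it is collecting) could look 
like this. PositionSubscriptions, ToyTermScorer and moveTo are made-up names 
for illustration only, and moveTo merely stands in for a parent scorer such as 
a near query placing its leaves on a match.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy leaf scorer; a parent (e.g. a near query) positions it on a match. */
final class ToyTermScorer {
    final String term;
    int doc = -1;
    int position = -1;

    ToyTermScorer(String term) {
        this.term = term;
    }

    /** Stand-in for the parent scorer advancing this leaf to a matching position. */
    void moveTo(int doc, int position) {
        this.doc = doc;
        this.position = position;
    }
}

/** Hypothetical registry: scorers are subscribed while they are created. */
final class PositionSubscriptions {
    private final List<ToyTermScorer> subscribed = new ArrayList<>();

    ToyTermScorer createTermScorer(String term) {
        ToyTermScorer scorer = new ToyTermScorer(term);
        subscribed.add(scorer);
        return scorer;
    }

    /** Called by the consumer, outside the scoring loop, for the doc being collected. */
    List<String> collect(int doc) {
        List<String> out = new ArrayList<>();
        for (ToyTermScorer scorer : subscribed) {
            if (scorer.doc == doc) {
                out.add(scorer.term + "@" + scorer.position);
            }
        }
        return out;
    }
}

final class SubscriptionDemo {
    public static void main(String[] args) {
        PositionSubscriptions subs = new PositionSubscriptions();
        ToyTermScorer quick = subs.createTermScorer("quick");
        ToyTermScorer fox = subs.createTermScorer("fox");
        ToyTermScorer dog = subs.createTermScorer("dog");
        quick.moveTo(3, 4);
        fox.moveTo(3, 5);
        dog.moveTo(9, 1); // positioned on another document, so not reported for doc 3
        System.out.println(subs.collect(3));
    }
}
```

The point of the sketch is only that collection is driven by the consumer and 
reads the state the scorers are already in for the current document.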

bq. The core problem solved here is how to report positions that are not 
consumed during scoring, and also those that are,
Can't this be solved by the approach in my comment above?

{quote} The interesting case is PositionFilterScorer, which filters its child 
Scorers. I added PositionIntervalIterator.getTermPositions() to enable this; 
this walks the tree of position iterators and returns a snapshot of their 
current state (as another iterator) so the consumer can retrieve all the term 
positions as filtered by intermediate iterators without advancing them.
{quote}
This would work the same way, ey? We register during setup, something like  
{code}void PositionCollector#registerScorer(Scorer){code} and then we can 
decide whether we need that scorer, or rather its positions, for collection or 
not. The entire iteration should only be driven by the top-level consumer; if 
you advance an intermediate iterator you might break some higher-level 
algorithms like conjunction / disjunction. So let's drive this further: 
let's say we have all the collectors that we are interested in; when should we 
collect positions? I think the top-level consumer should 1. advance the 
positions and 2. call collect on the scorers we are interested in.
While I talk about this I start realizing that it might be even easier than 
this if we walk the PositionIterator tree rather than the scorer tree and 
collect the position iterators from there. This is already possible with the 
subs() call, right? What we essentially need is a method that returns the 
current interval for each of the iterators. It might still be necessary to have 
a collect method on the iterator so that something like Conjunctions can call 
collect on the subs if needed?
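
As a sketch of the tree-walking variant: assuming each iterator can report its 
current interval and its children, the consumer can read the leaf intervals 
without advancing anything. IntervalNode, current(), FixedNode and 
IntervalTreeWalker are hypothetical names, and the subs() signature shown here 
is an assumption rather than the one in the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical shape of a node in the position iterator tree. */
interface IntervalNode {
    /** Current interval as {begin, end}, or null if there is no current match. */
    int[] current();

    /** Child iterators; empty for a leaf (term) iterator. */
    List<IntervalNode> subs();
}

/** Toy node with a fixed state, used only to exercise the walker. */
final class FixedNode implements IntervalNode {
    private final int[] interval;
    private final List<IntervalNode> children;

    FixedNode(int[] interval, List<IntervalNode> children) {
        this.interval = interval;
        this.children = children;
    }

    static FixedNode leaf(int begin, int end) {
        return new FixedNode(new int[] {begin, end}, Collections.<IntervalNode>emptyList());
    }

    @Override public int[] current() { return interval; }
    @Override public List<IntervalNode> subs() { return children; }
}

final class IntervalTreeWalker {
    /** Reads the current interval of every leaf without advancing any iterator. */
    static List<int[]> leafIntervals(IntervalNode root) {
        List<int[]> out = new ArrayList<>();
        walk(root, out);
        return out;
    }

    private static void walk(IntervalNode node, List<int[]> out) {
        List<IntervalNode> children = node.subs();
        if (children.isEmpty()) {
            int[] interval = node.current();
            if (interval != null) {
                out.add(interval);
            }
            return;
        }
        for (IntervalNode child : children) {
            walk(child, out);
        }
    }
}

final class TreeWalkDemo {
    public static void main(String[] args) {
        List<IntervalNode> kids = new ArrayList<>();
        kids.add(FixedNode.leaf(4, 4));
        kids.add(FixedNode.leaf(5, 5));
        IntervalNode near = new FixedNode(new int[] {4, 5}, kids);
        for (int[] interval : IntervalTreeWalker.leafIntervals(near)) {
            System.out.println(interval[0] + "-" + interval[1]);
        }
    }
}
```

A filtering parent such as a conjunction would still need some way to say 
which of its subs contributed to the current match, which is where a collect 
method on the iterator might come in.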

Oh man this is all kind of tricky ey :)

bq. There are a few (11) failing tests with this branch+patch (ran lucene tests 
only), but they seem unrelated (TestFlushByRamOrCountsPolicy has 5, eg) I am 
ignoring?

I don't see anything failing... can you attach a file with the failures?


> Allow Scorer to expose positions and payloads aka. nuke spans 
> --------------------------------------------------------------
>
>                 Key: LUCENE-2878
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2878
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/search
>    Affects Versions: Bulk Postings branch
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2878-OR.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, LUCENE-2878.patch, 
> LUCENE-2878.patch, LUCENE-2878_trunk.patch, LUCENE-2878_trunk.patch, 
> PosHighlighter.patch, PosHighlighter.patch
>
>
> Currently we have two somewhat separate types of queries: the ones which can 
> make use of positions (mainly spans) and payloads (spans). Yet Span*Query 
> doesn't really do scoring comparable to what other queries do, and at the end 
> of the day they duplicate a lot of code all over Lucene. Span*Queries are 
> also limited to other Span*Query instances, such that you cannot use a 
> TermQuery or a BooleanQuery with SpanNear or anything like that. 
> Besides the Span*Query limitation, other queries lack a quite interesting 
> feature: they cannot score based on term proximity, since Scorer doesn't 
> expose any positional information. All those problems bugged me for a while 
> now, so I started working on that using the bulkpostings API. I would have 
> done that first cut on trunk, but TermScorer is working on a BlockReader that 
> does not expose positions, while the one in this branch does. I started adding 
> a new Positions class which users can pull from a scorer; to prevent 
> unnecessary positions enums I added ScorerContext#needsPositions and 
> eventually Scorer#needsPayloads to create the corresponding enum on demand. 
> Yet, currently only TermQuery / TermScorer implements this API and the others 
> simply return null instead. 
> To show that the API really works and our BulkPostings work fine too with 
> positions, I cut over TermSpanQuery to use a TermScorer under the hood and 
> nuked TermSpans entirely. A nice side effect of this was that the Position 
> BulkReading implementation got some exercise, which now :) all works with 
> positions, while Payloads for bulkreading are kind of experimental in the 
> patch and only work with the Standard codec. 
> So all spans now work on top of TermScorer ( I truly hate spans since today ) 
> including the ones that need Payloads (StandardCodec ONLY)!!  I didn't bother 
> to implement the other codecs yet since I want to get feedback on the API and 
> on this first cut before I go on with it. I will upload the corresponding 
> patch in a minute. 
> I also had to cut over SpanQuery.getSpans(IR) to 
> SpanQuery.getSpans(AtomicReaderContext) which I should probably do on trunk 
> first but after that pain today I need a break first :).
> The patch passes all core tests 
> (org.apache.lucene.search.highlight.HighlighterTest still fails but I didn't 
> look into the MemoryIndex BulkPostings API yet)

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

