[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-08-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597034#comment-16597034
 ] 

ASF subversion and git services commented on LUCENE-8286:
-

Commit bf7d1078e4ef6c99abaf5c76eccf56ed0f09f553 in lucene-solr's branch 
refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bf7d107 ]

LUCENE-8286: UnifiedHighlighter: new HighlightFlag.WEIGHT_MATCHES for 
MatchesIterator API.
Other API changes: New UHComponents, and FieldOffsetStrategy takes a LeafReader 
not IndexReader now.
Closes #409

(cherry picked from commit b19ae942f154924b9108c4e0409865128f2a07d4)


> UnifiedHighlighter should support the new Weight.matches API for better match 
> accuracy
> --
>
> Key: LUCENE-8286
> URL: https://issues.apache.org/jira/browse/LUCENE-8286
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/highlighter
>Reporter: David Smiley
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The new Weight.matches() API should allow the UnifiedHighlighter to highlight 
> some BooleanQuery patterns more accurately -- see LUCENE-7903.
> In addition, this API should make the job of highlighting easier, reducing 
> the LOC and related complexities, especially the UH's PhraseHelper.  Note: 
> reducing/removing PhraseHelper is not a near-term goal since Weight.matches 
> is experimental and incomplete, and perhaps we'll discover some gaps in 
> flexibility/functionality.
> This issue should introduce a new UnifiedHighlighter.HighlightFlag enum 
> option for this method of highlighting.   Perhaps call it {{WEIGHT_MATCHES}}? 
>  Longer term it could go away and it'll be implied if you specify enum values 
> for PHRASES & MULTI_TERM_QUERY?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-08-29 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597019#comment-16597019
 ] 

ASF subversion and git services commented on LUCENE-8286:
-

Commit b19ae942f154924b9108c4e0409865128f2a07d4 in lucene-solr's branch 
refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b19ae94 ]

LUCENE-8286: UnifiedHighlighter: new HighlightFlag.WEIGHT_MATCHES for 
MatchesIterator API.
Other API changes: New UHComponents, and FieldOffsetStrategy takes a LeafReader 
not IndexReader now.
Closes #409





[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-08-27 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593692#comment-16593692
 ] 

David Smiley commented on LUCENE-8286:
--

_I think the PR is ready to go._  I'll commit in a couple of days.

OE.getTerm is implemented consistently with how the others work.

I also tracked down a curious observation in one of the tests (*not* for 
MatchesIterator): sloppy phrase queries are sometimes not highlighted 
faithfully to the original query, because WeightedSpanTermExtractor's 
conversion of a PhraseQuery to a SpanQuery sets inOrder=false when there is 
slop.  This goes to show that MatchesIterator-based highlighting is more 
accurate in multiple ways.
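
To make the inOrder=false effect concrete, here is a minimal, self-contained 
sketch for a two-term phrase. It does not call Lucene: the class and method 
names are hypothetical, and both checks are simplified models (assumptions) of 
how a sloppy phrase and an unordered span-near decide whether two positions 
match.

```java
// Hypothetical, simplified two-term model; it does not use Lucene classes.
final class SloppyVsUnordered {

    // Sloppy phrase "a b"~slop: shift each position by the term's offset in the
    // phrase (a=0, b=1) and require the spread of the shifted positions <= slop.
    static boolean sloppyPhraseMatches(int posA, int posB, int slop) {
        int shiftedA = posA;      // offset 0 in the phrase
        int shiftedB = posB - 1;  // offset 1 in the phrase
        return Math.abs(shiftedA - shiftedB) <= slop;
    }

    // Unordered span-near: only the positions not covered by the two terms
    // count against the slop, regardless of which term comes first.
    static boolean unorderedNearMatches(int posA, int posB, int slop) {
        if (posA == posB) {
            return false; // this toy model assumes the terms sit at distinct positions
        }
        int start = Math.min(posA, posB);
        int end = Math.max(posA, posB) + 1; // exclusive end
        int gap = end - start - 2;          // width minus the two term positions
        return gap <= slop;
    }

    public static void main(String[] args) {
        // Text "b a": b at position 0, a at position 1, queried as "a b"~1.
        System.out.println(sloppyPhraseMatches(1, 0, 1));  // prints false
        System.out.println(unorderedNearMatches(1, 0, 1)); // prints true
    }
}
```

Under these assumptions the transposed text needs a slop of 2 as a phrase but 
already matches the unordered span with slop 1, which is the kind of 
occurrence that could be highlighted even though the original query would not 
match it.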

Suggested CHANGES.txt:
The UnifiedHighlighter now has a new experimental HighlightFlag.WEIGHT_MATCHES 
flag that causes it to use Lucene's new Weight.matches API.  This highlights 
more accurately and strictly, solving issues like LUCENE-7903.  Each phrase 
occurrence will be formatted as a single span instead of word by word.  
Passage relevancy might be degraded, however, since "freq" isn't calculated.  
The flag is disabled by default.  There were some API changes that are public 
but internal to the UH, including a new UHComponents class.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-08-13 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578835#comment-16578835
 ] 

David Smiley commented on LUCENE-8286:
--

Actually, before continuing with any of that, I think the PR is *almost* good 
enough for this new mode.  It's not on by default, so you have to opt in.  If 
you do opt in, you get:
* Better matching accuracy, particularly with nested conjunction/disjunction 
(solving LUCENE-7903).
* Phrase queries will have highlights spanning more naturally instead of 
per-term.  Cosmetic but nice.  SpanQuery nested stuff is as-before in this 
regard, though.
* Passage scoring won't be as good due to a constant freq().  Some users won't 
care; arguably diversity of terms is more important, particularly in a snippet. 
 Note that consideration of freq() can be dialed down to nothing by setting the 
"k1" BM25 param of DefaultPassageScorer to 0, and this is in fact tested as 
having such an effect.
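
The effect of setting "k1" to 0 can be sketched with a BM25-style 
term-frequency factor. This is a minimal sketch under assumptions: the class 
name, the method signature and the parameter values are illustrative, and the 
exact formula used by DefaultPassageScorer may differ.

```java
// Hypothetical sketch of a BM25-style term-frequency factor; not Lucene's code.
final class PassageTfSketch {

    // k1 controls how strongly freq influences the result; b controls how
    // strongly the passage length is normalized against the pivot length.
    static float tf(int freq, int passageLen, float k1, float b, float pivot) {
        float norm = k1 * ((1 - b) + b * (passageLen / pivot));
        return freq / (freq + norm);
    }

    public static void main(String[] args) {
        // With k1 = 0 the norm is 0, so any freq >= 1 yields exactly 1.0.
        System.out.println(tf(1, 100, 0f, 0.75f, 87f));  // prints 1.0
        System.out.println(tf(10, 100, 0f, 0.75f, 87f)); // prints 1.0
        // With a non-zero k1, a higher freq gives a larger factor.
        System.out.println(
            tf(1, 100, 1.2f, 0.75f, 87f) < tf(10, 100, 1.2f, 0.75f, 87f)); // prints true
    }
}
```

In this model a constant freq() costs nothing once k1 is 0, because the 
factor no longer depends on freq at all.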

The only remaining work that isn't too disruptive is implementing 
OffsetsEnum.getTerm().  The current nocommit of an empty term is bad because 
the term is also used in DefaultPassageScorer to calculate per-term stats.  So 
this ought to return the actual term, although it would be fine if it were 
really a query.toString().  In the interest of getting an experimental feature 
out the door, I think I'll do the latter.  Only someone customizing the scorer 
or formatter in a way that depends on the nature of the term would be impacted.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-08-13 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578802#comment-16578802
 ] 

David Smiley commented on LUCENE-8286:
--

I made substantial progress on the PR:
{noformat}
LUCENE-8286 UH: Use MI.getSubMatches().  Removed PhraseHelper changes; not 
necessary anymore.
Updated based on MI improvements in master.
With subMatches, we have better fidelity on span queries.
And since MI can handle span queries now, no need to touch PhraseHelper.
* added to UHComponents: query, and highlightFlags
* updated tests to handle with/without WEIGHT_MATCHES
* TestUnifiedHighlighterStrictPhrases uses more randomization.
  Removed brittle score calculation dependence.
* Test Passage matches data is in order
TODO: OE freq & term()
{noformat}
It was nice to see that UH's PhraseHelper can be circumvented now.  Handling 
mi.getSubMatches proved to be difficult, but I ultimately got it working.  See 
https://github.com/dsmiley/lucene-solr/blob/LUCENE-8286/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/OffsetsEnum.java#L168

Next up is handling OffsetsEnum.getTerm().  I could change the API so that 
getTerm() returns getQuery() and update Passage & PassageScorer accordingly.  
Callers of getTerm() were all internal or considered experimental anyway 
(definitely not in common use), so I think it could change in a minor release.  
I hope multi-term query types will be retained as such, but I fear 
MatchesIterator expands them without retaining the original, so the results 
here will be adequate rather than ideal.

Then, OffsetsEnum.freq().  This one is hard.  We could let "-1" mean that freq 
is unsupported.  Then a new PassageScorer design, created per highlighted 
field value, could be given access to the IndexReader in 
org.apache.lucene.search.uhighlight.FieldHighlighter#highlightOffsetsEnums.  
When it sees -1 at scoring time, it could calculate the in-doc freq and cache 
it.  Or maybe we don't care that much about the in-doc freq; it may be 
expensive to calculate anyway.  Maybe we want the associated Query's score for 
this document (which will consider global stats like IDF), but that again 
needs access to the IndexReader.  It'd be nice if boosts wrapped around the 
query could be considered, but that's just not there (also true without MI 
mode).
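
The "-1 means unsupported, compute and cache on demand" idea could look 
roughly like the following. This is only a sketch: every name in it is 
hypothetical, none of it exists in Lucene, and the lookup function stands in 
for whatever would read the in-doc freq from the IndexReader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch; none of these names exist in Lucene.
final class LazyFreqCache {
    static final int FREQ_UNSUPPORTED = -1;

    private final Map<String, Integer> cache = new HashMap<>();
    private final ToIntFunction<String> inDocFreq; // stand-in for an index lookup
    int lookups = 0; // counts how often the expensive lookup actually ran

    LazyFreqCache(ToIntFunction<String> inDocFreq) {
        this.inDocFreq = inDocFreq;
    }

    // Returns reportedFreq unless it is the sentinel; in that case the in-doc
    // frequency is computed once per term and cached for later calls.
    int resolve(String term, int reportedFreq) {
        if (reportedFreq != FREQ_UNSUPPORTED) {
            return reportedFreq;
        }
        return cache.computeIfAbsent(term, t -> {
            lookups++;
            return inDocFreq.applyAsInt(t);
        });
    }

    public static void main(String[] args) {
        LazyFreqCache freqs = new LazyFreqCache(term -> term.length()); // fake "index"
        System.out.println(freqs.resolve("lucene", 3));                // prints 3
        System.out.println(freqs.resolve("lucene", FREQ_UNSUPPORTED)); // prints 6
        System.out.println(freqs.resolve("lucene", FREQ_UNSUPPORTED)); // prints 6
        System.out.println(freqs.lookups);                             // prints 1
    }
}
```

The cache would be created per highlighted field value, so the cost of the 
lookup is paid at most once per term within that value.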




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-07-11 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540745#comment-16540745
 ] 

David Smiley commented on LUCENE-8286:
--

I updated the PR significantly.  It addresses requireFieldMatch/fieldMatcher 
and some other cases.  [~romseygeek] you might find it worthwhile to see the 
changes as some are applicable to the highlighter you're working on.  See 
OverlaySingleDocTermsLeafReader in particular.  The different aspects of the 
changes are reasonably well split into separate commits.  There are a couple 
of nocommits and a few failing tests, but further substantive progress depends 
on getting access to the matching terms for passage scoring.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-06-24 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521825#comment-16521825
 ] 

David Smiley commented on LUCENE-8286:
--

I pushed the patch to GitHub as a linked PR.  The only change is reverting the 
choice of having OffsetsEnum implement MatchesIterator as it turned out to be 
rather pointless.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-06-10 Thread David Smiley (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507691#comment-16507691
 ] 

David Smiley commented on LUCENE-8286:
--

The first patch here is my WIP.  Everything compiles and the results are 
generally reasonable, notwithstanding some known issues already pointed out in 
my previous comment.  I enabled it by default and then looked to see what 
tests broke and why:

* TestUnifiedHighlighter: all failures are for the testFieldMatcher methods 
since the fieldMatcher mechanism doesn't yet work with this (mentioned in prev 
comment)
* TestUnifiedHighlighterMTQ.testWhichMTQMatched: because MatchesIterator 
doesn't yet expose which term matched.
* TestUnifiedHighlighterRanking: failed because the scoring isn't the same
* TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc: randomly fails 
because sometimes the underlying fields don't have a real index.  The UH 
highlights one field at a time and _that_ field being highlighted will be made 
to appear as indexed if it wasn't already (e.g. re-analysis into MemoryIndex or 
TV LeafReader wrapper) but no other fields will be.  I think once a solution to 
fieldMatcher works, it may solve the situation here.
* TestUnifiedHighlighterStrictPhrases: I haven't reviewed each failure yet, but 
they all seem to be due to the distinction between highlighting the words in a 
phrase individually and highlighting the phrase as one span.  All the 
assertions assume individual words.

What's cool is that this wasn't a big change, and it can be intermixed with 
SpanQueries.  I need to look at the scoring options more -- loss of freq() is a 
shame.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-05-03 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462466#comment-16462466
 ] 

David Smiley commented on LUCENE-8286:
--

The "span" width _could_ be used for passage relevancy, and perhaps ought to 
be; sure.  I just meant to convey that today the UH doesn't have or use this 
info.

BTW I did a quick hack integration last night of Weight.matches into the UH 
and ran some tests.  I had no issue with term vectors.  The fieldMatcher (aka 
requireFieldMatch option) will require some work.  And if the query references 
non-highlighted fields in a way that constrains the results (i.e. MUST 
otherfield:foo), then for the Analysis offset strategy we'll need to combine an 
aggregate index view of analysis with the underlying real index for the other 
fields, because the MemoryIndex alone has only one field: the field being 
highlighted.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-05-03 Thread Adrien Grand (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462361#comment-16462361
 ] 

Adrien Grand commented on LUCENE-8286:
--

bq. MI has things we don't need – position spans

I don't know the unified highlighter well, but I would expect this information 
to be important to score passages? For instance if you run a sloppy phrase 
query, matches that have a smaller width should get a higher weight, shouldn't 
they?




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-05-02 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461646#comment-16461646
 ] 

David Smiley commented on LUCENE-8286:
--

{quote}there is no way to retrieve the term/query in the matches iterator
{quote}
Oh I see -- this was removed in LUCENE-8270!  I was loosely following the 
related issues but overlooked that.  [~romseygeek] the statement in the 
description "we don't have a clear use-case for this yet" surprises me; it's 
clearly _highlighting_, no?  Despite this blocker, maybe I could put together a 
patch here, one that has poor scoring because we don't know the term, and that 
will help identify how a matchesIterator.term() could be used?
{quote}One thing we could do to simplify the transition is to remove 
OffsetsEnum entirely and replace it with the MatchesIterator, apart from the 
missing bits I described above this should be easy to do.
{quote}
Or make OE extend MatchesIterator?  It has things we need – term(), freq().  MI 
has things we don't need – position spans, but these can be ignored.
{quote}we can't easily use term vectors for a single field with Matches.
{quote}
Interesting; I'll take a closer look.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-05-02 Thread Alan Woodward (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461226#comment-16461226
 ] 

Alan Woodward commented on LUCENE-8286:
---

There's an API mismatch in how offsets are retrieved, per-field in the 
UnifiedHighlighter and per-leafreader in the Matches API, which means that (for 
example) we can't easily use term vectors for a single field with Matches.  So 
that will need to be resolved somehow.




[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-05-02 Thread Jim Ferenczi (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461105#comment-16461105
 ] 

Jim Ferenczi commented on LUCENE-8286:
--

I also think that it would greatly simplify the code (especially PhraseHelper 
;) ) but matches require some changes to allow this replacement. First of all, 
there is no way to retrieve the term/query in the matches iterator, so it's not 
possible to count the number of occurrences of a specific query or the total 
frequency in the document. This information is needed to compute the score of 
a passage, so we need to add something in matches.
The matches iterator can return duplicates (if the same term is present in 
multiple clauses) and will soon be able to return matches from phrases (rather 
than individual terms). This means that we'll need to detect overlapping 
intervals when the passages are built. I see this as an improvement since it 
would allow highlighting entire phrases, but for spans we'll need an option to 
split match intervals, since a span near (or any other span query) can have big 
gaps, so it would not make sense to highlight the entire match in a single 
highlight.
One thing we could do to simplify the transition is to remove OffsetsEnum 
entirely and replace it with the MatchesIterator. Apart from the missing bits 
I described above, this should be easy to do.
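
Detecting duplicate and overlapping intervals before passages are built could 
be sketched as below. This is a minimal sketch, not the highlighter's actual 
code: the class name and the int[]{startOffset, endOffset} layout are 
assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch; the class name and int[]{start, end} layout are assumptions.
final class MatchIntervalMerger {

    // Merges duplicate and overlapping [start, end) offset intervals.
    // Intervals that merely touch (end == next start) are kept separate.
    static List<int[]> merge(List<int[]> matches) {
        List<int[]> sorted = new ArrayList<>(matches);
        sorted.sort(Comparator.comparingInt((int[] m) -> m[0]).thenComparingInt(m -> m[1]));
        List<int[]> merged = new ArrayList<>();
        for (int[] m : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && m[0] < last[1]) {
                last[1] = Math.max(last[1], m[1]); // overlap: extend the previous interval
            } else {
                merged.add(new int[] {m[0], m[1]}); // disjoint: start a new interval
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // A term match [0,5) reported twice, a phrase match [0,11), and a later term.
        List<int[]> merged = merge(Arrays.asList(
            new int[] {0, 5}, new int[] {0, 5}, new int[] {0, 11}, new int[] {20, 25}));
        for (int[] m : merged) {
            System.out.println(m[0] + "-" + m[1]); // prints 0-11, then 20-25
        }
    }
}
```

Merging like this favours highlighting whole phrases; the split option 
mentioned for span queries would need the opposite treatment, keeping the 
individual term intervals instead of the enclosing one.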





[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy

2018-05-01 Thread David Smiley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459899#comment-16459899
 ] 

David Smiley commented on LUCENE-8286:
--

[~romseygeek], do you by any chance have any WIP code for this, or do any 
concerns come to mind?
Perhaps I'll get around to this issue in a couple of weeks.
