[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597034#comment-16597034 ]

ASF subversion and git services commented on LUCENE-8286:
---------------------------------------------------------

Commit bf7d1078e4ef6c99abaf5c76eccf56ed0f09f553 in lucene-solr's branch refs/heads/branch_7x from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=bf7d107 ]

LUCENE-8286: UnifiedHighlighter: new HighlightFlag.WEIGHT_MATCHES for MatchesIterator API.
Other API changes: New UHComponents, and FieldOffsetStrategy takes a LeafReader not IndexReader now.
Closes #409

(cherry picked from commit b19ae942f154924b9108c4e0409865128f2a07d4)

> UnifiedHighlighter should support the new Weight.matches API for better match accuracy
> ---------------------------------------------------------------------------------------
>
> Key: LUCENE-8286
> URL: https://issues.apache.org/jira/browse/LUCENE-8286
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: David Smiley
> Priority: Major
> Time Spent: 20m
> Remaining Estimate: 0h
>
> The new Weight.matches() API should allow the UnifiedHighlighter to more
> accurately highlight some BooleanQuery patterns correctly -- see LUCENE-7903.
> In addition, this API should make the job of highlighting easier, reducing
> the LOC and related complexities, especially the UH's PhraseHelper. Note:
> reducing/removing PhraseHelper is not a near-term goal since Weight.matches
> is experimental and incomplete, and perhaps we'll discover some gaps in
> flexibility/functionality.
> This issue should introduce a new UnifiedHighlighter.HighlightFlag enum
> option for this method of highlighting. Perhaps call it {{WEIGHT_MATCHES}}?
> Longer term it could go away and it'll be implied if you specify enum values
> for PHRASES & MULTI_TERM_QUERY?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16597019#comment-16597019 ]

ASF subversion and git services commented on LUCENE-8286:
---------------------------------------------------------

Commit b19ae942f154924b9108c4e0409865128f2a07d4 in lucene-solr's branch refs/heads/master from [~dsmiley]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b19ae94 ]

LUCENE-8286: UnifiedHighlighter: new HighlightFlag.WEIGHT_MATCHES for MatchesIterator API.
Other API changes: New UHComponents, and FieldOffsetStrategy takes a LeafReader not IndexReader now.
Closes #409
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16593692#comment-16593692 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

_I think the PR is ready to go._ I'll commit in a couple of days. OE.getTerm is implemented consistently with how the others work.

I also tracked down a curious observation in one of the tests (*not* for MatchesIterator): sloppy phrase queries sometimes won't be highlighted faithfully to the original query, because WeightedSpanTermExtractor's conversion of a PhraseQuery to a SpanQuery sets inOrder=false when there is slop. This just goes to show that MatchesIterator-based highlighting is more accurate in multiple ways.

Suggested CHANGES.txt:

The UnifiedHighlighter now has a new experimental HighlightFlag.WEIGHT_MATCHES flag that causes it to use Lucene's new Weight.matches API. This will highlight more accurately and strictly, solving issues like LUCENE-7903. Phrases will be formatted with a single span per occurrence instead of each word separately. Passage relevancy might be degraded, however, since "freq" isn't calculated. The flag is disabled by default. There were some API changes that are public but internal to the UH, including a new UHComponents class.
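To make the "single span per occurrence" point in the suggested CHANGES.txt entry concrete, here is a minimal sketch. It is not Lucene code: the class `PhraseHighlightSketch`, its `format` method, and the hard-coded character offsets are all hypothetical, chosen only to contrast per-word highlighting with one highlight covering the whole phrase occurrence.

```java
/** Hypothetical illustration only; none of these names exist in Lucene. */
final class PhraseHighlightSketch {

    /**
     * Wraps each [start, end) character range in <b>...</b>.
     * Assumes the spans are sorted by start and do not overlap.
     */
    static String format(String text, int[][] spans) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        for (int[] span : spans) {
            sb.append(text, pos, span[0]);   // text before the match
            sb.append("<b>").append(text, span[0], span[1]).append("</b>");
            pos = span[1];
        }
        sb.append(text, pos, text.length()); // trailing text after the last match
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox";
        // Per-term highlighting of the phrase "quick brown": one span per word.
        System.out.println(format(text, new int[][] {{4, 9}, {10, 15}}));
        // Match-based highlighting: one span for the whole phrase occurrence.
        System.out.println(format(text, new int[][] {{4, 15}}));
    }
}
```

The first call produces two separate highlights and the second produces one; the difference is purely in which offsets the highlighter is handed, not in the formatting step.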
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578835#comment-16578835 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

Actually, before continuing with any of that, I think the PR is *almost* good enough for this new mode. It's not on by default, so you have to opt in. If you do opt in, you get:
* Better matching accuracy, particularly with nested conjunction/disjunction (solving LUCENE-7903).
* Phrase queries will have highlights spanning more naturally instead of per-term. Cosmetic but nice. SpanQuery nested stuff is as before in this regard, though.
* Passage scoring won't be as good due to a constant freq(). Some users won't care; arguably diversity of terms is more important, particularly in a snippet. Note that consideration of freq() can be dialed down to nothing by setting the "k1" BM25 param of DefaultPassageScorer to 0, and this is in fact tested as having such an effect.

The only thing still needed that isn't too disruptive is implementing OffsetsEnum.getTerm(). The current nocommit of an empty term is bad because the term is also used in DefaultPassageScorer to calculate per-term stats. So this ought to return the actual term, although it'd be fine if it were actually a query.toString() in truth. In the interest of getting an experimental feature out the door, I think I'll do the latter. Only someone customizing the scorer or formatter in a way that depends on the nature of the term would be impacted by that.
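The remark about dialing freq() down to nothing with k1 = 0 follows from the shape of BM25-style term-frequency saturation. The sketch below is an assumption-laden illustration, not a copy of DefaultPassageScorer: the class `PassageTfSketch`, the method `tf`, and the parameter values used in `main` are all illustrative.

```java
/** Hypothetical BM25-style saturation sketch; not the Lucene implementation. */
final class PassageTfSketch {

    /**
     * Term-frequency saturation in the BM25 style.
     * pivot is an assumed "average" passage length used for length normalization.
     * Note: freq must be at least 1; with k1 == 0 and freq == 0 this is 0/0 (NaN).
     */
    static float tf(int freq, int passageLen, float k1, float b, float pivot) {
        float norm = k1 * ((1 - b) + b * (passageLen / pivot));
        return freq / (freq + norm);
    }

    public static void main(String[] args) {
        // With k1 > 0, a higher in-passage frequency yields a higher value.
        System.out.println(tf(1, 100, 1.2f, 0.75f, 87f));
        System.out.println(tf(7, 100, 1.2f, 0.75f, 87f));
        // With k1 == 0 the norm term vanishes, so frequency no longer matters.
        System.out.println(tf(1, 100, 0f, 0.75f, 87f));
        System.out.println(tf(7, 100, 0f, 0.75f, 87f));
    }
}
```

With k1 = 0 the normalization term is zero, so the value is 1.0 for any positive frequency, which is why a constant freq() stops affecting the passage ranking in that configuration.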
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16578802#comment-16578802 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

Made substantial progress on the PR:
{noformat}
LUCENE-8286 UH: Use MI.getSubMatches(). Removed PhraseHelper changes; not necessary anymore.
Updated based on MI improvements in master. With subMatches, we have better fidelity on span queries. And since MI can handle span queries now, no need to touch PhraseHelper.
* added to UHComponents: query, and highlightFlags
* updated tests to handle with/without WEIGHT_MATCHES
* TestUnifiedHighlighterStrictPhrases uses more randomization. Removed brittle score calculation dependence.
* Test Passage matches data is in order
TODO: OE freq & term()
{noformat}
It was nice to see that the UH's PhraseHelper can be circumvented now. Handling mi.getSubMatches proved to be difficult, but I ultimately got it working. See https://github.com/dsmiley/lucene-solr/blob/LUCENE-8286/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/OffsetsEnum.java#L168

Next up is handling OffsetsEnum.getTerm(). I could change the API so that getTerm() returns getQuery() and consequently update Passage & PassageScorer. Callers of getTerm() were all internal or considered experimental anyway (definitely not in common use), so I think it could change in a minor release. I hope multi-term query types will be retained as such, but I fear MatchesIterator expands them before retaining the original, and thus the results here won't be ideal, though adequate.

Then, OffsetsEnum.freq(). This one is hard. We could make "-1" mean "unsupported". Then, a new PassageScorer design that is created per highlighted field value could be given access to the IndexReader in org.apache.lucene.search.uhighlight.FieldHighlighter#highlightOffsetsEnums. When it sees -1 at scoring time, it could calculate the in-doc freq and cache it. Or similarly... maybe we don't care that much about the in-doc freq; it may be expensive to calculate anyway. Maybe we want the associated Query's score for this document (which will consider global stats like IDF), but that again needs access to the IndexReader. It'd be nice if boosts wrapped around the query could be considered, but that's just not there (also true without MI mode).
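The "-1 means unsupported, compute and cache on demand" idea for freq() can be sketched independently of Lucene. Everything in this block is hypothetical: the class `LazyFreqSketch`, the `FREQ_UNSUPPORTED` constant, and the `ToIntFunction` that stands in for an in-document frequency lookup against the IndexReader are assumptions for illustration, not the design that was committed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical per-field-value scorer helper; not part of Lucene. */
final class LazyFreqSketch {

    /** Sentinel an offsets source could report when it cannot supply a frequency. */
    static final int FREQ_UNSUPPORTED = -1;

    private final Map<String, Integer> cache = new HashMap<>();
    private final ToIntFunction<String> inDocFreq; // stand-in for a (possibly costly) index lookup
    int lookups = 0;                               // counts how often the lookup actually ran

    LazyFreqSketch(ToIntFunction<String> inDocFreq) {
        this.inDocFreq = inDocFreq;
    }

    /** Returns the reported frequency, or computes and caches it when it is unsupported. */
    int resolveFreq(String term, int reportedFreq) {
        if (reportedFreq != FREQ_UNSUPPORTED) {
            return reportedFreq;
        }
        Integer cached = cache.get(term);
        if (cached == null) {
            lookups++;
            cached = inDocFreq.applyAsInt(term);
            cache.put(term, cached);
        }
        return cached;
    }
}
```

Creating one such helper per highlighted field value keeps the cache scoped to a single document, so the expensive lookup runs at most once per term and only when a scorer actually asks for the frequency.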
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540745#comment-16540745 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

I updated the PR significantly. It addresses requireFieldMatch/fieldMatcher and some other cases. [~romseygeek] you might find it worthwhile to see the changes, as some are applicable to the highlighter you're working on. See OverlaySingleDocTermsLeafReader in particular. The different aspects of the changes were reasonably separated out into separate commits. There are a couple of nocommits. There are a few failing tests, but further substantive progress depends on getting access to the matching terms for passage scoring.
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16521825#comment-16521825 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

I pushed the patch to GitHub as a linked PR. The only change is reverting the choice of having OffsetsEnum implement MatchesIterator as it turned out to be rather pointless.
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16507691#comment-16507691 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

The first patch here is my working WIP. Everything compiles and the results are generally reasonable, notwithstanding some known issues already pointed out in my previous comment. I enabled it by default and then looked to see what tests broke and why:
* TestUnifiedHighlighter: all failures are for the testFieldMatcher methods, since the fieldMatcher mechanism doesn't yet work with this (mentioned in my previous comment).
* TestUnifiedHighlighterMTQ.testWhichMTQMatched: fails because MatchesIterator doesn't yet expose which term matched.
* TestUnifiedHighlighterRanking: fails because the scoring isn't the same.
* TestUnifiedHighlighterTermVec.testFetchTermVecsOncePerDoc: randomly fails because sometimes the underlying fields don't have a real index. The UH highlights one field at a time, and _that_ field being highlighted will be made to appear as indexed if it wasn't already (e.g. re-analysis into MemoryIndex or a TV LeafReader wrapper), but no other fields will be. I think once a solution to fieldMatcher works, it may solve the situation here.
* TestUnifiedHighlighterStrictPhrases: I haven't reviewed each failure yet, but it all seems to be due to the distinction between highlighting the words of a phrase individually and highlighting the phrase span. All the assertions assume individually highlighted words.

What's cool is that this wasn't a big change, and it can be intermixed with SpanQueries. I need to look at the scoring options more -- loss of freq() is a shame.
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462466#comment-16462466 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

The "span" width _could_ be used for passage relevancy, and perhaps ought to be – sure. I just meant to convey that today the UH doesn't have or use this info.

BTW I did a quick hack integration last night of Weight.matches into the UH and ran some tests. I had no issue with term vectors. The fieldMatcher (aka requireFieldMatch option) will require some work. And if the query references non-highlighted fields in a way that will constrain the results (i.e. MUST otherfield:foo), then for the Analysis offset strategy we'll need to combine an aggregate index view of analysis with the underlying real index for other fields, because the MemoryIndex alone only has one field – the field being highlighted.
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462361#comment-16462361 ]

Adrien Grand commented on LUCENE-8286:
--------------------------------------

bq. MI has things we don't need – position spans

I don't know the unified highlighter well, but I would expect this information to be important to score passages? For instance if you run a sloppy phrase query, matches that have a smaller width should get a higher weight, shouldn't they?
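One way to read the suggestion that narrower sloppy-phrase matches deserve a higher weight is an inverse-distance weighting. The sketch below is only an illustration of that idea: the class `SpanWidthWeightSketch`, the `weight` method, and the 1/(1 + extra) formula are assumptions and not something the UnifiedHighlighter does. Positions are assumed to be inclusive.

```java
/** Hypothetical width-based match weighting; not part of Lucene's highlighter. */
final class SpanWidthWeightSketch {

    /**
     * Weight in (0, 1]: 1.0 when the match is exactly as wide as the phrase,
     * decreasing as more extra positions (slop) are covered.
     * startPos and endPos are assumed to be inclusive token positions.
     */
    static float weight(int startPos, int endPos, int phraseLength) {
        int width = endPos - startPos + 1;              // positions covered by the match
        int extra = Math.max(0, width - phraseLength);  // positions beyond an exact match
        return 1f / (1f + extra);
    }

    public static void main(String[] args) {
        // Two-term phrase matched exactly (positions 3..4).
        System.out.println(weight(3, 4, 2));
        // Same phrase matched sloppily across positions 3..6.
        System.out.println(weight(3, 6, 2));
    }
}
```

A passage scorer could multiply such a weight into a match's contribution, so that a tight occurrence of the phrase outranks a loose one in an otherwise similar passage.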
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461646#comment-16461646 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

{quote}there is no way to retrieve the term/query in the matches iterator
{quote}
Oh I see – this was removed in LUCENE-8270! I was loosely following the related issues but overlooked that. [~romseygeek] the statement in the description "we don't have a clear use-case for this yet" surprises me; it's clearly _highlighting_; no? Despite this blocker, maybe I could put together a patch here, one that has poor scoring because we don't know the term, and that will help identify how a matchesIterator.term() could be used?
{quote}One thing we could do to simplify the transition is to remove OffsetsEnum entirely and replace it with the MatchesIterator, apart from the missing bits I described above this should be easy to do.
{quote}
Or make OE extend MatchesIterator? It has things we need – term(), freq(). MI has things we don't need – position spans, but these can be ignored.
{quote}we can't easily use term vectors for a single field with Matches.
{quote}
Interesting; I'll take a closer look.
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461226#comment-16461226 ]

Alan Woodward commented on LUCENE-8286:
---------------------------------------

There's an API mismatch in how offsets are retrieved, per-field in the UnifiedHighlighter and per-leafreader in the Matches API, which means that (for example) we can't easily use term vectors for a single field with Matches. So that will need to be resolved somehow.
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461105#comment-16461105 ]

Jim Ferenczi commented on LUCENE-8286:
--------------------------------------

I also think that it would greatly simplify the code (especially PhraseHelper ;) ) but matches require some changes to allow this replacement.

First of all, there is no way to retrieve the term/query in the matches iterator, so it's not possible to count the number of occurrences of a specific query or the total frequency in the document. This information is needed to compute the score of a passage, so we need to add something to matches.

The matches iterator can return duplicates (if the same term is present in multiple clauses) and will soon be able to return matches from phrases (rather than individual terms), which means that we'll need to detect overlapping intervals when the passages are built. I see this as an improvement since it would allow highlighting entire phrases, but for spans we'll need an option to split match intervals, since a span near (or any other span query) can have big gaps, so it would not make sense to highlight the entire match in a single highlight.

One thing we could do to simplify the transition is to remove OffsetsEnum entirely and replace it with the MatchesIterator; apart from the missing bits I described above, this should be easy to do.
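The need to detect duplicate and overlapping intervals when passages are built can be illustrated with a plain interval merge. This is a minimal sketch under stated assumptions: matches are given as half-open [start, end) character offsets, and the class `MergeOffsetsSketch` and its `merge` method are hypothetical names, not Lucene APIs. It only coalesces overlaps; it does not address the separate concern of splitting wide span matches that contain gaps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for coalescing match offsets; not part of Lucene. */
final class MergeOffsetsSketch {

    /**
     * Merges duplicate and overlapping [start, end) offset ranges.
     * Ranges that merely touch (end == next start) are kept separate.
     */
    static List<int[]> merge(int[][] matches) {
        int[][] sorted = matches.clone(); // sort a copy of the outer array
        Arrays.sort(sorted, (a, b) ->
            a[0] != b[0] ? Integer.compare(a[0], b[0]) : Integer.compare(a[1], b[1]));

        List<int[]> merged = new ArrayList<>();
        for (int[] m : sorted) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && m[0] < last[1]) {
                last[1] = Math.max(last[1], m[1]);      // overlap or duplicate: extend
            } else {
                merged.add(new int[] {m[0], m[1]});     // new, disjoint range (copied)
            }
        }
        return merged;
    }
}
```

For example, a term match at [4, 9) reported by two clauses plus a phrase match at [4, 15) would collapse into a single [4, 15) highlight rather than producing nested or repeated markup.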
[jira] [Commented] (LUCENE-8286) UnifiedHighlighter should support the new Weight.matches API for better match accuracy
[ https://issues.apache.org/jira/browse/LUCENE-8286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459899#comment-16459899 ]

David Smiley commented on LUCENE-8286:
--------------------------------------

Do you perchance have any WIP code for this, or do any concerns come to mind, [~romseygeek]? Perhaps I'll get around to this issue in a couple of weeks.