This is music to my ears. I came across the same highlighter issues and had a 
little discussion about them here: 

http://lucene.markmail.org/thread/ppunujq3hjjzq3z7#query:+page:1+mid:b6jweck6b6m2k4n4+state:results

Unfortunately I didn’t make much progress on it.

-Steve

On Oct 10, 2014, at 12:38 AM, david.w.smi...@gmail.com wrote:

> I’m working on making highlighting both accurate and fast.  By “accurate”, I 
> mean the highlights need to accurately reflect a match given the query and 
> various possible query types (to include SpanQueries and MultiTermQueries and 
> obviously phrase queries and the usual suspects).  The fastest highlighter 
> we’ve got in Lucene is the PostingsHighlighter but it throws out any 
> positional nature in the query and can highlight more inaccurately than the 
> other two highlighters. The most accurate is the default highlighter, 
> although I can see some simplifications it makes that could lead to 
> inaccuracies.
> 
> The default highlighter’s “WeightedSpanTermExtractor” is interesting — it 
> uses a MemoryIndex built from re-analyzing the text, and it executes the 
> query against this mini index; kind of.  A recent experiment I did was to 
> have the MemoryIndex essentially wrap the “Terms” from term vectors.  It 
> works and saves memory, although, at least for large docs (which I’m 
> optimizing for) the real performance hit is in un-inverting the TokenStream 
> in TokenSources to include sorting the thousands of tokens -- assuming you 
> index term vectors of course.  But with my attention now on the 
> PostingsHighlighter (because it’s the fastest and offsets are way cheaper 
> than term vectors), I believe WeightedSpanTermExtractor could simply use 
> Lucene’s actual IndexReader — no?  It seems so obvious to me now I wonder why 
> it wasn’t done this way in the first place — all WSTE has to do is advance() 
> to the document being highlighted for applicable terms.  Am I overlooking 
> something?
> 
> WeightedSpanTermExtractor is somewhat accurate but my reading of its source 
> shows it takes short-cuts I’d like to eliminate.  For example if the query is 
> “(A && B) || (C && D)” and if the document doesn’t have ‘D’ then it should 
> ideally NOT highlight ‘C’ in this document, just ‘A’ and ‘B’.  I think I can 
> solve that using Scorers.getChildScorers to see which scorers (and thus 
> queries) actually matched.  Another example is that it views SpanQueries at 
> the top level only and records the entire span for all terms it is comprised 
> of.  So if you had a couple Phrase SpanQueries (actually ordered 0-slop 
> SpanNearQueries) joined by a SpanNearQuery to be within ~50 positions of each 
> other, I believe it would highlight any other occurrence of the words 
> involved in-between the sub-SpanQueries. This looks hard to solve but I think 
> for starters, SpanScorer needs a getter for the Spans instance, and 
> furthermore Spans needs getChildSpans() just as Scorers expose child scorers. 
>  I could see myself relaxing this requirement because of its complexity and 
> simply highlighting the entire span, even if it could be a big highlight.
> 
> Perhaps the “Nuke Spans” effort might make this all much easier but I haven’t 
> looked yet because that work is still not done.  It’s encouraging to see Alan 
> making recent progress there.
> 
> Any thoughts about any of this, guys?
> 
> p.s. When I’m done, I expect to have no problem getting open-source 
> permission from the sponsor commissioning this effort.
> 
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
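
To make the “(A && B) || (C && D)” case above concrete, here is a minimal sketch of the intended behaviour, written without any Lucene dependency. Everything in it (the Node, Term, And and Or types and the termsToHighlight helper) is hypothetical and invented for illustration; it is not how WeightedSpanTermExtractor works today. In a real implementation the "did this clause match?" answer would presumably come from the child scorers David mentions, not from a set-membership test.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class MatchedClauseSketch {

    /** A toy boolean query node. Hypothetical type, NOT a Lucene class. */
    interface Node {
        boolean matches(Set<String> docTerms);
        void collectMatchedTerms(Set<String> docTerms, Set<String> out);
    }

    /** Leaf: matches when the document contains the term. */
    static final class Term implements Node {
        final String text;
        Term(String text) { this.text = text; }
        public boolean matches(Set<String> docTerms) { return docTerms.contains(text); }
        public void collectMatchedTerms(Set<String> docTerms, Set<String> out) {
            if (matches(docTerms)) out.add(text);
        }
    }

    /** Conjunction: contributes terms only if every clause matched. */
    static final class And implements Node {
        final List<Node> clauses;
        And(Node... clauses) { this.clauses = Arrays.asList(clauses); }
        public boolean matches(Set<String> docTerms) {
            for (Node c : clauses) {
                if (!c.matches(docTerms)) return false;
            }
            return true;
        }
        public void collectMatchedTerms(Set<String> docTerms, Set<String> out) {
            if (!matches(docTerms)) return; // a failed conjunction highlights nothing
            for (Node c : clauses) c.collectMatchedTerms(docTerms, out);
        }
    }

    /** Disjunction: each clause decides for itself whether it contributes. */
    static final class Or implements Node {
        final List<Node> clauses;
        Or(Node... clauses) { this.clauses = Arrays.asList(clauses); }
        public boolean matches(Set<String> docTerms) {
            for (Node c : clauses) {
                if (c.matches(docTerms)) return true;
            }
            return false;
        }
        public void collectMatchedTerms(Set<String> docTerms, Set<String> out) {
            for (Node c : clauses) c.collectMatchedTerms(docTerms, out);
        }
    }

    /** Returns the terms that belong to clauses that actually matched. */
    static Set<String> termsToHighlight(Node query, Set<String> docTerms) {
        Set<String> out = new TreeSet<>();
        query.collectMatchedTerms(docTerms, out);
        return out;
    }

    public static void main(String[] args) {
        Node query = new Or(
                new And(new Term("A"), new Term("B")),
                new And(new Term("C"), new Term("D")));
        // The document contains A, B and C, but not D.
        Set<String> doc = new TreeSet<>(Arrays.asList("A", "B", "C"));
        System.out.println(termsToHighlight(query, doc)); // prints [A, B]
    }
}
```

A term-only extractor would report A, B and C for this document; walking the query tree and skipping clauses that failed is what keeps ‘C’ out of the highlights.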
