This is music to my ears. I came across the same highlighter issues and had a little discussion about them here:
http://lucene.markmail.org/thread/ppunujq3hjjzq3z7#query:+page:1+mid:b6jweck6b6m2k4n4+state:results

Unfortunately I didn’t make much progress on it.

-Steve

On Oct 10, 2014, at 12:38 AM, david.w.smi...@gmail.com wrote:

> I’m working on making highlighting both accurate and fast. By “accurate”, I
> mean the highlights need to accurately reflect a match given the query and
> various possible query types (to include SpanQueries and MultiTermQueries and
> obviously phrase queries and the usual suspects). The fastest highlighter
> we’ve got in Lucene is the PostingsHighlighter but it throws out any
> positional nature in the query and can highlight more inaccurately than the
> other two highlighters. The most accurate is the default highlighter,
> although I can see some simplifications it makes that could lead to
> inaccuracies.
>
> The default highlighter’s “WeightedSpanTermExtractor” is interesting -- it
> uses a MemoryIndex built from re-analyzing the text, and it executes the
> query against this mini index, kind of. A recent experiment I did was to
> have the MemoryIndex essentially wrap the “Terms” from term vectors. It
> works and saves memory, although, at least for large docs (which I’m
> optimizing for) the real performance hit is in un-inverting the TokenStream
> in TokenSources to include sorting the thousands of tokens -- assuming you
> index term vectors of course. But with my attention now on the
> PostingsHighlighter (because it’s the fastest and offsets are way cheaper
> than term vectors), I believe WeightedSpanTermExtractor could simply use
> Lucene’s actual IndexReader -- no? It seems so obvious to me now I wonder why
> it wasn’t done this way in the first place -- all WSTE has to do is advance()
> to the document being highlighted for applicable terms. Am I overlooking
> something?
>
> WeightedSpanTermExtractor is somewhat accurate but my reading of its source
> shows it takes short-cuts I’d like to eliminate. For example, if the query is
> “(A && B) || (C && D)” and if the document doesn’t have ‘D’ then it should
> ideally NOT highlight ‘C’ in this document, just ‘A’ and ‘B’. I think I can
> solve that using Scorers.getChildScorers to see which scorers (and thus
> queries) actually matched. Another example is that it views SpanQueries at
> the top level only and records the entire span for all terms it is comprised
> of. So if you had a couple Phrase SpanQueries (actually ordered 0-slop
> SpanNearQueries) joined by a SpanNearQuery to be within ~50 positions of each
> other, I believe it would highlight any other occurrence of the words
> involved in-between the sub-SpanQueries. This looks hard to solve but I think
> for starters, SpanScorer needs a getter for the Spans instance, and
> furthermore Spans needs getChildSpans() just as Scorers expose child scorers.
> I could see myself relaxing this requirement because of its complexity and
> simply highlighting the entire span, even if it could be a big highlight.
>
> Perhaps the “Nuke Spans” effort might make this all much easier but I haven’t
> looked yet because that’s still not done. It’s encouraging to see Alan
> making recent progress there.
>
> Any thoughts about any of this, guys?
>
> p.s. When I’m done, I expect to have no problem getting open-source
> permission from the sponsor commissioning this effort.
>
> ~ David Smiley
> Freelance Apache Lucene/Solr Search Consultant/Developer
> http://www.linkedin.com/in/davidwsmiley
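
To make the “(A && B) || (C && D)” example from the quoted message concrete, here is a minimal sketch in plain Java. It is only an illustration of the idea of highlighting terms from the sub-queries that actually matched, and it does not use Lucene's API at all: Node, term, and, or and termsToHighlight are hypothetical names invented for this sketch, and a document is modelled as a bare set of terms with no positions or offsets.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: a toy boolean query tree, NOT Lucene code.
final class MatchedTermsSketch {

    /** Hypothetical query node. Returns true if it matched the document,
     *  and adds the terms that contributed to the match to 'out'. */
    interface Node {
        boolean collect(Set<String> docTerms, Set<String> out);
    }

    /** Leaf: matches when the document contains the term. */
    static Node term(String t) {
        return (docTerms, out) -> {
            if (!docTerms.contains(t)) return false;
            out.add(t);
            return true;
        };
    }

    /** Conjunction: contributes its terms only if every child matched. */
    static Node and(Node... children) {
        return (docTerms, out) -> {
            Set<String> local = new TreeSet<>();
            for (Node c : children) {
                if (!c.collect(docTerms, local)) return false; // one miss: contribute nothing
            }
            out.addAll(local);
            return true;
        };
    }

    /** Disjunction: contributes the terms of each child that matched. */
    static Node or(Node... children) {
        return (docTerms, out) -> {
            boolean matched = false;
            for (Node c : children) {
                Set<String> local = new TreeSet<>();
                if (c.collect(docTerms, local)) {
                    out.addAll(local);
                    matched = true;
                }
            }
            return matched;
        };
    }

    /** Terms worth highlighting for this query in this document. */
    static Set<String> termsToHighlight(Node query, Set<String> docTerms) {
        Set<String> out = new TreeSet<>();
        query.collect(docTerms, out);
        return out;
    }

    public static void main(String[] args) {
        Node query = or(and(term("A"), term("B")), and(term("C"), term("D")));
        Set<String> doc = new TreeSet<>(Arrays.asList("A", "B", "C")); // no 'D'
        System.out.println(termsToHighlight(query, doc)); // prints [A, B]
    }
}
```

The conjunction collects into a local set first, so a failed “C && D” clause leaves ‘C’ out of the result even though ‘C’ occurs in the document. A real implementation would presumably get the same information from the scorers that matched rather than from re-evaluating the query like this.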