Greets, I'm currently contemplating how to implement a Highlighter along the lines of what's described in JIRA issue LUCENE-1522.
https://issues.apache.org/jira/browse/LUCENE-1522 The tricky part of highlighting is selecting one or more excerpts. Unless we end up with a very strange excerpting algorithm that loses positional information, applying the actual highlighting tags at the end will be a straightforward, mechanical process. To select the best fragments, we will create objects representing candidates and assemble them in a PriorityQueue. The queue's Less_Than() method will favor "better" fragments. Whether Less_Than() performs real analysis on the fly, or whether we score the fragments in advance and have Less_Than simply compare fixed scores has yet to be determined, but the important point is that we must design our OO hierarchy and fragment ranking infrastructure to support as many excerpting algorithms as possible. There will be two main sources of input to the fragment ranker. * Information from the Query about where it matched. * Sentence boundary information. Right now in the KS implementation, sentence boundary information is calculated on the fly at runtime, via Highlighter_Find_Sentences(). However, this seems wasteful, because sentence boundaries can be known at index-time. Perhaps we ought to be storing sentence boundary information in the index. Information from the query about where it matched can be gleaned from the weighted-query class. In Lucene, this class goes by the truly dreadful name of "Weight"; in KinoSearch, it goes by the still-dreadful-but-different name "Compiler". We need to use Weight/Compiler rather than Query because some algorithms (e.g. current KS) depend on IDF, which is only known after weighting the Query against a given collection of documents to produce a Weight/Compiler. The current KS implementation, Compiler's Highlight_Spans() method, returns a VArray of Span objects, each of which has an offset (measured in Unicode code points from the top of the field), a length (also a count of Unicode code points), and a floating point "weight". Compound queries such as ANDQuery and ORQuery simply produce a VArray which unions the Spans produced by their child nodes. I have an intuitive feeling that an array of weighted score spans will be useful in other contexts besides highlighting, and the method is pretty easy to grok. However, this simple algo has a flaw: it's not clear what part of the query produced each score span. Some algorithms will want to influence selection of fragments based on term diversity, so that for example, multiple fragments would be preferred over a single fragment if they represented different parts of the query. Perhaps if each Span were to include a reference to the original Query object which produced it? These would be primitives such as TermQuery and PhraseQuery rather than compound queries like ANDQuery. Would that reference be enough to implement a preference for term diversity in the excerpting algo? And might that information come in handy for other excerpting algos? Marvin Humphrey
