I have a requirement to highlight search results, and to display
documents with matching terms highlighted in the context of the original
XML document structure.

It seems like this must be a very common use case, but I am having
trouble finding a way to accomplish what we need to do using Solr and/or
Lucene. Using the standard highlighting support in Solr, we have been
able to retrieve KWIC (keyword-in-context) text fragments for search
results, which is great. But what we would ideally like to do is to
apply similar highlighting logic while preserving the original document
structure.

1) When the user selects a matching document, we render it as HTML with
paragraphs, headers, text styles such as italics, and so on, so we need
to highlight either the rendered HTML or the original XML and then
process that. We need to find the text fragments that matched the
original query and highlight those. This has to use the same logic
that Solr/Lucene uses for searching, so that tokenization and analysis
are applied properly and query semantics are respected: if the
original query was a phrase query, only phrases should match, and so on.
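
As a rough illustration of requirement 1, here is a minimal Python
sketch of the markup side of the problem. It assumes the match spans
have already been computed by the search engine as character offsets
into the tag-stripped text; how to obtain those spans is exactly the
open question. The function names and the <mark> wrapper are
hypothetical, and the regex-based tag scan ignores entities, comments,
and CDATA.

```python
import re

# Naive tag matcher; an assumption for this sketch, not a real XML parser.
TAG = re.compile(r"<[^>]*>")

def mark_run(run, base, spans, open_mark, close_mark):
    """Wrap the parts of one text run that intersect any match span.

    `base` is the offset of `run` within the tag-stripped text.
    """
    pieces = []
    cursor = 0
    for start, end in spans:
        lo = max(start - base, 0)
        hi = min(end - base, len(run))
        if lo >= hi:
            continue  # this span does not touch the current run
        pieces.append(run[cursor:lo])
        pieces.append(open_mark + run[lo:hi] + close_mark)
        cursor = hi
    pieces.append(run[cursor:])
    return "".join(pieces)

def highlight(xml, spans, open_mark="<mark>", close_mark="</mark>"):
    """Highlight `spans` (sorted, non-overlapping offsets into the
    tag-stripped text) inside the original markup.

    A match that crosses a tag boundary is wrapped once per text run,
    so the output stays well-formed.
    """
    out = []
    text_pos = 0  # position in the tag-stripped text
    xml_pos = 0   # position in the original markup
    for m in list(TAG.finditer(xml)) + [None]:
        run_end = m.start() if m else len(xml)
        run = xml[xml_pos:run_end]
        out.append(mark_run(run, text_pos, spans, open_mark, close_mark))
        text_pos += len(run)
        if m:
            out.append(m.group(0))  # copy the tag through unchanged
            xml_pos = m.end()
    return "".join(out)
```

For example, with the phrase "quick fox" matched at offsets 4 to 13 of
the stripped text "The quick fox",
highlight("<p>The <i>quick</i> fox</p>", [(4, 13)]) gives
"<p>The <i><mark>quick</mark></i><mark> fox</mark></p>".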

2) In addition, we also want to be able to display KWIC phrases that are
rendered with type styles based on the original XML. This requires some
XML tree surgery in order to pull out a fragment of a structured
document while preserving enough XML structure to render type styles,
which we can do, but it also requires a mapping of matching tokens back
into the original document.
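
To show the kind of tree surgery meant in item 2, here is a small
Python sketch that cuts a character window out of the markup and
re-adds whatever opening and closing tags are needed to keep the
fragment well-formed. It is only a sketch under assumptions: the
function name is hypothetical, the input is assumed to be well-formed,
the window boundaries are assumed not to fall inside a tag, and tags
are found with a simple regex rather than a real XML parser.

```python
import re

# Groups: closing slash, element name, self-closing slash.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>")

def balanced_fragment(xml, start, end):
    """Return xml[start:end] with enclosing tags re-added.

    Elements already open at `start` are re-opened in front of the
    fragment, and elements still open at `end` are closed after it.
    """
    stack = []     # (name, opening tag text) of elements currently open
    prefix = None  # opening tags in effect at `start`
    for m in TAG.finditer(xml):
        if m.start() >= end:
            break
        if prefix is None and m.start() >= start:
            prefix = "".join(tag for _, tag in stack)
        closing, name, self_closing = m.groups()
        if self_closing:
            continue  # <br/> and similar do not change nesting
        if closing:
            stack.pop()
        else:
            stack.append((name, m.group(0)))
    if prefix is None:
        # No tag inside the window: the stack is the state at `start`.
        prefix = "".join(tag for _, tag in stack)
    suffix = "".join("</%s>" % name for name, _ in reversed(stack))
    return prefix + xml[start:end] + suffix
```

For example, cutting characters 16 to 29 ("brown</i> fox") out of
"<p>The <i>quick brown</i> fox jumps</p>" gives
"<p><i>brown</i> fox</p>", so the italic style survives in the KWIC
fragment.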

I am hoping this is a solved problem, but if not, I'd also be interested
in pointers to the best places to start an implementation. I think the
problem at base is to maintain a map relating positions of matching
terms in the indexed and stored field in Lucene to the corresponding
positions in the original XML document. Ideally the original positions
could be stored directly in term vectors, but they could also be
translated at render/highlight time using an additional lookup.
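
The "additional lookup" variant of that map could look something like
the following Python sketch: strip the tags to get the text that would
be indexed, and record for each character of that text its offset in
the original XML. The names are hypothetical, and the sketch assumes
the indexed text is exactly the markup minus tags (no entity decoding
or whitespace normalization), which a real pipeline would have to
account for.

```python
import re

# Naive tag matcher; an assumption for this sketch, not a real XML parser.
TAG = re.compile(r"<[^>]*>")

def strip_with_offsets(xml):
    """Return (text, offsets), where offsets[i] is the index in `xml`
    of the character text[i]."""
    text = []
    offsets = []
    pos = 0
    for m in list(TAG.finditer(xml)) + [None]:
        run_end = m.start() if m else len(xml)
        for i in range(pos, run_end):
            text.append(xml[i])
            offsets.append(i)
        if m:
            pos = m.end()  # skip over the tag itself
    return "".join(text), offsets

def to_xml_span(offsets, start, end):
    """Map a non-empty [start, end) span in the stripped text to a
    [start, end) span in the original XML."""
    return offsets[start], offsets[end - 1] + 1
```

For "<p>The <i>quick</i> fox</p>" the stripped text is "The quick fox";
the term "quick" sits at offsets 4 to 9 there, and to_xml_span maps it
to offsets 10 to 15 of the original markup.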

I see code in org.apache.lucene.search.highlight in Solr and also
something in lucene/contrib/highlighter. Is that the state of the art
now, or is there anywhere else I should be looking as well?

Thanks for any pointers.

-Mike Sokolov