I have a requirement to highlight search results, and to display documents with matching terms highlighted in the context of the original XML document structure.

It seems like this must be a very common use case, but I am having trouble finding a way to accomplish it using Solr and/or Lucene. Using the standard highlighting support in Solr, we have been able to retrieve KWIC (keyword-in-context) text fragments for search results, which is great. But what we would ideally like to do is to apply similar highlighting logic while preserving the original document structure.

1) When the user selects a matching document, we render it as HTML with paragraphs, headers, text styles such as italics, and so on, so we need to highlight either the rendered HTML or the original XML and then process that. We need to find the text fragments that matched the original query and highlight those. This has to use the same logic that Solr/Lucene uses for searching, so that tokenization and analysis are applied properly and query semantics are respected: if the original query was a phrase query, only phrases should match, and so on.

2) In addition, we want to be able to display KWIC phrases that are rendered with type styles based on the original XML. This requires some XML tree surgery to pull out a fragment of a structured document while preserving enough XML structure to render type styles, which we can do. But it also requires a mapping of matching tokens back into the original document.

I am hoping this is a solved problem, but if not, I'd also be interested in pointers to the best places to start an implementation. At base, I think the problem is to maintain a map relating positions of matching terms in the indexed and stored field in Lucene to the corresponding positions in the original XML document. Ideally the original positions could be stored directly in term vectors, but they could also be translated at render/highlight time using an additional lookup.
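
To make the "additional lookup" idea concrete, here is a minimal sketch in Python. It is not Lucene or Solr code: the function names (strip_tags_with_offsets, xml_runs, highlight) and the <hit> element are hypothetical, the tag stripper is a naive regex that ignores entities, comments and CDATA, and the match span is found with a plain substring search standing in for whatever character offsets the highlighter or term vectors would report. It only shows how an offset map built while stripping markup can carry a match in the plain text back into the XML without breaking well-formedness.

```python
import re

# Naive tag matcher (assumption): no comments, CDATA, or '>' inside attributes.
TAG = re.compile(r"<[^>]*>")

def strip_tags_with_offsets(xml):
    """Return (text, offsets), where offsets[i] is the index in xml of text[i]."""
    chars, offsets, pos = [], [], 0
    for m in TAG.finditer(xml):
        for i in range(pos, m.start()):
            chars.append(xml[i])
            offsets.append(i)
        pos = m.end()
    for i in range(pos, len(xml)):
        chars.append(xml[i])
        offsets.append(i)
    return "".join(chars), offsets

def xml_runs(offsets, start, end):
    """Split the text span [start, end) into contiguous (begin, end) runs in the XML.

    A new run starts wherever a tag interrupted the text, so each run can be
    wrapped on its own without crossing an element boundary.
    """
    runs = []
    run_start = prev = offsets[start]
    for i in range(start + 1, end):
        cur = offsets[i]
        if cur != prev + 1:
            runs.append((run_start, prev + 1))
            run_start = cur
        prev = cur
    runs.append((run_start, prev + 1))
    return runs

def highlight(xml, spans, offsets, open_tag="<hit>", close_tag="</hit>"):
    """Wrap non-overlapping plain-text spans in tags inside the original XML."""
    runs = []
    for start, end in spans:
        runs.extend(xml_runs(offsets, start, end))
    # Apply right to left so earlier XML offsets stay valid.
    for begin, stop in sorted(runs, reverse=True):
        xml = xml[:begin] + open_tag + xml[begin:stop] + close_tag + xml[stop:]
    return xml

xml = "<p>The <i>quick</i> brown fox</p>"
text, offsets = strip_tags_with_offsets(xml)

# Stand-in for the match offsets a real highlighter would supply.
phrase = "quick brown"
start = text.index(phrase)
result = highlight(xml, [(start, start + len(phrase))], offsets)
print(result)
# <p>The <i><hit>quick</hit></i><hit> brown</hit> fox</p>
```

Splitting a match into one run per uninterrupted stretch of text is what keeps a phrase that crosses the </i> boundary from producing overlapping elements. The cost is that a single logical match may become several <hit> elements.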

I see code in org.apache.lucene.search.highlight in Solr and also something in lucene/contrib/highlighter. Is that the state of the art now, or is there anywhere else I should be looking as well?

Thanks for any pointers

-Mike Sokolov
