Hi all. I'm trying to find a precise and reasonably efficient way to highlight all occurrences of the terms in a query, highlighting only in the fields which match the corresponding fields used in the query. This seems like it would be a fairly common requirement in applications. We have an existing implementation, but it works by re-reading the entire text back through the analyser. This is slow for large text, and sometimes we analyse the same text in two different ways, and both variants could well be in the query. So I'm looking for a shortcut.
Perhaps due to the name, Lucene's highlighter module got my attention, so I tried using that. The prototype I wrote *did* produce acceptable results for the highlighting itself, but when it came time to think about integrating it into the real application, there didn't seem to be a single part of the highlighter API designed to allow for that. So I guess I will be forced to categorise lucene-highlighter as a "toy", or perhaps as a fairly complete example of how to do highlighting, and it might be useful for that at least.

What's wrong with the API?

Issue #1 - The API forces me to pass in a String.
Just because the highlighter wants some character data, I have to pass a String. Text can be very large and I would rather not have to wait for the entire text to be read into memory before I can pass it off to the highlighter. String is a final class, so any API which requires it for feeding in something like character data is committing a massive sin, in my opinion. If your text is in a database, you will have to retrieve *all* of the text before you can use *any* of it for highlighting. Had the API accepted something like Reader, CharBuffer or even CharSequence, there would be no problem. We could make an alternative implementation which reads directly from whatever storage the text is in.

I notice that PostingsHighlighter has improved on this, by removing the need for the text entirely. That's awesome, actually, but we can't use it. We're stuck on version 3.6.2 as we are expected to be able to open indexes created in 2.x. Plus, all our existing indexes lack the required level of indexing to use it, and reindexing is not yet an option. (Even if we get lucky enough to update to Lucene 4, I will probably have to write a codec to read Lucene 2 indexes...)

Issue #2 - The API returns all results as String.
To actually integrate a highlighter, the absolute offsets are the bare minimum requirement to highlight the text:
http://docs.oracle.com/javase/7/docs/api/javax/swing/text/Highlighter.html
But the highlighter API only returns results as String. Even if there were enough information in the string (and I don't think there is!), getting the results back as String is what I call the "pseudo-API anti-pattern." I shouldn't have to parse values out of a string which the API I'm calling just formatted into it. In this particular instance, it would have been nice to have a way to programmatically get the offsets of the highlights in each fragment.

As for our own requirements, the bit about computing the fragments is completely unnecessary. We have a piece of view-time logic which figures out the fragments based on where the highlights are vertically. This works better than using text proximity, because using text proximity causes the number of highlighted lines to visibly shuffle, whereas consistently showing the same number of lines above and below produces an effect similar to resizing a text editor window, which people should already be used to.

For getting the highlights themselves, is there any faster way than reading the whole text every time you want to run it?

TX
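
P.S. To make the two issues concrete, here is a minimal sketch of the API shape being asked for: character data comes in through a Reader (Issue #1) and the highlights come back as absolute offsets rather than a formatted String (Issue #2). Everything in it is hypothetical. OffsetHighlightSketch, Span and findHighlights are made-up names, not part of lucene-highlighter, and the letter/digit splitting with lower-casing is only a stand-in for a real analyser. It also still reads the whole text once, so it illustrates the interface, not the shortcut.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these names exist in lucene-highlighter.
class OffsetHighlightSketch {

    /** Absolute character offsets of one highlight: start inclusive, end exclusive. */
    static final class Span {
        final int start;
        final int end;
        Span(int start, int end) { this.start = start; this.end = end; }
        @Override public String toString() { return "[" + start + "," + end + ")"; }
    }

    /**
     * Streams characters from the Reader, splits on anything that is not a
     * letter or digit, lower-cases each token (a stand-in for a real analyser)
     * and records the offsets of every token contained in the term set.
     */
    static List<Span> findHighlights(Reader text, Set<String> lowerCaseTerms) throws IOException {
        List<Span> spans = new ArrayList<Span>();
        StringBuilder token = new StringBuilder();
        int offset = 0;      // offset of the character currently being examined
        int tokenStart = 0;  // offset at which the current token began
        int c;
        while ((c = text.read()) != -1) {
            if (Character.isLetterOrDigit((char) c)) {
                if (token.length() == 0) tokenStart = offset;
                token.append(Character.toLowerCase((char) c));
            } else {
                flush(token, tokenStart, lowerCaseTerms, spans);
            }
            offset++;
        }
        flush(token, tokenStart, lowerCaseTerms, spans); // last token, if any
        return spans;
    }

    /** Emits a span if the finished token is a query term, then resets the buffer. */
    private static void flush(StringBuilder token, int start, Set<String> terms, List<Span> out) {
        if (token.length() > 0) {
            if (terms.contains(token.toString())) {
                out.add(new Span(start, start + token.length()));
            }
            token.setLength(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Set<String> terms = new HashSet<String>(Arrays.asList("lucene", "highlighter"));
        Reader text = new StringReader("Lucene's highlighter module; lucene again.");
        System.out.println(findHighlights(text, terms)); // prints [[0,6), [9,20), [29,35)]
    }
}
```

Offsets in this form could be handed straight to something like javax.swing.text.Highlighter.addHighlight(int, int, HighlightPainter), with no fragments and no string parsing involved.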