Highlighting text, do I seriously have to reimplement this from scratch?

Trejkaz Mon, 03 Feb 2014 22:21:28 -0800

Hi all.

I'm trying to find a precise and reasonably efficient way to highlight
all occurrences of terms in the query, only highlighting fields which
match the corresponding fields used in the query. This seems like it
would be a fairly common requirement in applications. We have an
existing implementation, but it works by re-reading the entire text
back through the analyser. This is slow for large text, and sometimes
we analyse the same text twice - and both variants could well be in
the query. So I'm looking for a shortcut.


Perhaps due to the name, Lucene's highlighter module got my attention,
so I tried using that. The prototype I wrote *did* produce acceptable
results for the highlighting itself, but when it came time to think
about integrating into the real application, there didn't seem to be a
single part of the highlighter API designed to allow for that.

So I guess I will be forced to categorise lucene-highlighter as a
"toy", or perhaps as a fairly complete example of how to do
highlighting, and it might be useful for that at least.

What's wrong with the API?


Issue #1 - The API forces me to pass in a String.

Just because the highlighter wants some character data, I have to pass
a String. Text can be very large and I would rather not have to wait
for the entire text to be read into memory before I can pass it off to
the highlighter.

String is a final class, so any API which requires it for feeding in
something like character data is committing a massive sin, in my
opinion. If your text is in a database, you will have to retrieve
*all* of the text before you can use *any* of it for highlighting.

Had the API accepted something like Reader, CharBuffer or even
CharSequence, there would be no problem. We could make an alternative
implementation which reads directly from whatever storage it's in.
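
To make the Reader idea concrete, here is a minimal sketch of what a
streaming, offset-returning scan could look like using only the JDK.
Everything in it is hypothetical: StreamingTermScanner and findOffsets
are made-up names, not part of Lucene, and the "split on
non-alphanumerics and lowercase" rule is only a stand-in for whatever
a real analyser does. The point is just that only the current word is
buffered, never the whole text, and that the result is absolute
offsets rather than a formatted String.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Lucene.
class StreamingTermScanner {

    /**
     * Reads text from a Reader and returns {start, end} offsets (end exclusive,
     * in UTF-16 chars from the start of the stream) of every word contained in
     * queryTerms. queryTerms is assumed to hold lowercase terms.
     */
    static List<int[]> findOffsets(Reader text, Set<String> queryTerms) {
        List<int[]> hits = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        int offset = 0;      // absolute offset of the char being examined
        int wordStart = -1;  // absolute offset where the current word began
        try {
            int c;
            while ((c = text.read()) != -1) {
                if (Character.isLetterOrDigit((char) c)) {
                    if (word.length() == 0) {
                        wordStart = offset;
                    }
                    word.append(Character.toLowerCase((char) c));
                } else {
                    flush(word, wordStart, queryTerms, hits);
                }
                offset++;
            }
        } catch (IOException e) {
            // Rethrown unchecked to keep the sketch's signature simple.
            throw new UncheckedIOException(e);
        }
        flush(word, wordStart, queryTerms, hits);  // last word, if any
        return hits;
    }

    // Records the buffered word if it is a query term, then clears the buffer.
    private static void flush(StringBuilder word, int start,
                              Set<String> terms, List<int[]> hits) {
        if (word.length() > 0) {
            if (terms.contains(word.toString())) {
                hits.add(new int[] {start, start + word.length()});
            }
            word.setLength(0);
        }
    }
}
```

In practice the Reader would want to be wrapped in a BufferedReader,
since this reads one char at a time.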

I notice that PostingsHighlighter has improved on this, by removing
the need for the text entirely. That's awesome, actually, but we can't
use it. We're stuck on version 3.6.2 as we are expected to be able to open
indexes created in 2.x. Plus, all our existing indexes lack the
required level of indexing to use it, and reindexing is not yet an
option. (Even if we get lucky enough to update to Lucene 4, I will
probably have to write a codec to read Lucene 2 indexes...)


Issue #2 - The API returns all results as String.

To actually integrate a highlighter into a UI, absolute offsets are
the bare minimum needed to highlight the text. Swing's Highlighter,
for example, takes start and end offsets:

    http://docs.oracle.com/javase/7/docs/api/javax/swing/text/Highlighter.html

But the highlighter API only returns results as String.

Even if there were enough information in the string (and I don't think
there is!), getting the results back as String is what I call the
"pseudo-API anti-pattern." I shouldn't have to parse values out of a
string which the API I'm calling just formatted into it. In this
particular instance, it would have been nice to have a way to
programmatically get the offset of the highlights in each fragment.
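
To illustrate what that anti-pattern pushes onto the caller, here is a
rough sketch of parsing offsets back out of a marked-up fragment. It
is JDK-only; MarkedUpFragmentParser and recoverOffsets are
hypothetical names, and the markers are passed in as parameters (for
example "<B>" and "</B>"). Note what it can and cannot recover: the
offsets are relative to the fragment with markers stripped, not to the
original text, and it breaks if the text itself happens to contain the
marker strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the workaround; not part of any Lucene API.
class MarkedUpFragmentParser {

    /**
     * Returns {start, end} offsets (end exclusive) of each highlighted run,
     * relative to the fragment as it would read with all markers removed.
     */
    static List<int[]> recoverOffsets(String fragment, String pre, String post) {
        List<int[]> spans = new ArrayList<>();
        int pos = 0;      // search position in the marked-up fragment
        int removed = 0;  // marker characters skipped so far
        while (true) {
            int open = fragment.indexOf(pre, pos);
            if (open < 0) {
                break;
            }
            int close = fragment.indexOf(post, open + pre.length());
            if (close < 0) {
                break;  // unbalanced marker: stop rather than guess
            }
            int start = open - removed;
            removed += pre.length();
            int end = close - removed;
            removed += post.length();
            spans.add(new int[] {start, end});
            pos = close + post.length();
        }
        return spans;
    }
}
```

For "the <B>quick</B> brown <B>fox</B>" this gives offsets into "the
quick brown fox", which still says nothing about where that fragment
sits in the full document.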


As for our own requirements, the bit about computing the fragments is
completely unnecessary. We have a piece of view-time logic which
figures out the fragments based on where the highlights are
vertically. This works better than using text proximity, because using
text proximity causes the number of highlighted lines to visibly
shuffle, whereas consistently showing the same number of lines above
and below produces an effect similar to resizing a text editor window,
which people should already be used to.
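
As a rough illustration of that line-based approach (a guess at the
general shape, not our actual code), the sketch below takes the line
numbers that contain highlights and a fixed number of context lines,
and produces the line ranges to display, merging ranges that overlap
or touch. LineContextFragments and compute are hypothetical names, and
it assumes the highlighted line indices are 0-based and already sorted
in ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of view-time fragment selection by line.
class LineContextFragments {

    /**
     * highlightedLines: 0-based line indices containing highlights, ascending.
     * context: number of lines to show above and below each highlighted line.
     * lineCount: total number of lines in the view.
     * Returns merged {firstLine, lastLine} ranges, both ends inclusive.
     */
    static List<int[]> compute(int[] highlightedLines, int context, int lineCount) {
        List<int[]> ranges = new ArrayList<>();
        for (int line : highlightedLines) {
            // Clamp the context window to the bounds of the text.
            int first = Math.max(0, line - context);
            int last = Math.min(lineCount - 1, line + context);
            int[] prev = ranges.isEmpty() ? null : ranges.get(ranges.size() - 1);
            if (prev != null && first <= prev[1] + 1) {
                // Overlaps or touches the previous range: extend it.
                prev[1] = Math.max(prev[1], last);
            } else {
                ranges.add(new int[] {first, last});
            }
        }
        return ranges;
    }
}
```

With highlights on lines 1, 3 and 20 of a 22-line text and two lines
of context, this yields the ranges 0 to 5 and 18 to 21.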

For getting the highlights themselves, is there any faster way than
reading the whole text every time you want to run it?

TX
