Hi all.
I've just implemented some magic query syntax which expands simple
queries into queries containing whole lists of words.
I've implemented the queries themselves using a slight variation on
QueryFilter: a MultiQueryFilter, which runs all of the queries to mark
a single bitset. This is much faster than applying a logical OR to many
QueryFilter bitsets, and uses much less memory than a single
QueryFilter wrapped around an enormous BooleanQuery.
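To illustrate the single-bitset idea without pulling in Lucene, here is a minimal sketch in plain Java. DocIdSource is a made-up stand-in for "something that can report its matching document ids", and SingleBitSetSketch is a hypothetical name; neither is a Lucene class, and the real MultiQueryFilter is not shown here.

```java
import java.util.BitSet;
import java.util.List;

/** Hypothetical stand-in for a query: reports the ids of the documents it matches. */
interface DocIdSource {
    int[] matchingDocs();
}

class SingleBitSetSketch {
    /** Runs every query in turn and marks its hits in one shared bitset. */
    static BitSet markAll(List<DocIdSource> queries, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (DocIdSource query : queries) {
            for (int doc : query.matchingDocs()) {
                bits.set(doc); // setting an already-set bit is harmless, so overlaps cost nothing extra
            }
        }
        return bits;
    }
}
```

Only one bitset is ever allocated, regardless of how many queries the word list expands to.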
The queries are nice and fast, but now it occurs to me that I probably
should highlight the text matched by the word list.
Unfortunately, the contrib/highlighter code in source control fails to
meet our needs in two ways:
1. We don't just want fragments, we want *all* of the text, with
highlights in the appropriate places (although we do offer a means
to display just the fragments as well), and
2. We don't deal with HTML, just plain text on a Swing text component.
In other words we don't have to "format" or modify the text at all,
just tell the Swing component which bits need to be highlighted.
The existing highlighting code we wrote basically works like this...
1. Get the text out of the Swing component.
2. Break the text into tokens using the appropriate Analyzer.
3. For each term:
3.1. Break the term into tokens using the same Analyzer.
3.2. Iterate through the list of text tokens looking for the list
of term tokens (basically find a sublist in a list.)
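The steps above boil down to roughly the following sketch. It assumes the Analyzer step has already happened, so the text and each term arrive as plain lists of token strings; PerTermHighlightSketch and findMatches are names invented for this mail, not our actual code.

```java
import java.util.ArrayList;
import java.util.List;

class PerTermHighlightSketch {
    /**
     * For each term, scans the whole text for that term's token sequence.
     * Returns {start, end} token index pairs, end exclusive.
     */
    static List<int[]> findMatches(List<String> textTokens, List<List<String>> terms) {
        List<int[]> matches = new ArrayList<>();
        for (List<String> termTokens : terms) {
            if (termTokens.isEmpty()) {
                continue; // the analyzer may have dropped every token of the term
            }
            int lastStart = textTokens.size() - termTokens.size();
            for (int start = 0; start <= lastStart; start++) {
                int end = start + termTokens.size();
                // Sublist comparison: does the term's token list occur at this position?
                if (textTokens.subList(start, end).equals(termTokens)) {
                    matches.add(new int[] {start, end});
                }
            }
        }
        return matches;
    }
}
```

Every term triggers a full scan of the text, so the work grows with (number of terms) x (number of text tokens).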
This has served us well so far, but for enormous numbers of terms it
starts to get quite slow.
Is there a better approach to highlighting a large number of terms?
For instance, it might be good to skip some terms if I can figure out
that they're not in the document without spending too much time, and it
might also be good to do all the token searches in a single pass, but
I'm not entirely sure how to go about that.
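The closest I've come to a single-pass idea is the untested sketch below: index the terms by their first token, then walk the text once and only compare the terms that could start at the current position. It again assumes the text and terms are already tokenised into string lists, and SinglePassHighlightSketch / findMatches are made-up names. I don't know whether this is the best way to do it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SinglePassHighlightSketch {
    /** Returns {start, end} token index pairs (end exclusive), in text order. */
    static List<int[]> findMatches(List<String> textTokens, List<List<String>> terms) {
        // Group the terms by their first token so each text position only
        // looks at the terms that can possibly begin there.
        Map<String, List<List<String>>> byFirstToken = new HashMap<>();
        for (List<String> term : terms) {
            if (!term.isEmpty()) {
                byFirstToken.computeIfAbsent(term.get(0), k -> new ArrayList<>()).add(term);
            }
        }

        List<int[]> matches = new ArrayList<>();
        for (int start = 0; start < textTokens.size(); start++) {
            List<List<String>> candidates = byFirstToken.get(textTokens.get(start));
            if (candidates == null) {
                continue; // no term starts with this token
            }
            for (List<String> term : candidates) {
                int end = start + term.size();
                if (end <= textTokens.size() && textTokens.subList(start, end).equals(term)) {
                    matches.add(new int[] {start, end});
                }
            }
        }
        return matches;
    }
}
```

A term whose first token never appears in the text is never compared at all, which also covers the "skip terms that aren't in the document" case.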
Daniel
--
Daniel Noll
Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax: (02) 9212 6902