Hi all.

I've just implemented some magic query syntax which expands simple queries into queries containing whole lists of words.

I've implemented the queries themselves using a slight variation on the theme of QueryFilter: a MultiQueryFilter, which runs all of the queries to mark a single bitset. This is much faster than applying a logical OR to many QueryFilter bitsets, and uses much less memory than a single QueryFilter wrapped around an enormous BooleanQuery.

The queries are nice and fast, but now it occurs to me that I should probably highlight the text matched by the words in the word list.

Unfortunately, the contrib/highlighter code in source control fails to meet our needs in two ways:

  1. We don't just want fragments, we want *all* of the text, with
     highlights in the appropriate places (although we do offer a means
     to display just the fragments as well), and

  2. We don't deal with HTML, just plain text on a Swing text component.
     In other words we don't have to "format" or modify the text at all,
     just tell the Swing component which bits need to be highlighted.
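
To make point 2 concrete, here is a minimal sketch of what "telling the Swing component which bits to highlight" can look like using the standard javax.swing.text.Highlighter API. The class name SwingHighlightSketch, the method applyHighlights and the {start, end} span representation are made up for illustration and are not our actual code.

```java
import java.awt.Color;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultHighlighter;
import javax.swing.text.Highlighter;
import javax.swing.text.JTextComponent;

// Hypothetical helper: names and span format are assumptions for this sketch.
final class SwingHighlightSketch {

    /**
     * Replaces the component's highlights with one highlight per span.
     * Each span is {startOffset, endOffset}, end exclusive, measured in
     * characters of the component's document. The text is never modified.
     */
    public static void applyHighlights(JTextComponent component, List<int[]> spans)
            throws BadLocationException {
        Highlighter highlighter = component.getHighlighter();
        highlighter.removeAllHighlights();
        Highlighter.HighlightPainter painter =
                new DefaultHighlighter.DefaultHighlightPainter(Color.YELLOW);
        for (int[] span : spans) {
            // Throws BadLocationException if a span falls outside the document.
            highlighter.addHighlight(span[0], span[1], painter);
        }
    }
}
```

The only input the component needs is a list of character offsets, which is why a formatter that rewrites the text into HTML is of no use to us.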

The existing highlighting code we wrote basically works like this...

  1. Get the text out of the Swing component.

  2. Break the text into tokens using the appropriate Analyzer.

  3. For each term:
      3.1. Break the term into tokens using the same Analyzer.
      3.2. Iterate through the list of text tokens looking for the list
           of term tokens (basically find a sublist in a list.)
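
The steps above can be sketched roughly as follows. This is a simplified illustration rather than our real code: tokenize() is a hypothetical stand-in for the Analyzer (it just lowercases runs of letters and digits), and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative only: one full scan of the text tokens per term.
final class NaiveTermHighlighter {

    /** A token's text plus its character offsets in the source (end exclusive). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for an Analyzer: lowercased runs of letters or digits. */
    public static List<Token> tokenize(String s) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < s.length() && Character.isLetterOrDigit(s.charAt(i))) {
                i++;
            }
            tokens.add(new Token(s.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    /** Steps 2 and 3: returns {start, end} character spans for each occurrence of each term. */
    public static List<int[]> findSpans(String text, List<String> terms) {
        List<Token> textTokens = tokenize(text);
        List<int[]> spans = new ArrayList<>();
        for (String term : terms) {
            List<Token> termTokens = tokenize(term);
            if (termTokens.isEmpty()) {
                continue;
            }
            // Sublist search: try every starting position in the text tokens.
            for (int i = 0; i + termTokens.size() <= textTokens.size(); i++) {
                boolean match = true;
                for (int j = 0; j < termTokens.size(); j++) {
                    if (!textTokens.get(i + j).text.equals(termTokens.get(j).text)) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    int last = i + termTokens.size() - 1;
                    spans.add(new int[] {textTokens.get(i).start, textTokens.get(last).end});
                }
            }
        }
        return spans;
    }
}
```

The outer loop over terms wraps a complete scan of the text tokens, so the work grows with (number of terms) x (number of text tokens).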

This has served us well so far, but with an enormous number of terms it gets quite slow, since every term triggers its own scan over all of the text tokens.

Is there a better approach to highlighting a large number of terms? For instance, it might be good to skip terms that I can cheaply determine are not in the document, and it might also be good to do all of the token searches in a single pass over the text, but I'm not entirely sure how to go about that.
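
One possible shape for the single-pass idea, offered purely as an untested-in-production sketch: index the tokenized terms by their first token, then walk the text tokens once and only compare the terms whose first token occurs at the current position. Terms whose first token never appears in the document are skipped without any work. As before, tokenize() is a hypothetical stand-in for the Analyzer, and all names here are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: a first-token index, not a full multi-pattern matcher.
final class SinglePassTermHighlighter {

    /** A token's text plus its character offsets in the source (end exclusive). */
    static final class Token {
        final String text;
        final int start;
        final int end;

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical stand-in for an Analyzer: lowercased runs of letters or digits. */
    public static List<Token> tokenize(String s) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            if (!Character.isLetterOrDigit(s.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < s.length() && Character.isLetterOrDigit(s.charAt(i))) {
                i++;
            }
            tokens.add(new Token(s.substring(start, i).toLowerCase(Locale.ROOT), start, i));
        }
        return tokens;
    }

    /** Groups the tokenized terms by their first token. Build once, reuse per document. */
    public static Map<String, List<List<String>>> indexByFirstToken(List<String> terms) {
        Map<String, List<List<String>>> index = new HashMap<>();
        for (String term : terms) {
            List<String> words = new ArrayList<>();
            for (Token t : tokenize(term)) {
                words.add(t.text);
            }
            if (!words.isEmpty()) {
                index.computeIfAbsent(words.get(0), k -> new ArrayList<>()).add(words);
            }
        }
        return index;
    }

    /** One pass over the text tokens; returns {start, end} character spans in text order. */
    public static List<int[]> findSpans(String text, Map<String, List<List<String>>> index) {
        List<Token> tokens = tokenize(text);
        List<int[]> spans = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            List<List<String>> candidates = index.get(tokens.get(i).text);
            if (candidates == null) {
                continue; // no term starts with this token
            }
            for (List<String> words : candidates) {
                if (matchesAt(tokens, i, words)) {
                    spans.add(new int[] {tokens.get(i).start, tokens.get(i + words.size() - 1).end});
                }
            }
        }
        return spans;
    }

    /** True if the term's tokens appear consecutively in the text starting at position i. */
    private static boolean matchesAt(List<Token> tokens, int i, List<String> words) {
        if (i + words.size() > tokens.size()) {
            return false;
        }
        for (int j = 0; j < words.size(); j++) {
            if (!tokens.get(i + j).text.equals(words.get(j))) {
                return false;
            }
        }
        return true;
    }
}
```

If many terms share the same first token this degrades back towards the per-term scan, in which case a trie over the term tokens might be the next thing to try.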

Daniel


--
Daniel Noll

Nuix Australia Pty Ltd
Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
Phone: (02) 9280 0699
Fax:   (02) 9212 6902

This message is intended only for the named recipient. If you are not
the intended recipient you are notified that disclosing, copying,
distributing or taking any action in reliance on the contents of this
message or attachment is strictly prohibited.
