Well, as far as I can see the Lucene highlighters do not offer this functionality via their API, but they easily could.

I think support for highlighting whole documents would be a very welcome feature. Highlighting HTML documents is already possible with org.apache.solr.analysis.HTMLStripCharFilter and a NullFragmenter, but there seems to be nothing for highlighting PDF files.

As a starting point I extracted the class below from org.apache.lucene.search.highlight.Highlighter. It just returns the tokens contributing to the hit.

Using the returned tokens, a PDF highlight file could easily be generated, and voilà.

-- Wulf

package org.apache.lucene.search.highlight;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;


public class HighlightTokensExtractor
{
  private Scorer fragmentScorer = null;

  public HighlightTokensExtractor(Scorer fragmentScorer)
  {
    this.fragmentScorer = fragmentScorer;
  }

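  // Note: mergeContiguousFragments and maxNumFragments are leftovers from the
  // Highlighter method this was taken from; they are not used here.
  public final List<Token> getTokens(TokenStream tokenStream, String text,
      boolean mergeContiguousFragments, int maxNumFragments)
      throws IOException, InvalidTokenOffsetsException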
  {
    List<Token> result = new ArrayList<Token>();
    TermAttribute termAtt = tokenStream.addAttribute(TermAttribute.class);
    OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
    tokenStream.addAttribute(PositionIncrementAttribute.class);
    tokenStream.reset();

    // dummy text fragment
    TextFragment currentFrag = new TextFragment("", 0, 0);
    TokenStream newStream = fragmentScorer.init(tokenStream);
    if (newStream != null) {
      tokenStream = newStream;
    }
    fragmentScorer.startFragment(currentFrag);

    try {

      TokenGroup tokenGroup = new TokenGroup(tokenStream);

      for (boolean next = tokenStream.incrementToken(); next;
          next = tokenStream.incrementToken()) {
        if ((offsetAtt.endOffset() > text.length())
            || (offsetAtt.startOffset() > text.length())) {
          throw new InvalidTokenOffsetsException("Token " + termAtt.term()
              + " exceeds length of provided text sized " + text.length());
        }
        if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct())) {

          if (tokenGroup.getTotalScore() > 0) {
            System.out.println(tokenGroup.matchStartOffset + " "
                + tokenGroup.matchEndOffset);

            result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
          }
          tokenGroup.clear();

        }
        tokenGroup.addToken(fragmentScorer.getTokenScore());

      }

      if (tokenGroup.numTokens > 0) {

        if (tokenGroup.getTotalScore() > 0) {
          System.out.println(tokenGroup.matchStartOffset + " "
              + tokenGroup.matchEndOffset);

          result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
        }
      }

      return result;

    }
    finally {
      if (tokenStream != null) {
        try {
          tokenStream.close();
        }
        catch (Exception e) {
          // ignore: nothing useful can be done if closing fails
        }
      }
    }
  }

}
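
To illustrate the last step, here is a minimal sketch of how offsets like the ones returned above might be written out as a highlight file. It is only a sketch under assumptions: HighlightFileSketch, Span, pageOf and toHighlightXml are hypothetical names of mine, the pageStarts array (the offset in the extracted text at which each page begins) would have to come from whatever extracts the PDF text, and the element and attribute names are as I remember them from the two pages linked in the original message, so please check them against the Adobe document. It also assumes that Lucene's offsets count characters the same way Acrobat does, which I have not verified.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Lucene or PDFBox. */
final class HighlightFileSketch {

  /** A hit as character offsets into the full extracted text (end exclusive). */
  static final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Returns the zero-based page containing the given offset.
   * Assumes pageStarts is sorted ascending and pageStarts[0] == 0.
   */
  static int pageOf(int offset, int[] pageStarts) {
    int idx = Arrays.binarySearch(pageStarts, offset);
    // Not found: binarySearch returns -(insertionPoint) - 1, and the page is
    // the entry just before the insertion point.
    return idx >= 0 ? idx : -idx - 2;
  }

  /**
   * Builds the highlight file content. Each span is assigned to the page on
   * which it starts; spans crossing a page boundary are not split here.
   */
  static String toHighlightXml(List<Span> spans, int[] pageStarts) {
    StringBuilder sb = new StringBuilder();
    // Tag layout assumed from the highlight file format description.
    sb.append("<XML>\n<Body units=characters>\n<Highlight>\n");
    for (Span s : spans) {
      int page = pageOf(s.start, pageStarts);
      int pos = s.start - pageStarts[page]; // offset relative to the page
      int len = s.end - s.start;
      sb.append("<loc pg=").append(page)
        .append(" pos=").append(pos)
        .append(" len=").append(len)
        .append(">\n");
    }
    sb.append("</Highlight>\n</Body>\n</XML>\n");
    return sb.toString();
  }

  public static void main(String[] args) {
    int[] pageStarts = {0, 100, 250}; // made-up page boundaries
    List<Span> spans = new ArrayList<Span>();
    spans.add(new Span(15, 22));
    spans.add(new Span(120, 125));
    System.out.print(toHighlightXml(spans, pageStarts));
  }
}
```

With the extractor above, each Span would be built from token.startOffset() and token.endOffset(). The per-page bookkeeping is the part Lucene cannot help with, since the index only knows offsets into the text it was given.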



On 10.05.2011 12:32, Wulf Berschin wrote:
Hi all,

In our Lucene 3.0.3-based web application, when a user clicks on a hit
link, the targeted PDF should be opened in the browser with the hits
highlighted.

For this purpose, using the Acrobat Highlight File (parameter xml, see
http://www.pdfbox.org/userguide/highlighting.html and
http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf)
seems most reasonable to me.

Since the positions to highlight are given by (page and) character
offsets, and Lucene uses offsets as well, I think it could be easy (for
people more skilled in Lucene than me) to create a Highlighter which
produces this highlight file.

Does such a Highlighter already exist in the Lucene world?

If not, could someone please point me in the right direction (e.g. where
to hook into the existing (fast vector?) highlighter just to extract the
offsets)?

BTW: Luke gave me the impression that term vectors are only stored when
the field content is stored as well. Is that true?

Wulf

