Re: PDF Highlighting using PDF Highlight File

Wulf Berschin Thu, 12 May 2011 07:47:45 -0700

Well, AFAIS the Lucene Highlighters do not offer this functionality via their API, but could easily do so.

I think support for highlighting documents would be a very welcome feature. Highlighting HTML documents is already possible with the org.apache.solr.analysis.HTMLStripCharFilter and a NullFragmenter, but there seems to be nothing for highlighting PDF files...

As a starting point I extracted the class below from org.apache.lucene.search.highlight.Highlighter. It just returns the tokens contributing to the hit.

Using the returned tokens, a PDF highlight file could easily be generated, and voilà.


-- Wulf

package org.apache.lucene.search.highlight;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;


public class HighlightTokensExtractor
{
  private Scorer fragmentScorer = null;

  public HighlightTokensExtractor(Scorer fragmentScorer)
  {
    this.fragmentScorer = fragmentScorer;
  }

  public final List<Token> getTokens(TokenStream tokenStream, String text,
      boolean mergeContiguousFragments, int maxNumFragments)
      throws IOException, InvalidTokenOffsetsException
  {
    List<Token> result = new ArrayList<Token>();
    TermAttribute termAtt = tokenStream.addAttribute(TermAttribute.class);
    OffsetAttribute offsetAtt = tokenStream.addAttribute(OffsetAttribute.class);
    tokenStream.addAttribute(PositionIncrementAttribute.class);
    tokenStream.reset();

    // dummy text fragment
    TextFragment currentFrag = new TextFragment("", 0, 0);
    TokenStream newStream = fragmentScorer.init(tokenStream);
    if (newStream != null) {
      tokenStream = newStream;
    }
    fragmentScorer.startFragment(currentFrag);

    try {

      TokenGroup tokenGroup = new TokenGroup(tokenStream);

      // walk the stream token by token
      while (tokenStream.incrementToken()) {
        if ((offsetAtt.endOffset() > text.length())
            || (offsetAtt.startOffset() > text.length())) {
          throw new InvalidTokenOffsetsException("Token " + termAtt.term()
              + " exceeds length of provided text sized " + text.length());
        }
        if ((tokenGroup.numTokens > 0) && (tokenGroup.isDistinct())) {

          if (tokenGroup.getTotalScore() > 0) {
            System.out.println(tokenGroup.matchStartOffset + " "
                + tokenGroup.matchEndOffset);

            // keep the last token of the scoring group
            result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
          }
          tokenGroup.clear();

        }
        tokenGroup.addToken(fragmentScorer.getTokenScore());

      }

      if (tokenGroup.numTokens > 0) {

        if (tokenGroup.getTotalScore() > 0) {
          System.out.println(tokenGroup.matchStartOffset + " "
              + tokenGroup.matchEndOffset);

          // flush the final group
          result.add((Token) tokenGroup.getToken(tokenGroup.getNumTokens() - 1));
        }
      }

      return result;

    }
    finally {
      if (tokenStream != null) {
        try {
          tokenStream.close();
        }
        catch (Exception e) {
          // ignore failures while closing the stream
        }
      }
    }
  }

}
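
The step from the returned tokens to a highlight file could look roughly like the sketch below. It is only a sketch under assumptions: HighlightFileWriter, Span, locate and pageStarts are hypothetical names, not part of Lucene or PDFBox, and the element layout (XML, Body, Highlight, loc with pg/pos/len) is written from memory of the Adobe highlight file format linked in the original question, so it should be checked against that document. It also assumes that the start offset of each page within the extracted text is known, and that Acrobat counts characters the same way the text extractor does, which may not hold (e.g. for whitespace).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: turns token offsets into a PDF highlight file. */
public class HighlightFileWriter {

  /** One highlighted span: zero-based page, character offset on that page, length. */
  public static final class Span {
    public final int page;
    public final int pos;
    public final int len;

    public Span(int page, int pos, int len) {
      this.page = page;
      this.pos = pos;
      this.len = len;
    }
  }

  /**
   * Maps document-wide offsets (as reported by a token's startOffset/endOffset)
   * to a page-relative span. pageStarts[i] is the assumed offset of the first
   * character of page i in the extracted text; it must be sorted and start at 0.
   */
  public static Span locate(int startOffset, int endOffset, int[] pageStarts) {
    int idx = Arrays.binarySearch(pageStarts, startOffset);
    // not found: binarySearch returns -(insertionPoint) - 1; the page is the one before
    int page = idx >= 0 ? idx : -idx - 2;
    return new Span(page, startOffset - pageStarts[page], endOffset - startOffset);
  }

  /** Renders the spans as a highlight file using character units. */
  public static String toHighlightXml(List<Span> spans) {
    StringBuilder sb = new StringBuilder();
    sb.append("<XML>\n<Body units=characters version=2>\n<Highlight>\n");
    for (Span s : spans) {
      sb.append("<loc pg=").append(s.page)
        .append(" pos=").append(s.pos)
        .append(" len=").append(s.len).append(">\n");
    }
    sb.append("</Highlight>\n</Body>\n</XML>\n");
    return sb.toString();
  }

  public static void main(String[] args) {
    int[] pageStarts = {0, 100, 250};   // made-up page boundaries
    List<Span> spans = new ArrayList<Span>();
    spans.add(locate(12, 18, pageStarts));
    spans.add(locate(130, 136, pageStarts));
    System.out.print(toHighlightXml(spans));
  }
}
```

With the class above, each returned Token would be passed through locate(token.startOffset(), token.endOffset(), pageStarts) before writing the file.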



On 10.05.2011 12:32, Wulf Berschin wrote:

Hi all,

in our Lucene 3.0.3-based web application when a user clicks on a hit
link the targeted PDF should be opened in the browser with highlighted
hits.

For this purpose using the Acrobat Highlight File (Parameter xml, see
http://www.pdfbox.org/userguide/highlighting.html and
http://partners.adobe.com/public/developer/en/pdf/HighlightFileFormat.pdf)
seems most reasonable to me.

Since the positions to highlight are given by (page and) character
offsets and Lucene uses offsets as well, I think it could be easy (for
more Lucene-skilled people than me) to create a Highlighter which
produces this highlight file.

Does such a Highlighter already exist in the Lucene world?

If not, could someone please point me in the right direction (e.g. where
to hook into the existing (fast vector?) highlighter just to extract the
offsets)?

BTW: Luke gave me the impression that term vectors are only stored when
the field content is stored as well. Is that true?

Wulf



