Nicko Cadell kindly pointed out the issues involved in generating XHTML-compliant markup with the highlighter and provided a patch to fix them.

The main code has now been updated in the new SVN repository here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

To encode your content, simply pass an encoder to the Highlighter, e.g.:


// Create an example doc for this test.
// Ordinarily you'd get the doc content like this:
//   myDocContent = hits.doc(i).get(FIELD_NAME);
String myDocContent = "\"Smith & sons' prices < 3 and >4\" claims article";

// Create a query - you'd normally get this from QueryParser.parse
Query myDocQuery = new TermQuery(new Term("contents", "prices"));

// Create a highlighter and pass a QueryScorer to provide the list of query tokens
Highlighter highlighter = new Highlighter(new QueryScorer(myDocQuery));

// Set the choice of encoder to our simple encoder - otherwise the default is no encoding
highlighter.setEncoder(new SimpleHTMLEncoder());

// Tokenize the document content to get the positions using an analyzer
Analyzer analyzer = new WhitespaceAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(myDocContent));

// As a faster alternative to re-analyzing the doc content, you can use TokenSources
// to take advantage of any pre-tokenized content held in term vectors:
//   TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, docId, fieldName, analyzer);

// Now pass the tokenStream to the highlighter to process
String encodedSnippet = highlighter.getBestFragments(tokenStream, myDocContent, 1, "...");
System.out.println(encodedSnippet);
// Should print: &quot;Smith &amp; sons' <B>prices</B> &lt; 3 and &gt;4&quot; claims article
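For anyone curious what the encoder actually does to the text: SimpleHTMLEncoder escapes the HTML special characters (quote, ampersand, less-than, greater-than), which is what turns the raw doc content into the entity-escaped output above. Here is a minimal standalone sketch of that escaping logic - an illustration only, not the actual Lucene class:

```java
// Standalone sketch of the kind of escaping SimpleHTMLEncoder performs.
// (Illustration only - the real class lives in the highlighter contrib.)
public class HtmlEncodeDemo {

    static String encodeText(String originalText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < originalText.length(); i++) {
            char c = originalText.charAt(i);
            switch (c) {
                case '"': sb.append("&quot;"); break; // double quote
                case '&': sb.append("&amp;");  break; // ampersand
                case '<': sb.append("&lt;");   break; // less-than
                case '>': sb.append("&gt;");   break; // greater-than
                default:  sb.append(c);               // everything else passes through
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "\"Smith & sons' prices < 3 and >4\" claims article";
        System.out.println(encodeText(raw));
        // Prints: &quot;Smith &amp; sons' prices &lt; 3 and &gt;4&quot; claims article
    }
}
```

The only difference from the highlighter output shown above is that the real snippet also wraps matched terms in <B> tags, which the formatter (not the encoder) is responsible for.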


Cheers
Mark






