Nicko Cadell kindly pointed out the issues involved in generating XHTML-compliant markup with the highlighter and provided a patch to fix them.

The main code has now been updated in the new SVN repository here: http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/highlighter/

To encode your content, simply pass an encoder to the Highlighter, e.g.:


// Create an example doc for this test.
// Ordinarily you'd get the doc content like this:
//   myDocContent = hits.doc(i).get(FIELD_NAME);
String myDocContent = "\"Smith & sons' prices < 3 and >4\" claims article";

// Create a query - you'd normally get this from QueryParser.parse
Query myDocQuery = new TermQuery(new Term("contents", "prices"));

// Create a highlighter and pass a QueryScorer to provide the list of query tokens
Highlighter highlighter = new Highlighter(new QueryScorer(myDocQuery));

// Set the choice of encoder to our simple encoder - otherwise the default is no encoding
highlighter.setEncoder(new SimpleHTMLEncoder());

// Tokenize the document content to get the positions using an analyzer
Analyzer analyzer = new WhitespaceAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(myDocContent));

// As a faster alternative to re-analyzing the doc content, you can use TokenSources
// to take advantage of any pre-tokenized content held in term vectors:
//   TokenStream tokenStream = TokenSources.getAnyTokenStream(indexReader, docId, fieldName, analyzer);

// Now pass the tokenStream to the highlighter to process
String encodedSnippet = highlighter.getBestFragments(tokenStream, myDocContent, 1, "...");
System.out.println(encodedSnippet);
// Should print: &quot;Smith &amp; sons' <B>prices</B> &lt; 3 and &gt;4&quot; claims article
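For anyone curious what the encoder actually does to the text: SimpleHTMLEncoder escapes the HTML special characters (quote, ampersand, less-than, greater-than), which is what turns the raw doc content into the entity-escaped output above. Here is a minimal standalone sketch of that escaping logic - an illustration only, not the actual Lucene class:

```java
// Standalone sketch of the kind of escaping SimpleHTMLEncoder performs.
// (Illustration only - the real class lives in the highlighter contrib.)
public class HtmlEncodeDemo {

    static String encodeText(String originalText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < originalText.length(); i++) {
            char c = originalText.charAt(i);
            switch (c) {
                case '"': sb.append("&quot;"); break; // double quote
                case '&': sb.append("&amp;");  break; // ampersand
                case '<': sb.append("&lt;");   break; // less-than
                case '>': sb.append("&gt;");   break; // greater-than
                default:  sb.append(c);               // everything else passes through
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String raw = "\"Smith & sons' prices < 3 and >4\" claims article";
        System.out.println(encodeText(raw));
        // Prints: &quot;Smith &amp; sons' prices &lt; 3 and &gt;4&quot; claims article
    }
}
```

The only difference from the highlighter output shown above is that the real snippet also wraps matched terms in <B> tags, which the formatter (not the encoder) is responsible for.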


Cheers
Mark






