> > Well what happens is if I use a SpanScorer instead, and allocate it like
> > such:
> >
> > analyzer = StandardAnalyzer([])
> > tokenStream = analyzer.tokenStream("contents",
> > lucene.StringReader(text))
> > ctokenStream = lucene.CachingTokenFilter(tokenStream)
> > highlighter = lucene.Highlighter(formatter,
> > lucene.HighlighterSpanScorer(self.query, "contents", ctokenStream))
> > ctokenStream.reset()
> >
> > result = highlighter.getBestFragments(ctokenStream, text,
> > 2, "...")
> >
> > My highlighter is still breaking up words inside of a span. For example,
> > if I search for "John Smith", instead of the highlighter being called for
> > the whole "John Smith", it gets called for "John" and then "Smith".
>
> I think you need to use SimpleSpanFragmenter (vs SimpleFragmenter,
> which is the default used by Highlighter) to ensure that each fragment
> contains a full match for the query. EG something like this (copied
> from LIA 2nd edition):
>
> TermQuery query = new TermQuery(new Term("field", "fox"));
>
> TokenStream tokenStream =
> new SimpleAnalyzer().tokenStream("field",
> new StringReader(text));
>
> SpanScorer scorer = new SpanScorer(query, "field",
> new CachingTokenFilter(tokenStream));
> Fragmenter fragmenter = new SimpleSpanFragmenter(scorer);
> Highlighter highlighter = new Highlighter(scorer);
> highlighter.setTextFragmenter(fragmenter);
Okay, I hacked something up in Java that illustrates my issue.

import org.apache.lucene.search.*;
import org.apache.lucene.analysis.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.search.highlight.*;
import org.apache.lucene.search.spans.SpanTermQuery;
import java.io.Reader;
import java.io.StringReader;

public class PhraseTest {
    private IndexSearcher searcher;
    private RAMDirectory directory;

    public PhraseTest() throws Exception {
        directory = new RAMDirectory();
        Analyzer analyzer = new Analyzer() {
            public TokenStream tokenStream(String fieldName, Reader reader) {
                return new WhitespaceTokenizer(reader);
            }

            public int getPositionIncrementGap(String fieldName) {
                return 100;
            }
        };

        IndexWriter writer = new IndexWriter(directory, analyzer, true,
                IndexWriter.MaxFieldLength.LIMITED);
        Document doc = new Document();
        String text = "Jimbo John is his name";
        doc.add(new Field("contents", text, Field.Store.YES,
                Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.optimize();
        writer.close();

        searcher = new IndexSearcher(directory);

        // Try a phrase query
        PhraseQuery phraseQuery = new PhraseQuery();
        phraseQuery.add(new Term("contents", "Jimbo"));
        phraseQuery.add(new Term("contents", "John"));

        // Try a SpanTermQuery
        SpanTermQuery spanTermQuery = new SpanTermQuery(
                new Term("contents", "Jimbo John"));

        // Try a parsed query
        Query parsedQuery = new QueryParser("contents",
                analyzer).parse("\"Jimbo John\"");

        Hits hits = searcher.search(parsedQuery);
        System.out.println("We found " + hits.length() + " hits.");

        // Highlight the results
        CachingTokenFilter tokenStream = new CachingTokenFilter(
                analyzer.tokenStream("contents", new StringReader(text)));
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter();
        SpanScorer sc = new SpanScorer(parsedQuery, "contents", tokenStream,
                "contents");
        Highlighter highlighter = new Highlighter(formatter, sc);
        highlighter.setTextFragmenter(new SimpleSpanFragmenter(sc));
        tokenStream.reset();

        String rv = highlighter.getBestFragments(tokenStream, text, 1,
                "...");
        System.out.println(rv);
    }

    public static void main(String[] args) {
        System.out.println("Starting...");
        try {
            PhraseTest pt = new PhraseTest();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

The problem is the output: instead of highlighting the whole phrase as
<B>Jimbo John</B>, it highlights each term separately, <B>Jimbo</B> then
<B>John</B>. Can I get around this somehow? I tried several different query
types (they are all declared in the code, but only the parsed version is
actually used).
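
For reference, one possible workaround would be to post-process the highlighter's output and merge tags that are separated only by whitespace. This is only a sketch: HighlightMerger and mergeAdjacent are hypothetical names, not part of Lucene, and it assumes the formatter wraps each term in a fixed pair of tags (the default SimpleHTMLFormatter uses <B> and </B>).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: merges highlight tags that are
// separated only by whitespace, so "<B>Jimbo</B> <B>John</B>" becomes
// "<B>Jimbo John</B>".
final class HighlightMerger {
    private HighlightMerger() {}

    static String mergeAdjacent(String highlighted, String preTag, String postTag) {
        // Match a closing tag, optional whitespace, then an opening tag.
        Pattern gap = Pattern.compile(
                Pattern.quote(postTag) + "(\\s*)" + Pattern.quote(preTag));
        // Drop both tags and keep only the whitespace between the two terms.
        return gap.matcher(highlighted).replaceAll("$1");
    }

    public static void main(String[] args) {
        String out = mergeAdjacent(
                "<B>Jimbo</B> <B>John</B> is his name", "<B>", "</B>");
        System.out.println(out);  // prints "<B>Jimbo John</B> is his name"
    }
}
```

In the test program this would be applied to rv before printing, e.g. HighlightMerger.mergeAdjacent(rv, "<B>", "</B>"). It is purely cosmetic: it would also merge two independent matches that happen to sit next to each other, so it does not make the highlighter itself treat the phrase as a single unit.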
Thanks
-max