On 6/22/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
I have, like many people, I'm certain, implemented field highlighting/summarizing in Solr and am interested in contributing back a patch.
Great!
There are various ways highlighting could be integrated into Solr, so I'd like to open up discussion a bit on this front before proceeding.
Hopefully Erik is around to add his experiences on this front...
Current status: Highlighting is implemented as part of SolrPluginUtils and integrated in StandardRequestHandler and DisMax... It is capable of highlighting an arbitrary number of stored fields given a query.
It uses term vectors, if present, to speed up highlighting (else the stored field needs to be re-analyzed).
Does your current code already handle the re-analyze case?
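As a rough illustration of the re-analyze case asked about above: when no term vectors are available, the stored text has to be tokenized again to find where the query terms occur. The sketch below is not the patch under discussion and does not use Lucene's Highlighter; the class name, the `\w+` tokenizer, and the `<em>` markup are all assumptions chosen to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReanalyzeHighlightSketch {

    // Hypothetical stand-in for an analyzer: splits on word characters only.
    private static final Pattern WORD = Pattern.compile("\\w+");

    /**
     * Re-tokenizes the stored text and wraps every token whose lowercase
     * form is in queryTerms with <em>...</em>.
     */
    public static String highlight(String storedText, Set<String> queryTerms) {
        Matcher m = WORD.matcher(storedText);
        StringBuilder out = new StringBuilder();
        int last = 0; // end offset of the last highlighted token
        while (m.find()) {
            if (queryTerms.contains(m.group().toLowerCase(Locale.ROOT))) {
                out.append(storedText, last, m.start()); // text between matches
                out.append("<em>").append(m.group()).append("</em>");
                last = m.end();
            }
        }
        out.append(storedText.substring(last)); // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("solr", "fields"));
        // prints: <em>Solr</em> highlights stored <em>fields</em>
        System.out.println(highlight("Solr highlights stored fields", terms));
    }
}
```

With term vectors, the token offsets would be read from the index rather than recomputed, which is the speed-up described in the quoted message.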
The doc cache is used so the performance impact is (relatively) minimal.

Issues:

Output: Currently, the summary data is output as a separate element in the <response> element (like the debug data is currently). This is not hard to parse, but perhaps it would be more consistent to add it to the <doc> elements (seems like that would require a bit of hackery).
It does seem like it would be easier for clients to parse document-associated data if it is included directly in the <doc> element. One way might be to create an extension point where Documents could be manipulated and fields could be added. This could also be useful for integrating with large stored fields that might not be kept in the index, but in a separate database instead.

That brings up another point... the ability to highlight a field that has neither term vectors nor a stored value in the index might be useful. The same interface used to add new fields above could potentially be used here as well.
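To make the extension-point idea concrete, here is a minimal sketch. Nothing in it is an existing Solr API: the `DocumentAugmenter` interface, the `Map`-based document, and the `id` and `summary` field names are all hypothetical, chosen only to show how a hook could add a field (a highlight snippet, or text fetched from an external store) before the document is written to the response.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DocAugmentSketch {

    /** Hypothetical extension point: may add or replace fields on a document. */
    public interface DocumentAugmenter {
        void augment(Map<String, Object> doc);
    }

    /** Example augmenter: attaches a precomputed snippet, looked up by document id. */
    public static class SnippetAugmenter implements DocumentAugmenter {
        private final Map<String, String> snippetsById;

        public SnippetAugmenter(Map<String, String> snippetsById) {
            this.snippetsById = snippetsById;
        }

        @Override
        public void augment(Map<String, Object> doc) {
            String snippet = snippetsById.get(String.valueOf(doc.get("id")));
            if (snippet != null) {
                doc.put("summary", snippet); // "summary" is an assumed field name
            }
        }
    }

    /** Runs every augmenter over every document, in order. */
    public static void applyAll(List<Map<String, Object>> docs,
                                List<DocumentAugmenter> augmenters) {
        for (Map<String, Object> doc : docs) {
            for (DocumentAugmenter a : augmenters) {
                a.augment(doc);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new HashMap<>();
        doc.put("id", "42");
        Map<String, String> snippets = new HashMap<>();
        snippets.put("42", "a <em>matching</em> fragment");
        applyAll(List.of(doc), List.of(new SnippetAugmenter(snippets)));
        System.out.println(doc.get("summary"));
    }
}
```

An augmenter backed by a database lookup instead of an in-memory map would cover the externally stored field case with the same interface.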
Customization: Currently, the fields summarized, the number of fragments, and the Formatter can be customized as RequestHandler parameters. This isn't really optimal--if a field is summarized/highlighted, it is usually done in the same manner (and different fields require different Formatter/Fragmenter/Scorer criteria). Ideally, the customization should be done in the FieldType, and the only RequestHandler customization would be the selection of which fields to highlight.
I'm not sure if this is really a property of a field. Another possibility is using init params in the request handler defined in solrconfig.xml, with the possibility of overriding them in a request.
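The defaults-plus-override idea could look roughly like the following. This is only a sketch of the lookup order, not Solr code: the class name and the parameter names `highlightFields` and `maxFragments` are made up for illustration, and the handler defaults are passed in as a plain map rather than read from solrconfig.xml.

```java
import java.util.HashMap;
import java.util.Map;

public class ParamDefaultsSketch {

    /**
     * Returns the effective parameters: handler-level defaults,
     * with any parameter present in the request taking precedence.
     */
    public static Map<String, String> resolve(Map<String, String> handlerDefaults,
                                              Map<String, String> requestParams) {
        Map<String, String> effective = new HashMap<>(handlerDefaults);
        effective.putAll(requestParams); // request values override defaults
        return effective;
    }

    public static void main(String[] args) {
        // Hypothetical defaults, as they might be configured per request handler.
        Map<String, String> defaults = new HashMap<>();
        defaults.put("highlightFields", "title,body");
        defaults.put("maxFragments", "3");

        // A single request overrides only the fragment count.
        Map<String, String> request = new HashMap<>();
        request.put("maxFragments", "1");

        Map<String, String> effective = resolve(defaults, request);
        System.out.println(effective.get("highlightFields")); // title,body
        System.out.println(effective.get("maxFragments"));    // 1
    }
}
```

Per-field settings could be layered on top of the same scheme by qualifying parameter names with the field name, though that convention is also an assumption here.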
Highlighter issues: Highlighter behaves badly with analyzers which emit multiple tokens in the same position (e.g. WordDelimiterFilter).
File a Lucene bug?
Thoughts? Plans? -Mike
-Yonik