On 6/22/06, Mike Klaas <[EMAIL PROTECTED]> wrote:
I have, like many people I'm certain, implemented field
highlighting/summarizing in Solr and am interested in contributing
back a patch.

Great!

There are various ways highlighting could be integrated
into Solr, and so I'd like to open up discussion a bit on this front
before proceeding.

Hopefully Erik is around to add his experiences on this front...

Current status: Highlighting is implemented as part of SolrPluginUtils
and integrated in StandardRequestHandler and DisMax...  It is capable
of highlighting an arbitrary number of stored fields given a query.

It uses term vectors, if present, to speed up highlighting (otherwise
the stored field needs to be re-analyzed).

Does your current code already handle the re-analyze case?

The doc cache is used so the
performance impact is (relatively) minimal.

Issues:

Output: Currently, the summary data is output as a separate element in
the <response> element (as the debug data is now).  This is
not hard to parse, but perhaps it would be more consistent to add it
to the <doc> elements (seems like that would require a bit of
hackery).

It does seem like it would be easier for clients to parse
document-associated data if it is included directly in the <doc> element.

One way might be to create an extension point where Documents could be
manipulated and fields could be added.  This could also be useful for
integrating with large stored fields that might not be kept in the
index, but in a separate database instead.

That brings up another point... the ability to highlight something
that has neither term vectors nor the field itself stored
might be useful.  The same interface used to add new fields
above could potentially be used here as well.
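
To make the extension-point idea a bit more concrete, here is a minimal
sketch in plain Java. It does not use any Solr or Lucene classes:
DocAugmenter, SummaryAugmenter and the "summary" field name are all
hypothetical, and a document is modelled as a simple map of field names to
values. It only shows the shape of a hook that gets a chance to add fields
to each document before it is written out.

```java
import java.util.HashMap;
import java.util.Map;

public class DocAugmentSketch {

    /** Hypothetical hook: called once per document before it is written. */
    public interface DocAugmenter {
        void augment(String docId, Map<String, Object> fields);
    }

    /**
     * Hypothetical implementation that adds a precomputed summary to the
     * document. The summaries could equally come from a highlighter or
     * from an external store holding large fields.
     */
    public static class SummaryAugmenter implements DocAugmenter {
        private final Map<String, String> summaries;

        public SummaryAugmenter(Map<String, String> summaries) {
            this.summaries = new HashMap<>(summaries);
        }

        @Override
        public void augment(String docId, Map<String, Object> fields) {
            String summary = summaries.get(docId);
            if (summary != null) {
                // "summary" is an illustrative field name, not a Solr convention.
                fields.put("summary", summary);
            }
        }
    }
}
```

A response writer would then call augment() for each document just before
emitting its <doc> element, so the client sees the extra field alongside
the stored ones.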

Customization: Currently, the fields summarized, the number of
fragments, and the Formatter can be customized as RequestHandler
parameters.  This isn't really optimal--if a field is
summarized/highlighted, it is usually done in the same manner (and
different fields require different Formatter/Fragmenter/Scorer
criteria).  Ideally, the customization should be done in the
FieldType, and the only RequestHandler customization is the selection
of which fields to highlight.

I'm not sure if this is really a property of a field.
Another possibility is using init params in the request handler
defined in solrconfig.xml, with the possibility of overriding them in
a request.
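
As a rough illustration of the defaults-plus-override idea, the lookup could
be as simple as the sketch below. This is plain Java with no Solr classes;
HighlightParams and the "hl.fragments" parameter name are made up for the
example, and both the configured defaults and the request parameters are
modelled as string maps.

```java
import java.util.HashMap;
import java.util.Map;

public class HighlightParams {
    // Defaults as they might be read from the handler's init params.
    private final Map<String, String> defaults;

    public HighlightParams(Map<String, String> defaults) {
        this.defaults = new HashMap<>(defaults);
    }

    /**
     * Returns the request's value for the parameter if one was supplied,
     * otherwise the configured default (or null if neither is set).
     */
    public String get(Map<String, String> request, String name) {
        String value = request.get(name);
        return value != null ? value : defaults.get(name);
    }
}
```

With this arrangement the common case needs no highlighting parameters on
the request at all, and a client can still override a single setting for
one query.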

Highlighter issues: Highlighter behaves badly with analyzers which
emit multiple tokens in the same position (e.g. WordDelimiterFilter).

File a Lucene bug?
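
For anyone following along, here is a small sketch of what "multiple tokens
in the same position" can look like, and one possible way for calling code
to cope with it. It is plain Java and is not the Lucene Highlighter: the
Token class is hypothetical, and the example tokens for "Wi-Fi" (two parts
plus a joined form covering the same characters) are only an assumption
about the kind of output such a filter produces. The idea is to merge
overlapping character offsets first, so each region of text is wrapped
only once.

```java
import java.util.ArrayList;
import java.util.List;

public class SamePositionTokens {

    /** Hypothetical token: term text plus [start, end) character offsets. */
    public static final class Token {
        final String term;
        final int start;
        final int end;

        public Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Merges overlapping offset ranges into single spans.
     * Assumes the tokens are already sorted by start offset.
     */
    public static List<int[]> mergeOffsets(List<Token> tokens) {
        List<int[]> merged = new ArrayList<>();
        for (Token t : tokens) {
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && t.start < last[1]) {
                // Overlaps the previous span: extend it instead of adding a new one.
                last[1] = Math.max(last[1], t.end);
            } else {
                merged.add(new int[] {t.start, t.end});
            }
        }
        return merged;
    }

    /** Wraps each merged span in <em> tags exactly once. */
    public static String highlight(String text, List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        int cursor = 0;
        for (int[] span : mergeOffsets(tokens)) {
            sb.append(text, cursor, span[0]);
            sb.append("<em>").append(text, span[0], span[1]).append("</em>");
            cursor = span[1];
        }
        sb.append(text.substring(cursor));
        return sb.toString();
    }
}
```

Without the merge step, a naive loop over the three tokens would emit
markup for the same characters more than once.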

Thoughts? Plans?
-Mike

-Yonik
