On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:
>> I see that you're using the HTML analyzer. Unfortunately that
>> does not play very well with highlighting at the moment. You may
>> get garbled output.
>
> Is it the HTML analyzer or the fact that it's HTML content? If it's
> just the analyzer you could always just copy the HTML content to
> another field with a different analyzer and use that for
> highlighting (but search on the original field). Would this work,
> and if so, which analyzer would be suitable for the second field?

The HTML analyzer strips the HTML but doesn't update the token offsets
correctly (the highlighter uses these offsets to decide where to insert
the <em> tags). If you use a "normal" analyzer (for example, one built
with WordDelimiterFilter) directly on the HTML, the offsets will be
correct, but the HTML tags will be returned in your output and you will
have to strip them carefully. That also means you can't use the default
'<em>' as the highlighting marker, since stripping the tags would remove
it too.
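
To make the offset problem concrete, here is a rough Python sketch. It is
not Solr code: the regex tag stripper, the toy tokenizer, and the names
tokens_with_offsets, highlight, and safe_snippet are all made up for
illustration. Unlike a real "normal" analyzer, the toy tokenizer skips
the text inside tags instead of indexing it. The sketch shows offsets
computed against stripped text being applied to the raw HTML (garbled),
offsets computed against the raw HTML (correct), and one way to use
non-HTML placeholder markers so the real tags can be stripped afterwards.

```python
import html
import re

# Naive patterns, assumed good enough for a demo; real HTML needs a parser.
TAG_RE = re.compile(r"<[^>]+>")
WORD_RE = re.compile(r"\w+")


def tokens_with_offsets(text):
    """Return (term, start, end) for each word outside of <...> tags.

    Offsets always refer to positions in `text` itself.
    """
    tokens = []
    pos = 0
    for tag in TAG_RE.finditer(text):
        # Tokenize the plain text between the previous tag and this one.
        for m in WORD_RE.finditer(text, pos, tag.start()):
            tokens.append((m.group().lower(), m.start(), m.end()))
        pos = tag.end()
    # Tokenize whatever follows the last tag.
    for m in WORD_RE.finditer(text, pos):
        tokens.append((m.group().lower(), m.start(), m.end()))
    return tokens


def highlight(text, tokens, query, pre="<em>", post="</em>"):
    """Wrap every token equal to `query` in pre/post, using its offsets."""
    out = []
    last = 0
    for term, start, end in tokens:
        if term == query:
            out.append(text[last:start])
            out.append(pre + text[start:end] + post)
            last = end
    out.append(text[last:])
    return "".join(out)


def safe_snippet(text, tokens, query):
    """Highlight with placeholder markers, strip the tags, then add <em>."""
    pre, post = "\x01", "\x02"  # hypothetical sentinels, assumed absent from text
    marked = highlight(text, tokens, query, pre, post)
    plain = html.escape(TAG_RE.sub("", marked))
    return plain.replace(pre, "<em>").replace(post, "</em>")


if __name__ == "__main__":
    raw = "<p>Solr <b>highlighting</b> rocks</p>"
    # Offsets computed on the stripped text, then applied to the raw HTML.
    stale = tokens_with_offsets(TAG_RE.sub("", raw))
    print(highlight(raw, stale, "highlighting"))
    # Offsets computed on the raw HTML itself.
    fresh = tokens_with_offsets(raw)
    print(highlight(raw, fresh, "highlighting"))
    print(safe_snippet(raw, fresh, "highlighting"))
```

The first print puts the <em> tags in the wrong place because the offsets
no longer match the stored text. The last one returns tag-free text with
only the highlighting markup added.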
In general, I don't recommend indexing HTML content straight to
Solr. None of the Solr contributors do this so the use case hasn't
received a lot of love.
I'm actually somewhat surprised that several people are interested in
this but none have been interested enough to implement a solution
and contribute it:
http://issues.apache.org/jira/browse/SOLR-42
-Mike