On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:

I see that you're using the HTML analyzer. Unfortunately that does not play very well with highlighting at the moment. You may get garbled output.

Is it the HTML analyzer or the fact that it's HTML content? If it's just the analyzer you could always just copy the HTML content to another field with a different analyzer and use that for highlighting (but search on the original field). Would this work, and if so, which analyzer would be suitable for the second field?

The HTML analyzer strips the HTML but doesn't update the token offsets to match the original text (the highlighter uses these offsets to determine where to insert the <em> tags).
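
To make the offset problem concrete, here is a minimal Python sketch. It is not Solr's highlighter: the `highlight` and `token_offsets` helpers and the sample string are made up for illustration, and the regex tag-stripping is a crude stand-in for the HTML analyzer. It only shows what happens when offsets computed against the stripped text are applied to the stored HTML.

```python
import re

def highlight(text, offsets, pre="<em>", post="</em>"):
    """Wrap each (start, end) span of text in the given markers."""
    out = []
    last = 0
    for start, end in sorted(offsets):
        out.append(text[last:start])
        out.append(pre + text[start:end] + post)
        last = end
    out.append(text[last:])
    return "".join(out)

def token_offsets(text, term):
    """Hypothetical helper: (start, end) offsets of every occurrence of term."""
    return [(m.start(), m.end()) for m in re.finditer(re.escape(term), text)]

raw = "<p>Solr <b>highlighting</b> demo</p>"

# Crude stand-in for an analyzer that strips tags before tokenizing.
stripped = re.sub(r"<[^>]+>", "", raw)

good = token_offsets(raw, "highlighting")        # offsets into the stored HTML
stale = token_offsets(stripped, "highlighting")  # offsets into the stripped text

correct = highlight(raw, good)
garbled = highlight(raw, stale)

print(correct)  # <p>Solr <b><em>highlighting</em></b> demo</p>
print(garbled)  # <p>So<em>lr <b>highli</em>ghting</b> demo</p>
```

The second result is the kind of garbled output mentioned above: the markers land in the wrong place because the offsets describe a different string than the one being highlighted.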

If you use a "normal" analyzer (for example, one built around WordDelimiterFilter) directly on the HTML, the offsets will be correct, but the HTML tags will come back in your highlighted output and you will have to be careful to strip them. That in turn means you can't use the default '<em>' as the highlighting marker, since stripping the tags would remove it too.
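
One way to handle the stripping on the client side is sketched below in Python, assuming the highlighter has been configured to emit non-HTML markers. The `[[HL]]` / `[[/HL]]` markers and the `clean_snippet` helper are hypothetical; I believe the markers are set with the `hl.simple.pre` / `hl.simple.post` request parameters, but check that against your Solr version. The regex tag-stripping is naive and would need replacing with a real HTML parser for anything beyond simple markup.

```python
import re

# Hypothetical markers that cannot be confused with real HTML tags.
PRE, POST = "[[HL]]", "[[/HL]]"

def clean_snippet(snippet):
    """Strip HTML tags from a highlighted snippet, then turn the
    placeholder markers into <em> tags."""
    # Naive tag removal; assumes no '>' inside attribute values.
    no_tags = re.sub(r"<[^>]+>", "", snippet)
    return no_tags.replace(PRE, "<em>").replace(POST, "</em>")

# Example snippet as it might come back from the highlighter (made up).
snippet = "<p>Solr <b>[[HL]]highlighting[[/HL]]</b> demo</p>"
result = clean_snippet(snippet)
print(result)  # Solr <em>highlighting</em> demo
```

Swapping the markers only after the tags are gone is what keeps the highlighting from being stripped along with the rest of the markup.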

In general, I don't recommend indexing HTML content straight to Solr. None of the Solr contributors do this so the use case hasn't received a lot of love.

I'm actually somewhat surprised that several people are interested in this but none have been sufficiently interested to implement a solution and contribute it:

http://issues.apache.org/jira/browse/SOLR-42

-Mike
