Re: unable to figure out nutch type highlighting in solr....

Mike Klaas Thu, 04 Oct 2007 15:45:42 -0700

On 4-Oct-07, at 3:19 PM, Adrian Sutton wrote:

I see that you're using the HTML analyzer. Unfortunately that does not play very well with highlighting at the moment. You may get garbled output.
Is it the HTML analyzer or the fact that it's HTML content? If it's just the analyzer you could always just copy the HTML content to another field with a different analyzer and use that for highlighting (but search on the original field). Would this work, and if so, which analyzer would be suitable for the second field?

The HTML analyzer strips HTML but doesn't update the offsets nicely (the highlighter uses these to determine where to insert the <em> tags).
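
To make the offset problem concrete, here is a toy Python sketch. It is not Solr code: it simply assumes that token offsets are computed against the tag-stripped text and then applied to the stored HTML, which is one way the mismatch can show up. The names TAG and highlight are made up for the illustration.

```python
import re

# Naive tag matcher, good enough for this illustration only.
TAG = re.compile(r"<[^>]+>")

def highlight(text, start, end):
    """Wrap text[start:end] in <em> tags, the way a highlighter uses offsets."""
    return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]

original = "<p>Solr <b>highlighting</b> demo</p>"
stripped = TAG.sub("", original)  # the text the analyzer actually tokenizes

# Offsets of the matching term, measured in the stripped text.
match = re.search("highlighting", stripped)
start, end = match.start(), match.end()

good = highlight(stripped, start, end)  # offsets and text agree
bad = highlight(original, start, end)   # offsets applied to the wrong text

print(good)
print(bad)
```

The second result has the <em> tags landing in the middle of unrelated text and markup, which is the kind of garbled output described above.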

If you use a "normal" analyzer (like WordDelimiterFilter) directly on the HTML, the offsets will be correct, but you will get HTML tags returned in your output, which you will have to be careful to strip (which means you couldn't use the default '<em>' as highlighting markers).
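
A rough sketch of that stripping step on the client side, in Python. It assumes the highlighter was asked to use non-HTML markers (in Solr these are usually set with the hl.simple.pre and hl.simple.post request parameters; check the version you are running). The marker strings and the clean_snippet helper are hypothetical, and the regex is a naive tag stripper that will not cope with a snippet cut off in the middle of a tag.

```python
import re

# Naive tag matcher; a real HTML parser is safer for untrusted input.
TAG = re.compile(r"<[^>]+>")

# Hypothetical markers chosen so they cannot be mistaken for HTML tags.
PRE, POST = "[[HL]]", "[[/HL]]"

def clean_snippet(snippet):
    """Strip the document's own HTML, then turn the markers into <em> tags."""
    text = TAG.sub("", snippet)
    return text.replace(PRE, "<em>").replace(POST, "</em>")

# Example of a highlighted fragment as it might come back from the server.
snippet = "<p>Solr <b>[[HL]]highlighting[[/HL]]</b> demo</p>"
result = clean_snippet(snippet)
print(result)
```

Doing the marker swap after the strip is the point: if the markers were '<em>' from the start, the strip would remove them along with the document's own tags.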

In general, I don't recommend indexing HTML content straight to Solr. None of the Solr contributors do this so the use case hasn't received a lot of love.

I'm actually somewhat surprised that several people are interested in this but none have been sufficiently interested to implement a solution to contribute:


http://issues.apache.org/jira/browse/SOLR-42

-Mike
