On 3-Oct-07, at 3:26 AM, Ravish Bhagdev wrote:


Because of this I cannot present the resulting HTML in a webpage.  Is
it possible to strip out all HTML tags completely in the result set?
Would you recommend sending stripped-out text to Solr instead?  But
doesn't Solr use HTML features while searching (anchors/titles etc.)?

Why is there no documentation about indexing HTML specifically using
Solr?  How does Nutch do it?  Does it strip out HTML in the snippets
it returns?

Solr isn't a web search engine, and doesn't do any special processing of HTML (although you can ask it to strip HTML if you want).

I recommend stripping the HTML yourself, and putting titles, anchors, etc. in separate fields.
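
As a rough illustration of that approach, here is a minimal sketch using only Python's standard-library html.parser. The field names (title, anchors, body) and the helper html_to_fields are hypothetical, chosen for the example; they would need to match whatever fields your own schema defines.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects title text, anchor text, and plain body text from a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title, self.anchors, self.body = [], [], []
        # Nesting depth of the tags whose content is treated specially.
        self._open = {"title": 0, "a": 0, "script": 0, "style": 0}

    def handle_starttag(self, tag, attrs):
        if tag in self._open:
            self._open[tag] += 1

    def handle_endtag(self, tag):
        if self._open.get(tag, 0) > 0:
            self._open[tag] -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        # Never index script or style content.
        if not text or self._open["script"] or self._open["style"]:
            return
        if self._open["title"]:
            self.title.append(text)
            return
        if self._open["a"]:
            self.anchors.append(text)  # anchor text also stays in the body
        self.body.append(text)


def html_to_fields(html):
    """Return a dict of tag-free fields (hypothetical field names)."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title),
        "anchors": parser.anchors,
        "body": " ".join(parser.body),
    }
```

The resulting dict holds only plain text, so it can be sent to Solr as an ordinary document and the stored body field can be shown in a results page without stray markup.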

I believe that it would be possible to write this as a Solr update-handler plugin, if you wanted it to all run in one place.

-Mike
