Thanks, all, for the help.

Just to make sure I understand correctly, am I right to summarize it
this way, then?

No benefit from the HTML markup itself: unlike Nutch, Solr doesn't
parse HTML, so it ignores anchors, titles, etc., and is not suited for
PageRank-esque indexing.

HTMLAnalyser (by which you probably mean HTMLStripWhitespaceTokenizer?):
its main purpose is to let users index HTML markup; it strips the HTML
tags and indexes the remaining text, but if it is used for generating
snippets in results, the <em> highlight tags may land in the wrong
locations.

To avoid using HTMLAnalyser, strip out the tags yourself and send only
plain text to Solr, indexing it with one of the "normal" analysers.
Highlighting should be accurate in that case.
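
For the archive, a rough sketch of that pre-stripping approach, using
the jsoup HTML parser and the SolrJ client (both are just my choice of
tools here, and the field names are made up):

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.jsoup.Jsoup;

    public class StripThenIndex {
        public static void main(String[] args) throws Exception {
            String rawHtml = "<html><head><title>Hello</title></head>"
                    + "<body><p>Some <strong>marked up</strong> text.</p></body></html>";

            // Parse the HTML up front and keep only the plain text, so
            // tag words like "strong" never become false index tokens.
            org.jsoup.nodes.Document parsed = Jsoup.parse(rawHtml);
            String title = parsed.title();
            String body = parsed.body().text();

            // Send only the extracted text to Solr; highlighting then
            // runs against clean text, so snippet offsets stay accurate.
            CommonsHttpSolrServer solr =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", title);
            doc.addField("body", body);
            solr.add(doc);
            solr.commit();
        }
    }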

(A question, especially for Adrian):

If you are indexing XHTML, do you replace tags with entities before
handing the content to Solr? If so, when you get snippets back, do
they contain tags or entities, or do you convert the entities back to
tags for presentation? What's the best way to handle this? It would
help me a lot if you could briefly describe your configuration.

Do let me know if my assumptions are wrong!

Cheers,
Ravish

On 10/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
>
> : In general, I don't recommend indexing HTML content straight to
> : Solr.  None of the Solr contributors do this, so the use case
> : hasn't received a lot of love.
>
> I second that comment ... the HTML Stripping code was never intended to be
> an "HTML parser"; it was designed to be a workaround for dealing with
> "dirty data" where people had unwanted HTML tags in what should be plain
> text.  Indexing it as-is with some analyzers would result in words like
> "script", "strong", and "class" matching lots of docs where those words
> never really appear in the text.
>
> If you have well-formed HTML documents, use an HTML parser to extract the
> real content.
>
>
>
> -Hoss
>
>
