Re: best practice handling html content

Ahmet Arslan Mon, 19 Apr 2010 09:20:23 -0700

> we want to index and search in our intranet documents.
> the field "body" contains html-tags.
> 
> in our schema.xml we have a fieldType text_de (see at the
> end of this mail) which uses charFilter
> solr.HTMLStripCharFilterFactory with index. 
> So this is no problem: the text is put into the index
> without any HTML. I can search over this field, HTML
> entities like &auml; for a German umlaut (ä) work,
> &nbsp; is filtered out correctly, German language
> support works, etc.
> 
> so now comes the problem. the field body is defined like
> 
> <field name="body" type="text_de" indexed="true"
> stored="true" />
> 
> So we index it and also store the content. On the result
> page, when we print body or the highlighting on body, we
> get all the HTML tags back. That sounds correct, as the
> HTML filter only works at index time...
> 
> So my question is: what is the best way to handle this case?
> Strip out all HTML before adding the document to the index?


I think stripping the HTML before indexing is the best way to do it if you
want to display HTML-stripped content. It also saves disk space, because the
stored field no longer carries the markup.
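
A minimal sketch of what stripping on the client side could look like, assuming
the indexing client is written in Python and using only its standard library.
The names TextExtractor and strip_html are made up for illustration; they are
not part of Solr or any Solr client:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: keeps text nodes, drops tags, scripts, and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &auml; and &nbsp;
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Join text nodes with a space so adjacent block elements do not
        # run together, then collapse all whitespace (including the
        # non-breaking space produced by &nbsp;) into single spaces.
        return " ".join(" ".join(self._parts).split())


def strip_html(html):
    """Return the plain text of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Running the body through strip_html before posting the document means the
stored value, and therefore the result page and the highlighting, contain no
tags. One trade-off of joining text nodes with a space: inline markup in the
middle of a word (e.g. <b>Ber</b>lin) would be split into two tokens.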

Similar discussion: http://search-lucene.com/m/hyKqg1MJEDL
