> we want to index and search in our intranet documents.
> the field "body" contains html-tags.
>
> in our schema.xml we have a fieldType text_de (see at the
> end of this mail) which uses charFilter
> solr.HTMLStripCharFilterFactory with index.
> so this is no problem. the text is put into the index
> without any html. i can do search over this field, also html
> entities like &auml; for a german umlaut (ä) do work,
> are filtered out correct, support for german
> language etc.
>
> so now comes the problem. the field body is defined like
>
> <field name="body" type="text_de" indexed="true" stored="true" />
>
> so we do index it and also store the content. on the result
> page when we are printing body or the highlighting on body we
> have all the html tags back. sounds correct, as the
> HTML-Filter only works on the indexing...
>
> so my question is, how is the best way to handle this case?
> strip out all html before adding the document to the index.
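I think that is the best approach if you want to display HTML-stripped content. HTMLStripCharFilterFactory only affects the indexed tokens; the stored value, which is what the result page and highlighting return, keeps the original markup. Stripping the HTML before indexing also saves disk space. Similar discussion: http://search-lucene.com/m/hyKqg1MJEDL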
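
For illustration, here is a minimal sketch of stripping the HTML on the client side before the document is sent to Solr. It uses only Python's standard-library html.parser. The helper names strip_html and build_solr_doc, the "id" field, and the list of inline tags are assumptions made for this example and do not come from your schema; only the "body" field does. Treat it as a starting point rather than a drop-in replacement for HTMLStripCharFilterFactory, since the two may differ on malformed markup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags, <script> and <style> contents."""

    _SKIP = {"script", "style"}
    # Assumed set of inline tags: no space is inserted around these,
    # so "Gr<b>o</b>up" stays one word.
    _INLINE = {"a", "b", "i", "u", "em", "strong", "span", "sub", "sup"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &auml; to ä.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def _break(self, tag):
        # Block-level tags act as word boundaries.
        if tag not in self._INLINE:
            self._chunks.append(" ")

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1
        self._break(tag)

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        self._break(tag)

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    """Return the plain text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.text()


def build_solr_doc(doc_id, html_body):
    """Build the document to index; "id" is an assumed unique-key field."""
    return {"id": doc_id, "body": strip_html(html_body)}
```

With this in place the stored "body" value is already plain text, so both the result page and the highlighter return it without tags. If you also need the original markup for display, one option is to keep it in a second, stored-only field.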