I think you're getting confused by the difference between indexing and storing. These are orthogonal operations, even though they're configured in the same field definition.
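To make the distinction concrete, here's a toy sketch in Python. It is not Solr code: `analyze` and `ToyIndex` are made-up names, and the regex tag stripper is only a rough stand-in for a real HTML-stripping analyzer chain. The only point is that the searchable tokens and the retrievable text live in two separate places.

```python
import re
from collections import defaultdict


def analyze(raw):
    """Toy analyzer chain (an assumption, not Solr's): strip tags, lowercase, split."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return no_tags.lower().split()


class ToyIndex:
    """Hypothetical miniature index that keeps the two sides apart."""

    def __init__(self):
        self.postings = defaultdict(set)  # token -> doc ids (the "indexed" side)
        self.stored = {}                  # doc id -> raw input (the "stored" side)

    def add(self, doc_id, raw):
        # Indexing: only the analyzed tokens become searchable.
        for token in analyze(raw):
            self.postings[token].add(doc_id)
        # Storing: the original text is kept untouched for retrieval.
        self.stored[doc_id] = raw

    def search(self, term):
        # Matching uses the tokens; the value handed back is the stored raw text.
        ids = sorted(self.postings.get(term.lower(), set()))
        return [(doc_id, self.stored[doc_id]) for doc_id in ids]


index = ToyIndex()
index.add("1", "<p>Hello <b>World</b></p>")
print(analyze("<p>Hello <b>World</b></p>"))  # tokens that are searched
print(index.search("world"))                 # hit comes back with the raw HTML
print(index.search("b"))                     # the tag itself was never indexed
```

Searching for "world" matches because the markup was stripped before the tokens were indexed, yet the value returned with the hit still contains the markup, because that is the stored copy.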
When you index something, the input is run through your analyzer chain, and the tokens that result after all the transformations are what get written to the index. That's what you're seeing when you look at your index through the admin panel and report that the HTML is stripped. This is what's searched.

But when you fetch a field that has been stored, the original raw text is returned. This is never searched, just kept around for retrieval. So the HTML you see in your search results is the stored copy of what you sent, not a sign that the stripping failed.

The idea here is to be able to have your index contain some displayable text. Think about the title of a book, for instance "The Grapes of Wrath". You want to search it after it's been lower-cased, stop words removed, etc. But if you wanted to present it to a user, you sure wouldn't want to display "grapes wrath", which might be the tokens after lowercasing and removing stopwords.

HTH
Erick

On Sat, Mar 27, 2010 at 1:13 AM, Indika Tantrigoda <indik...@gmail.com> wrote:

> Hello to all,
>
> I've been working with Solr for a few weeks and I have gotten indexing and
> searching to work.
> However I am having trouble with indexing HTML content and using
> HTMLStripCharFilterFactory.
>
> My schema.xml looks like this
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>     ------
>     --------/>
>
> and I am indexing the HTML content using SolrJ as the client (with Spring
> being the framework).
>
> However when I do a search for all documents, the HTML content is also in my
> text field.
>
> But when I did an analysis using the Solr admin panel with HTML content it
> shows the tokens extracted properly with HTML tags removed.
>
> I found a similar issue at
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html
> but I am still unable to get it working. I am using Solr 1.4
>
> Any help regarding this is much appreciated.
>
> Thanks in advance.
>
> Regards,
> Indika