Hi Erick,

Thank you very much for the explanation. The example you gave made things clear. I ran some queries with my existing index and the results were as expected.

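In case it helps anyone who finds this thread later, here is a rough sketch of the indexed-versus-stored distinction from Erick's explanation quoted below. It is plain Java and does not use Solr or Lucene at all. The class name, the stop word list, and the regex-based tag stripping are assumptions made for illustration, and the real analyzer chain will behave differently in the details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class IndexedVsStored {
    // Hypothetical stop word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "an", "and"));

    /** Rough stand-in for the index-time analyzer chain: these tokens are what gets searched. */
    public static List<String> analyze(String raw) {
        // Crude tag removal; a real HTML strip filter also handles entities, comments, etc.
        String noTags = raw.replaceAll("<[^>]*>", " ");
        List<String> tokens = new ArrayList<>();
        for (String t : noTags.toLowerCase(Locale.ROOT).split("\\s+")) {
            // Skip the empty leading token and any stop words.
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Stand-in for a stored field: the raw input comes back unchanged on retrieval. */
    public static String stored(String raw) {
        return raw;
    }

    public static void main(String[] args) {
        String raw = "<b>The Grapes of Wrath</b>";
        System.out.println("indexed tokens: " + analyze(raw)); // [grapes, wrath]
        System.out.println("stored value:   " + stored(raw));  // <b>The Grapes of Wrath</b>
    }
}
```

This matches what I was seeing: the analysis page shows the token side, while a search that returns the field shows the stored side, HTML included.
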
Regards,
Indika

On 27 March 2010 17:09, Erick Erickson <erickerick...@gmail.com> wrote:

> I think you're getting confused by the difference between indexing and
> storing. These are orthogonal operations for all they occur in the same
> definition.
>
> When you index something, the input is put through your analyzer chain, and
> the resulting tokens are stored after all appropriate transformations, which
> is what you're seeing when you look at your index through the admin panel
> and report the html is stripped. This is what's searched.
>
> But when you fetch a field that has been stored, the original raw text is
> returned. This is never searched, just kept around for retrieval.
>
> The idea here is to be able to have your index contain some displayable
> text. Think about the title of a book, for instance "The Grapes of Wrath".
> You want to search it after it's been lower-cased, stop words removed, etc.
> But if you wanted to present it to a user, you sure wouldn't want to display
> "grapes wrath" which might be the tokens after lowercasing and removing
> stopwords.
>
> HTH
> Erick
>
> On Sat, Mar 27, 2010 at 1:13 AM, Indika Tantrigoda <indik...@gmail.com> wrote:
>
> > Hello to all,
> >
> > I've been working with Solr for a few weeks and I have gotten indexing and
> > searching to work.
> > However I am having trouble with indexing HTML content and using
> > HTMLStripCharFilterFactory.
> >
> > My schema.xml looks like this
> >
> > <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
> >     ------
> >     --------/>
> >
> > and I am indexing the HTML content using SolrJ as the client (with Spring
> > being the framework).
> >
> > However when I do a search for all documents, the HTML content is also in my
> > text field.
> >
> > But when I did an analysis using the Solr admin panel with HTML content it
> > shows the tokens extracted properly with HTML tags removed.
> >
> > I found a similar issue at
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html
> > but I am still unable to get it working. I am using Solr 1.4.
> >
> > Any help regarding this is much appreciated.
> >
> > Thanks in advance.
> >
> > Regards,
> > Indika