I think you're confusing indexing with storing. These are orthogonal
operations, even though both are configured in the same definition.

When you index something, the input is put through your analyzer chain, and
the tokens that result after all the transformations are what go into the
index. That is what you're seeing when you look at your index through the
admin panel and report that the HTML is stripped. This is what's searched.

But when you fetch a field that has been stored, the original raw text is
returned. This is never searched, just kept around for retrieval.

The idea here is to be able to have your index contain some displayable
text. Think about the title of a book, for instance "The Grapes of Wrath".
You want to search it after it's been lower-cased, stop words removed, etc.
But if you wanted to present it to a user, you sure wouldn't want to display
"grapes wrath", which might be the tokens after lowercasing and removing
stop words.
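
To make the distinction concrete, here is a toy sketch in Python. It is not
Solr code and not how Solr is implemented; analyze(), ToyIndex and the
stop-word list are all made up for illustration. It only shows the idea that
the analyzed tokens are what get searched, while the stored value is the raw
input handed back unchanged.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "of", "a", "an", "and"}

def analyze(text):
    """Toy analyzer chain: strip HTML tags, lower-case, drop stop words."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    tokens = re.findall(r"\w+", no_tags.lower())
    return [t for t in tokens if t not in STOP_WORDS]

class ToyIndex:
    def __init__(self):
        self.postings = {}  # token -> set of doc ids (the "indexed" side)
        self.stored = {}    # doc id -> raw input (the "stored" side)

    def add(self, doc_id, raw_text):
        # Storing keeps the original text exactly as it was submitted.
        self.stored[doc_id] = raw_text
        # Indexing records only the tokens produced by the analyzer.
        for token in analyze(raw_text):
            self.postings.setdefault(token, set()).add(doc_id)

    def search(self, query):
        # Match on analyzed tokens, but return the stored raw text.
        ids = None
        for token in analyze(query):
            hits = self.postings.get(token, set())
            ids = hits if ids is None else ids & hits
        return [self.stored[i] for i in sorted(ids or [])]

index = ToyIndex()
index.add(1, "<b>The Grapes of Wrath</b>")
print(analyze("<b>The Grapes of Wrath</b>"))  # ['grapes', 'wrath']
print(index.search("grapes"))                 # ['<b>The Grapes of Wrath</b>']
```

The search matches on the stripped, lower-cased tokens, yet what comes back
still contains the markup, which mirrors what you are seeing in your results.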

HTH
Erick

On Sat, Mar 27, 2010 at 1:13 AM, Indika Tantrigoda <indik...@gmail.com> wrote:

> Hello to all,
>
> I've been working with Solr for a few weeks and I have gotten indexing and
> searching to work.
> However I am having trouble with indexing HTML content and using
> HTMLStripCharFilterFactory.
>
> My schema.xml looks like this
>
>  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>      <analyzer type="index">
>         <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
>      ------
>  --------/>
>
> and I am indexing the HTML content using SolrJ as the client (with Spring
> being the framework).
>
> However when I do a search for all documents, the HTML content is also in
> my text field.
>
> But when I did an analysis using the Solr admin panel with HTML content it
> shows the tokens extracted properly with HTML tags removed.
>
> I found a similar issue at
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html
> but I am still unable to get it working. I am using Solr 1.4
>
> Any help regarding this is much appreciated.
>
> Thanks in advance.
>
> Regards,
> Indika
>
