Re: SolrJ and HTMLStripCharFilterFactory

Indika Tantrigoda Sat, 27 Mar 2010 08:55:18 -0700

Hi Erick,

Thank you very much for the explanation. The example you gave made things
clear. I ran some queries with my existing index and the results were as
expected.


Regards,
Indika

On 27 March 2010 17:09, Erick Erickson <erickerick...@gmail.com> wrote:

> I think you're getting confused by the difference between indexing and
> storing. These are orthogonal operations, even though they occur in the
> same field definition.
>
> When you index something, the input is put through your analyzer chain, and
> the resulting tokens are stored after all appropriate transformations. That
> is what you're seeing when you look at your index through the admin panel
> and report that the HTML is stripped. This is what's searched.
>
> But when you fetch a field that has been stored, the original raw text is
> returned. This is never searched, just kept around for retrieval.
>
> The idea here is to be able to have your index contain some displayable
> text. Think about the title of a book, for instance "The Grapes of Wrath".
> You want to search it after it's been lower-cased, stop words removed, etc.
> But if you wanted to present it to a user, you sure wouldn't want to display
> "grapes wrath", which might be the tokens after lowercasing and removing
> stopwords.
>
> HTH
> Erick
>
> On Sat, Mar 27, 2010 at 1:13 AM, Indika Tantrigoda <indik...@gmail.com>
> wrote:
>
> > Hello to all,
> >
> > I've been working with Solr for a few weeks and I have gotten indexing
> > and searching to work.
> > However I am having trouble with indexing HTML content and using
> > HTMLStripCharFilterFactory.
> >
> > My schema.xml looks like this
> >
> >  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
> >      <analyzer type="index">
> >         <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
> >      ------
> >  --------/>
> >
> > and I am indexing the HTML content using SolrJ as the client (with Spring
> > being the framework).
> >
> > However, when I search for all documents, the HTML markup is still in my
> > text field.
> >
> > But when I ran an analysis in the Solr admin panel with HTML content, it
> > showed the tokens extracted properly, with the HTML tags removed.
> >
> > I found a similar issue at
> > http://www.mail-archive.com/solr-user@lucene.apache.org/msg28736.html
> > but I am still unable to get it working. I am using Solr 1.4.
> >
> > Any help regarding this is much appreciated.
> >
> > Thanks in advance.
> >
> > Regards,
> > Indika
> >
>
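
For anyone who finds this thread later: the indexed-versus-stored distinction Erick describes above can be sketched in plain Java. This is only an illustration, not Solr's implementation. The class name, the stop-word list, and the regex-based tag removal are all assumptions made up for the example, and a real analyzer chain handles far more cases (entities, comments, and so on).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; nothing here is part of Solr or SolrJ.
public class IndexedVsStored {

    // Assumed stop-word list, for illustration only.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "an", "and"));

    /** Stand-in for the index-time analyzer chain: strip tags, lowercase, drop stop words. */
    static List<String> analyze(String raw) {
        // Crude tag removal; a real HTML-stripping filter is much more thorough.
        String text = raw.replaceAll("<[^>]*>", " ");
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            String token = word.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Stand-in for a stored field: the raw input is kept exactly as submitted. */
    static String store(String raw) {
        return raw;
    }

    public static void main(String[] args) {
        String raw = "<b>The Grapes of Wrath</b>";
        System.out.println("searched: " + analyze(raw)); // tokens used for matching
        System.out.println("returned: " + store(raw));   // value handed back in results
    }
}
```

The tokens from analyze() are what a query matches against, while store() returns the value a search result would show, which is why the markup is still visible in the returned field. If the returned value should be free of HTML as well, the markup would have to be removed on the client before the document is sent for indexing.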
