Re: Indexing HTML Content

solr Thu, 22 May 2008 02:33:00 -0700

Hi,

Maybe this one?


http://htmlparser.sourceforge.net/
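
If pulling in a library is more than the batch job needs, the same idea (strip the markup, keep only the text, then hand the plain text to SOLR) can be sketched with nothing but the Python standard library. This is only a rough illustration and not the htmlparser API from the link above; the names `_TextExtractor` and `html_to_text` are made up for this example, and it assumes that the contents of script and style elements should be dropped.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    # Assumption: these elements never contain article text worth indexing.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join chunks with a space so adjacent blocks do not run together,
        # then collapse the whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Running `html_to_text` over each file in the batch would give plain text that can go into the field being indexed. Joining the chunks with a space can leave a stray space next to inline tags, which should not matter much for indexing but would for display.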

/Jimi

Quoting "McBride, John" <[EMAIL PROTECTED]>:

Hello,

In my application I wish to index articles which are stored in HTML
format.

When these are indexed, the HTML markup gets stored along with the
content of the article, which is undesirable.

Do you know of any common way of extracting the text content from HTML
before adding it to SOLR?  I understand SOLR 1.3 has an HTML analyser,
but I am using SOLR 1.2 and won't use 1.3 until it's stable, so I am
looking for a solution that works on a batch of files before they are
added to SOLR.

Thanks,
John
