Hi,

Maybe this one?

http://htmlparser.sourceforge.net/
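
If adding a library is not an option, a rough sketch of the same idea is
below. Note that it does not use htmlparser: it is written against the
HTMLParser class in Python's standard library, and the names TextExtractor
and html_to_text are made up for illustration. It assumes the markup should
simply be dropped, along with the contents of script and style elements,
before the text is sent to SOLR.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores anything inside <script>/<style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Chunks are joined with a space, so a word split by inline markup
        # (e.g. wor<b>ld</b>) would come out as two tokens. Runs of
        # whitespace left behind by the removed tags are collapsed.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a batch of files, each file could be read, passed through
html_to_text, and the result used as the content field of the document
that gets posted to SOLR.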

/Jimi

Quoting "McBride, John" <[EMAIL PROTECTED]>:

Hello,

In my application I wish to index articles which are stored in HTML
format.

When these are indexed, the HTML markup gets stored along with the content
of the article, which is undesirable.

Do you know of any common way of extracting the text content from HTML
before adding it to SOLR?  I understand SOLR 1.3 has an HTML analyser, but
I am using SOLR 1.2 and won't use 1.3 until it's stable, so I am looking
for a solution that works on a batch of files before they are added to
SOLR.

Thanks,
John


