Hi,

Maybe this one?
http://htmlparser.sourceforge.net/

/Jimi

Quoting "McBride, John" <[EMAIL PROTECTED]>:
Hello,

In my application I wish to index articles which are stored in HTML format. When these are indexed, the HTML markup gets stored along with the content of the article, which is undesirable.

Do you know of any common way of extracting the text content from HTML before adding it to SOLR? I understand SOLR 1.3 has an HTML analyser, but I am using SOLR 1.2 and won't move to 1.3 until it's stable, so I am looking for a solution that can be run on a batch of files before they are added to SOLR.

Thanks,
John
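
For the batch pre-processing step asked about above, here is a minimal sketch of stripping markup before the text is sent to SOLR. It is only an illustration: it uses Python's standard-library html.parser module rather than the Java library linked in the reply, and the names TextExtractor and html_to_text are made up for this example. It assumes that dropping the contents of script and style elements is what you want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse the whitespace left behind by removed markup.
        # Note: a tag in the middle of a word will leave a space at that position.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each article could be passed through html_to_text and the result placed in the field that is posted to SOLR, so the index never sees the markup. Badly broken HTML may need a more forgiving parser than the standard-library one.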