I've never used HTMLParser, but if you have malformed, incomplete, or
optional HTML that would otherwise choke an HTML parser, you could use
Solr's HTMLStripping:

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e

It's pretty stand-alone, so it should be trivial to rip it out of Solr
and re-use it in your Lucene project.
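
To show the general idea of stripping tags from a string before it goes
into a Lucene field, here is a minimal sketch. It is NOT Solr's
HTMLStripping code and does not use HTMLParser; the class and method
names (NaiveHtmlStripper, stripTags) are made up for illustration. It
simply drops everything between '<' and '>', so it ignores entities,
comments, and script blocks, and a literal '<' in the text will swallow
whatever follows until the next '>'.

```java
// Hypothetical helper for illustration only; not part of Solr, Lucene, or HTMLParser.
final class NaiveHtmlStripper {

    // Removes anything between '<' and '>' and collapses whitespace.
    static String stripTags(String html) {
        StringBuilder out = new StringBuilder(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;                 // start skipping tag content
            } else if (c == '>' && inTag) {
                inTag = false;
                out.append(' ');              // keep words on either side of a tag apart
            } else if (!inTag) {
                out.append(c);                // ordinary text is copied through
            }
        }
        // Collapse runs of whitespace left behind by removed tags.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<LI><FONT size=2>Compact and sleek design </FONT>";
        System.out.println(stripTags(html));  // prints "Compact and sleek design"
    }
}
```

The stripped string can then be added to the document as a normal text
field. For real-world markup, a purpose-built stripper such as the Solr
one linked above is likely the safer choice.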

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server


On 7/13/06, Ross Rankin <[EMAIL PROTECTED]> wrote:
Since I cannot seem to access the HTMLParser mailing list and I saw the
library recommended here, I thought someone here who has used it
successfully could help me out.

I have HTML text stored in a database field which I want to add to a
Lucene document, but I want to remove the HTML tags, so HTMLParser
looked like it would fit the bill.



First, it does not seem to be parsing at all, which is my first problem.
It also throws an exception, and the phrase "(No such file or directory)"
is sprinkled through the output.



I think I may be using it wrong, so here's what I have done.  In my
object where I create my document, I have the following code:

        StringExtractor extract = new StringExtractor(
                record.get("column14").toString().trim());
        try {
            value = extract.extractStrings(false);
        } catch (ParserException pe) {
            System.out.println("Index Long Description Parser Exception:"
                    + pe.getMessage());
            value = "";
        }



What I get out in value is like the following:

<LI><FONT size=2>Crystal Clear III and 3D combfilter for natural, sharp
images with enhanced quality </FONT>

<LI><FONT size=2>Compact and sleek design </FONT>

<LI><FONT size=2>Incredible Surround (No such file or directory)



So the tags are still there and oddly the '(No such file or directory)'
phrase is added which is not in the original text.



Then I get a ParserException.



What am I doing wrong?



Thanks,

Ross

