HTMLParser

2006-07-13 Thread Ross Rankin
Since I cannot seem to access the HTMLParser mailing list and I saw the library recommended here, I thought someone here that has used it successfully can help me out. I have HTML text stored in a database field which I want to add to a Lucene document, but I want to remove the HTML tags, so

Re: HTMLParser

2006-07-13 Thread Yonik Seeley
I've never used HTMLParser, but if you have malformed, incomplete, or optional HTML that would otherwise choke an HTML parser, you could use Solr's HTMLStripping: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#head-031d5d370010955fdcc529d208395cd556f4a73e It's pre

RE: HTMLParser

2006-07-14 Thread Ross Rankin
rsday, July 13, 2006 4:34 PM To: java-user Subject: HTMLParser Since I cannot seem to access the HTMLParser mailing list and I saw the library recommended here, I thought someone here that has used it successfully can help me out. I have HTML text stored in a database field which I want to add to

Re: HTMLParser

2006-07-15 Thread Charles Bell
ing + new String(ch,start,length); } }

--- Ross Rankin <[EMAIL PROTECTED]> wrote:
> Since I cannot seem to access the HTMLParser mailing list and I saw the
> library recommended here, I thought someone here that has used it
> successfully can help me out.
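
The reply in this entry is truncated, but its visible fragment (new String(ch,start,length)) looks like text being accumulated inside a parser callback. As a rough sketch of that general idea, and not Charles Bell's actual code or the HTMLParser library's API, the following uses the HTML parser that ships with the JDK (javax.swing.text.html) to collect text content and drop the tags. The class name HtmlTextExtractor and the method strip are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects the text nodes reported by the JDK's
// Swing HTML parser and ignores the tags around them.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' '); // keep adjacent text runs from fusing together
        }
        text.append(data);
    }

    // Parses the given HTML string and returns only its text content.
    static String strip(String html) throws IOException {
        HtmlTextExtractor extractor = new HtmlTextExtractor();
        try (Reader reader = new StringReader(html)) {
            // true = ignore any charset declared inside the document,
            // since the input is already a decoded Java String
            new ParserDelegator().parse(reader, extractor, true);
        }
        return extractor.text.toString();
    }
}
```

Under these assumptions, the string returned by HtmlTextExtractor.strip(html) could be stored in a Lucene document field in place of the raw HTML read from the database. The exact whitespace between text runs depends on how the parser splits them, so the result may need normalizing before indexing.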

HTMLParser and Chinese

2007-09-14 Thread Jennifer May
), IndexHTML.encoding); HTMLParser parser = new HTMLParser(fis); It works fine for most of my files in GB, Big5 or UTF-8. However, I get the following exception for some of my files: Parse Aborted: Lexical error at line 6, column 24. Encountered: "\u4f53" (20307), after : "" The