Hi,
Maybe this one?
http://htmlparser.sourceforge.net/
/Jimi
Quoting McBride, John [EMAIL PROTECTED]:
Hello,
In my application I wish to index articles which are stored in HTML
format.
When I index these, the HTML markup gets stored along with the content of
the article, which is undesirable.
Actually, it's very easy: http://us2.php.net/strip_tags
I also store the data in a separate field with the html intact for
display. In that case, I use urlencode on the string.
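
For anyone not on PHP, here is a rough sketch of the same two-field idea in
Python. It is only an illustration under a few assumptions: the TagStripper
class and the strip_tags helper are hypothetical names (not PHP's strip_tags
and not part of Solr), and the sample article string is made up. The stripped
text would go into the indexed field, and the URL-encoded original into the
stored display field.

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote


class TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML fragment (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; tags themselves are ignored.
        # Note: text inside <script>/<style> would also be collected here.
        self.parts.append(data)


def strip_tags(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join("".join(stripper.parts).split())


# Made-up sample article, for illustration only.
article = "<p>Hello <b>world</b> &amp; friends</p>"

index_text = strip_tags(article)   # value for the searchable field
display_value = quote(article)     # URL-encoded original for the display field

# Decode the stored value again before rendering it.
restored = unquote(display_value)
```

The display field has to be decoded (unquote here) before it is shown, so
whatever reads the search results needs to know about that encoding.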
David
McBride, John wrote:
Hello,
In my application I wish to index articles which are stored in HTML
format.
the amount of
string processing it does, the fact that it is a Reader probably does not
affect its performance.
Cheers,
Lance
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 22, 2008 10:14 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing HTML
You need to encode your HTML content so it can be included as a normal
'string' value in your XML element.
As far as I remember, the only unsafe characters you have to encode as
entities are:
- &lt; (for <)
- &gt; (for >)
- &quot; (for ")
- &amp; (for &)
(google xml entities to be sure).
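
As one possible illustration in Python, the standard library's
xml.sax.saxutils.escape covers &, < and > by default, and quotes can be added
through its optional mapping. The helper name to_xml_text, the field name
"body" and the sample markup are assumptions made up for this sketch, not
anything required by Solr.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > on its own; quotes are added via this mapping.
QUOTES = {'"': "&quot;", "'": "&apos;"}


def to_xml_text(html):
    """Escape an HTML string so it can sit inside an XML element as plain text."""
    return escape(html, QUOTES)


# Made-up sample content and a hypothetical field name.
html = '<a href="x.html">Tom & Jerry</a>'
field = '<field name="body">' + to_xml_text(html) + "</field>"

# An XML parser reading the field gets the original HTML string back.
parsed = ET.fromstring(field).text
```

The ampersand has to be escaped before the other characters, otherwise the
& in an already produced &lt; would be escaped a second time; escape() takes
care of that ordering.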
I don't know what language you use, but