Re: Indexing HTML Content

2008-05-22 Thread solr
Hi, Maybe this one? http://htmlparser.sourceforge.net/ /Jimi Quoting McBride, John [EMAIL PROTECTED]: Hello, In my application I wish to index articles which are stored in HTML format. Upon indexing these the html gets stored along with the content of the article, which is undesirable.

Re: Indexing HTML Content

2008-05-22 Thread David Arpad Geller
Actually, it's very easy: http://us2.php.net/strip_tags I also store the data in a separate field with the html intact for display. In that case, I use urlencode on the string. David McBride, John wrote: Hello, In my application I wish to index articles which are stored in HTML format.

RE: Indexing HTML Content

2008-05-22 Thread Lance Norskog
the amount of string processing it does, the fact that it is a Reader probably does not affect its performance. Cheers, Lance -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Thursday, May 22, 2008 10:14 AM To: solr-user@lucene.apache.org Subject: Re: Indexing HTML

Re: Indexing HTML content... (Embed HTML into XML?)

2007-08-22 Thread Jérôme Etévé
You need to encode your html content so it can be include as a normal 'string' value in your xml element. As far as remember, the only unsafe characters you have to encode as entities are: - lt; - gt; - quote; - amp; (google xml entities to be sure). I dont know what language you use , but