Hi Karl,

I already submitted a piece of code that removes the HTML tags.
Search for my previous answer in this thread.

 Best,

  Sergiu

Karl Koch wrote:

Hello,

I have been following this thread and have another question.

Is there a piece of source code (preferably very short and simple (KISS))
that can remove all HTML tags from HTML content? HTML 3.2 would be
enough... also no frames, CSS, etc.


I do not need to have the HTML structure tree or any other structure but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.

Karl
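
For readers of the archive: the following is a minimal sketch of the kind of short tag stripper Karl asks for. It is not Sergiu's code from earlier in the thread, and the class name TagStripper and method strip are hypothetical names chosen for illustration. It assumes simple HTML 3.2-style input: it does not decode entities such as &amp;, it does not skip script or style content, and a stray '<' in running text is treated as the start of a tag.

```java
public class TagStripper {
    /**
     * Returns the text of html with tags and comments removed.
     * Each removed tag is replaced by a space so that words separated
     * only by markup (e.g. "</h1><p>") do not run together.
     */
    public static String strip(String html) {
        StringBuilder out = new StringBuilder(html.length());
        int i = 0;
        int n = html.length();
        while (i < n) {
            if (html.startsWith("<!--", i)) {
                // Skip a comment; an unterminated comment swallows the rest.
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? n : end + 3;
                out.append(' ');
            } else if (html.charAt(i) == '<') {
                // Skip a tag; an unterminated tag swallows the rest.
                int end = html.indexOf('>', i + 1);
                i = (end < 0) ? n : end + 1;
                out.append(' ');
            } else {
                out.append(html.charAt(i));
                i++;
            }
        }
        // Collapse runs of whitespace left behind by removed markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }
}
```

The result of strip(...) can then be handed to the indexer as one block of text. Because every tag becomes a space, inline markup splits tokens ("<b>world</b>!" becomes "world !"), which is usually harmless for indexing but not for display.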




I think that depends on what you want to do. The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes. I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML
validation and error detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it.  It has been robust for me so
far.

Chuck
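
For readers of the archive: once a parser has produced a DOM, pulling out the text for indexing is a short tree walk over the standard org.w3c.dom API. The sketch below is an illustration, not Chuck's code. The class name DomText and its methods are hypothetical, and the JDK's own XML parser stands in for CyberNeko so the example has no external dependency. That stand-in only accepts well-formed markup; with real-world HTML the Document would come from an HTML-tolerant parser such as CyberNeko instead.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomText {
    /** Appends the text found under node to out, depth first. */
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    /** Returns all text in the document with whitespace collapsed. */
    public static String textOf(Document doc) {
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    /**
     * Stand-in for an HTML parser (assumption: input is well-formed XHTML).
     * Wraps parser errors in an unchecked exception for brevity.
     */
    public static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }
}
```

The same collectText walk works on any org.w3c.dom.Document, so only parseWellFormed would need to be swapped out. Having the tree also makes it easy to skip or treat specially certain elements (for example, indexing the title separately), which a plain tag stripper cannot do.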

> -----Original Message-----
> From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 1:15 AM
> To: lucene-user@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application demo, CyberNeko HTML
> Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the
> best? Can it filter tags that are auto-created by the MS-Word
> 'Save As HTML files' function?
> > ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
