Re: which HTML parser is better?

sergiu gordea Wed, 02 Feb 2005 06:28:54 -0800

Karl Koch wrote:

Hi,
Yes, but the library you are using is quite big. I was thinking that about 5 kB of code could actually do that. That SourceForge project does much more than that, but I do not need it.

You just need htmlparser.jar, about 200k.
... you know ... the functionality is strongly correlated with the size.

You can use 3 lines of code with a good regular expression to eliminate the HTML tags, but this won't give you any guarantee that the text from badly formatted HTML files will be correctly extracted...
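
For illustration, the regular-expression approach described above could look roughly like the following. This is only a minimal sketch, not the code posted earlier in the thread; the class name HtmlTagStripper, the method stripTags and the exact patterns are assumptions chosen for the example. It does not decode entities such as &amp;.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or htmlparser.
public class HtmlTagStripper {
    // Comments are removed first so that a '>' inside them cannot confuse the tag pattern.
    private static final Pattern COMMENTS =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    // script and style elements are removed together with their content.
    private static final Pattern SCRIPT_STYLE =
            Pattern.compile("<(script|style)\\b[^>]*>.*?</\\1\\s*>",
                    Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    // Anything from '<' up to the next '>' is treated as a tag.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripTags(String html) {
        String text = COMMENTS.matcher(html).replaceAll(" ");
        text = SCRIPT_STYLE.matcher(text).replaceAll(" ");
        text = TAGS.matcher(text).replaceAll(" ");
        // Collapse the whitespace left behind by the removed markup.
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

The lack of any guarantee shows up quickly: for the input "1 < 2 and 3 > 2" this sketch treats "< 2 and 3 >" as a tag and returns "1 2".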

 Best,

 Sergiu

Karl
 Hi Karl,
I already submitted a piece of code that removes the HTML tags.
Search for my previous answer in this thread.
 Best,
  Sergiu
Karl Koch wrote:
Hello,
I have been following this thread and have another question.

Is there a piece of source code (preferably very short and simple (KISS)) that removes all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc.

I do not need the HTML structure tree or any other structure, but I need a facility to clean up HTML into its normal underlying content before indexing that content as a whole.
Karl
I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far.
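
As a rough illustration of the parser-based route (as opposed to a regular expression), text can be collected through parser callbacks. The sketch below does not use CyberNeko, JTidy or the Lucene demo parser; it swaps in the HTML 3.2 parser that ships with the JDK (javax.swing.text.html), only so that the example is self-contained. The class name ParserTextExtractor and the method extractText are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; uses the JDK's built-in HTML parser.
public class ParserTextExtractor {
    public static String extractText(String html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of character data found between tags.
                out.append(data).append(' ');
            }
        };
        // The third argument tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        // Normalise the whitespace introduced by joining the text runs.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractText("<p>Hello <b>world</b></p>"));
    }
}
```

A callback-style parser like this yields no DOM; when the document tree itself is needed, a DOM-building parser such as CyberNeko is the better fit.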
Chuck
> -----Original Message-----
> From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, February 01, 2005 1:15 AM
> To: lucene-user@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by the MS Word 'Save As HTML files' function?
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
