If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
Otis --- sergiu gordea <[EMAIL PROTECTED]> wrote: > Karl Koch wrote: > > >I am in control of the html, which means it is well formated HTML. I > use > >only HTML files which I have transformed from XML. No external HTML > (e.g. > >the web). > > > >Are there any very-short solutions for that? > > > > > if you are using only correct formated HTML pages and you are in > control > of these pages. > you can use a regular exprestion to remove the tags. > > something like > replaceAll("<*>",""); > > This is the ideea behind the operation. If you will search on google > you > will find a more robust > regular expression. > > Using a simple regular expression will be a very cheap solution, that > > can cause you a lot of problems in the future. > > It's up to you to use it .... > > Best, > > Sergiu > > >Karl > > > > > > > >>Karl Koch wrote: > >> > >> > >> > >>>Hi, > >>> > >>>yes, but the library your are using is quite big. I was thinking > that a > >>> > >>> > >>5kB > >> > >> > >>>code could actually do that. That sourceforge project is doing > much more > >>>than that but I do not need it. > >>> > >>> > >>> > >>> > >>you need just the htmlparser.jar 200k. > >>... you know ... the functionality is strongly correclated with the > size. > >> > >> You can use 3 lines of code with a good regular expresion to > eliminate > >>the html tags, > >>but this won't give you any guarantie that the text from the bad > >>fromated html files will be > >>correctly extracted... > >> > >> Best, > >> > >> Sergiu > >> > >> > >> > >>>Karl > >>> > >>> > >>> > >>> > >>> > >>>> Hi Karl, > >>>> > >>>>I already submitted a peace of code that removes the html tags. > >>>>Search for my previous answer in this thread. > >>>> > >>>> Best, > >>>> > >>>> Sergiu > >>>> > >>>>Karl Koch wrote: > >>>> > >>>> > >>>> > >>>> > >>>> > >>>>>Hello, > >>>>> > >>>>>I have been following this thread and have another question. > >>>>> > >>>>>Is there a piece of sourcecode (which is preferably very short > and > >>>>> > >>>>> > >>simple > >> > >> > >>>>>(KISS)) which allows to remove all HTML tags from HTML content? > HTML > >>>>> > >>>>> > >>3.2 > >> > >> > >>>>>would be enough...also no frames, CSS, etc. > >>>>> > >>>>>I do not need to have the HTML strucutre tree or any other > structure > >>>>> > >>>>> > >>but > >> > >> > >>>>>need a facility to clean up HTML into its normal underlying > content > >>>>> > >>>>> > >>>>> > >>>>> > >>>>before > >>>> > >>>> > >>>> > >>>> > >>>>>indexing that content as a whole. > >>>>> > >>>>>Karl > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>I think that depends on what you want to do. The Lucene demo > parser > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>does > >>>> > >>>> > >>>> > >>>> > >>>>>>simple mapping of HTML files into Lucene Documents; it does not > give > >>>>>> > >>>>>> > >>you > >> > >> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>a > >>>> > >>>> > >>>> > >>>> > >>>>>>parse tree for the HTML doc. CyberNeko is an extension of > Xerces > >>>>>> > >>>>>> > >>(uses > >> > >> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>the > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>same API; will likely become part of Xerces), and so maps an > HTML > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>document > >>>> > >>>> > >>>> > >>>> > >>>>>>into a full DOM that you can manipulate easily for a wide range > of > >>>>>>purposes. I haven't used JTidy at an API level and so don't > know it > >>>>>> > >>>>>> > >>as > >> > >> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>well -- > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>>based on its UI, it appears to be focused primarily on HTML > validation > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>and > >>>> > >>>> > >>>> > >>>> > >>>>>>error detection/correction. > >>>>>> > >>>>>>I use CyberNeko for a range of operations on HTML documents > that go > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>beyond > >>>> > >>>> > >>>> > >>>> > >>>>>>indexing them in Lucene, and really like it. It has been > robust for > >>>>>> > >>>>>> > >>me > >> > >> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>so > >>>> > >>>> > >>>> > >>>> > >>>>>>far. > >>>>>> > >>>>>>Chuck > >>>>>> > >>>>>> > >>>>>> > >>>>>>>-----Original Message----- > >>>>>>>From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > >>>>>>>Sent: Tuesday, February 01, 2005 1:15 AM > >>>>>>>To: lucene-user@jakarta.apache.org > >>>>>>>Subject: which HTML parser is better? > >>>>>>> > >>>>>>>Three HTML parsers(Lucene web application > >>>>>>>demo,CyberNeko HTML Parser,JTidy) are mentioned in > >>>>>>>Lucene FAQ > >>>>>>>1.3.27.Which is the best?Can it filter tags that are > >>>>>>>auto-created by MS-word 'Save As HTML files' function? > >>>>>>> > >>>>>>>_________________________________________________________ > >>>>>>>Do You Yahoo!? > >>>>>>>150万曲MP3疯狂搜,带您闯入音乐殿堂 > >>>>>>>http://music.yisou.com/ > >>>>>>>美女明星应有尽有,搜遍美图、艳图和酷图 > >>>>>>>http://image.yisou.com > >>>>>>>1G就是1000兆,雅虎电邮自助扩容! > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma > >>>>> > >>>>> > >>>>>>>il_1g/ > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>--------------------------------------------------------------------- > >>>> > >>>> > >>>> > >>>> > >>>>>>>To unsubscribe, e-mail: > [EMAIL PROTECTED] > >>>>>>>For additional commands, e-mail: > >>>>>>> > >>>>>>> > >>[EMAIL PROTECTED] > >> > >> > >>>>>>--------------------------------------------------------------------- > >>>>>>To unsubscribe, e-mail: > [EMAIL PROTECTED] > >>>>>>For additional commands, e-mail: > [EMAIL PROTECTED] > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>--------------------------------------------------------------------- > >>>>To unsubscribe, e-mail: > [EMAIL PROTECTED] > >>>>For additional commands, e-mail: > [EMAIL PROTECTED] > >>>> > >>>> > >>>> > >>>> > >>>> > >>> > >>> > >>> > >>> > >>--------------------------------------------------------------------- > >>To unsubscribe, e-mail: [EMAIL PROTECTED] > >>For additional commands, e-mail: > [EMAIL PROTECTED] > >> > >> > >> > > > > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]