Oops. It's in the Google cache and also the Internet Archive Wayback
Machine. I'll drop the original author a note to let him know that
his links are stale.
http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
Ian
"Karl Koch" <[EMAIL PROTECTED]> wri
The link does not work.
>
> One which we've been using can be found at:
> http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
>
> We absolutely need to be able to recover gracefully from malformed
> HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there
> failed this criterion whe
One which we've been using can be found at:
http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
We absolutely need to be able to recover gracefully from malformed
HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there
failed this criterion when we started our effort. The above one
For all the parser suggestions, I think there is one important attribute. Some
parsers return data provided that the input HTML is sensible. Some parsers
are designed to be as flexible and tolerant as they can be. If the input is
clean and controlled, the former class is sufficient. Even some regular
Thank you, I will do that.
> Karl Koch wrote:
>
> >I apologise in advance if some of my writing here has been said before.
> >The last three answers to my question have been suggesting pattern matching
> >solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
> >is somet
Karl,
Two things, try to experiment with both:
1) I would try to write a lexical scanner that strips HTML tags, much
like the regular expression does. Java lexical scanner packages produce
nice pure Java classes that seldom use any advanced API, so they should
work on Java 1.1. They are simple s
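A minimal sketch of the hand-written scanner idea, assuming the only goal is to drop everything between '<' and '>'. The class name TagStripper and the method strip are made up for illustration. It sticks to String and StringBuffer, so it should not need anything newer than the Java 1.1 class library, but it does not decode entities such as &amp;, and a '>' inside an attribute value, a comment, or a script block will confuse it.

```java
// Hypothetical helper, not taken from any library mentioned in this thread.
final class TagStripper {

    // Copies every character that is outside a <...> pair.
    static String strip(String html) {
        StringBuffer out = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inTag) {
                // Inside a tag: skip characters until the closing '>'.
                if (c == '>') {
                    inTag = false;
                }
            } else if (c == '<') {
                inTag = true;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Hello, world!
        System.out.println(strip("<p>Hello, <b>world</b>!</p>"));
    }
}
```

For well-formed HTML generated from XML this may be enough; for anything messier, one of the tolerant parsers discussed in this thread is the safer choice.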
Karl Koch wrote:
I apologise in advance if some of my writing here has been said before.
The last three answers to my question have been suggesting pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
is something I cannot use since I work with Java 1.1 on a
> >>>>>>>>I think that depends on what you want to do.
-----Original Message-----
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?
Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned
I apologise in advance if some of my writing here has been said before.
The last three answers to my question have been suggesting pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
is something I cannot use since I work with Java 1.1 on a PDA.
I am wonde
day, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?
Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-cr
> > >>>>does
> > >>>>>>simple mapping of HTML files into Lucene Docu
give
> >>you
> >>>>a
AM
> To: lucene-user@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application
> demo, CyberNeko HTML Parser, JTidy) are mentioned in
> Lucene FAQ
> 1.3.27. Which is the best? Can it filter tags that are
> auto-cr
No one has yet mentioned using ParserDelegator and ParserCallback that
are part of HTMLEditorKit in Swing. I have been successfully using
these classes to parse out the text of an HTML file. You just need to
extend HTMLEditorKit.ParserCallback and override the various methods
that are called
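As a rough sketch of the approach described above: a subclass of HTMLEditorKit.ParserCallback that collects only the text chunks handed to handleText, driven by ParserDelegator. The class name SwingTextExtractor and the method extract are hypothetical, and joining chunks with a single space is an assumption about what a caller would want. This needs Swing, so it is not an option on a Java 1.1 runtime.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the Swing HTML parser classes named above.
final class SwingTextExtractor extends HTMLEditorKit.ParserCallback {

    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' '); // assumption: separate chunks with one space
        }
        text.append(data);
    }

    // Parses the given HTML and returns only the text the parser reported.
    static String extract(String html) {
        SwingTextExtractor callback = new SwingTextExtractor();
        try {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Overriding handleStartTag or handleSimpleTag as well would allow skipping or specially treating particular elements; that is left out here to keep the sketch short.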
-
> > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, February 01, 2005 1:15 AM
> > > To: lucene-user@jakarta.apache.org
> > > Subject: which HTML parser is better?
> > >
> > > Three HTML parsers (Lucene web
"
Sent: Wednesday, February 02, 2005 1:23 PM
Subject: Re: which HTML parser is better?
> Karl Koch wrote:
>
> >I am in control of the HTML, which means it is well-formatted HTML. I use
> >only HTML files which I have transformed from XML. No external HTML (e.g.
> >th
> >>>>>>simple mapping of HTML files into Lucene Documents; it does not give you
ke it. It has been robust for me so far.
Chuck
-----Original Message-----
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?
> >>document
> >>>>into a full DOM that you can manipulate easily for a wide range of
> >>>>purposes. I haven't used JTidy at an API level and so don't know it as
AIL PROTECTED]
> Sent: Tuesday, February 01, 2005 1:15 AM
> To: lucene-user@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application
> demo, CyberNeko HTML Parser, JTidy) are mentioned in
> Lucene FAQ
> 1.3.27. Which is the best? Can
and
> >>error detection/correction.
> >>
> >>I use CyberNeko for a range of operations on HTML documents that go beyond
> >>indexing them in Lucene, and really like it. It has been robust for me so
> >>far.
> >>
> >>Chuck
>
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:
Hello,
I have been following this thread and have another question.
Is there a piece of source code (preferably very short and simple
(KISS)) that can remove all HTML tags from HTML content? HTML 3.2
would be enough...also no frames, CS
ser@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application
> demo, CyberNeko HTML Parser, JTidy) are mentioned in
> Lucene FAQ
> 1.3.27. Which is the best? Can it filter tags that are
> auto-creat
has been robust for me so
> far.
>
> Chuck
>
> > -----Original Message-----
> > From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, February 01, 2005 1:15 AM
> > To: lucene-user@jakarta.apache.org
> > Subject: which HTML parser is bet
ser@jakarta.apache.org
> Subject: which HTML parser is better?
>
> Three HTML parsers (Lucene web application
> demo, CyberNeko HTML Parser, JTidy) are mentioned in
> Lucene FAQ
> 1.3.27. Which is the best? Can it filter tags that are
> auto-creat
When I tested parsers a year or so ago for intensive use in Furl, the
best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
parser by far was TagSoup ( http://www.tagsoup.info ). It is actively
maintained and improved and I have never had any problems with it.
-Mike
Jingkang Zhang
Jingkang Zhang wrote:
>Three HTML parsers (Lucene web application
>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>Lucene FAQ
>1.3.27. Which is the best? Can it filter tags that are
>auto-created by the MS Word 'Save As HTML files' function?
>
>
maybe you can try this library...
http://htmlparser
Three HTML parsers (Lucene web application
demo, CyberNeko HTML Parser, JTidy) are mentioned in
Lucene FAQ
1.3.27. Which is the best? Can it filter tags that are
auto-created by the MS Word 'Save As HTML files' function?