Re: which HTML parser is better?

2005-02-04 Thread Ian Soboroff
Oops. It's in the Google cache and also in the Internet Archive's Wayback Machine. I'll drop the original author a note to let him know that his links are stale. http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ Ian "Karl Koch" <[EMAIL PROTECTED]> wri

Re: which HTML parser is better?

2005-02-04 Thread Karl Koch
The link does not work. > > One which we've been using can be found at: > http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ > > We absolutely need to be able to recover gracefully from malformed > HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there > failed this criterion whe

Re: which HTML parser is better?

2005-02-03 Thread Ian Soboroff
One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above one

Re: which HTML parser is better?

2005-02-03 Thread aurora
For all the parser suggestions, I think there is one important attribute. Some parsers return data provided that the input HTML is sensible. Others are designed to be as flexible and tolerant as they can be. If the input is clean and controlled, the former class is sufficient. Even some regular

Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch
Thank you, I will do that. > Karl Koch wrote: > > >I apologise in advance, if some of my writing here has been said before. > >The last three answers to my question have been suggesting pattern > matching > >solutions and Swing. Pattern matching was introduced in Java 1.4 and > Swing > >is somet

Re: which HTML parser is better?

2005-02-03 Thread Dawid Weiss
Karl, Two things, try to experiment with both: 1) I would try to write a lexical scanner that strips HTML tags, much like the regular expression does. Java lexical scanner packages produce nice pure Java classes that seldom use any advanced API, so they should work on Java 1.1. They are simple s
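The scanner idea above can be sketched by hand as a small state machine. This is only an illustration under stated assumptions: the message refers to scanners produced by Java lexical scanner packages, while the class below is hand-written, and the names `TagStripper` and `strip` are hypothetical. It sticks to `String` and `StringBuffer`, so it avoids both `java.util.regex` and Swing.

```java
// Hypothetical hand-written scanner that drops everything between '<' and '>'.
// It does not decode entities (&amp;) and does not treat comments or
// <script> bodies specially; it only tracks quoted attribute values.
public class TagStripper {
    private static final int TEXT = 0;    // outside any tag
    private static final int TAG = 1;     // inside <...>
    private static final int QUOTED = 2;  // inside a quoted attribute value

    public static String strip(String html) {
        StringBuffer out = new StringBuffer();
        int state = TEXT;
        char quote = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            switch (state) {
                case TEXT:
                    if (c == '<') {
                        state = TAG;          // tag starts, stop copying
                    } else {
                        out.append(c);        // ordinary text is kept
                    }
                    break;
                case TAG:
                    if (c == '>') {
                        state = TEXT;         // tag ends, resume copying
                    } else if (c == '"' || c == '\'') {
                        quote = c;            // remember which quote opened
                        state = QUOTED;
                    }
                    break;
                default: // QUOTED
                    if (c == quote) {
                        state = TAG;          // a '>' inside quotes is ignored
                    }
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
    }
}
```

For well-formed HTML generated from XML this may be enough; for malformed input, a tolerant parser of the kind discussed elsewhere in the thread is the safer choice.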

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote: I apologise in advance, if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I cannot use since I work with Java 1.1 on a

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
> I think that depends on what you want to do.

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I apologise in advance, if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I cannot use since I work with Java 1.1 on a PDA. I am wonde

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
day, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-cr

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
> >>>>>> > > >>>>>> > > >>>>>> > > >>>>does > > >>>> > > >>>> > > >>>> > > >>>> > > >>>>>>simple mapping of HTML files into Lucene Docu

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
> give you a

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
AM > To: lucene-user@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-cr

Re: which HTML parser is better?

2005-02-02 Thread Bill Tschumy
No one has yet mentioned using ParserDelegator and ParserCallback that are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called
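A minimal sketch of the approach described above, assuming a desktop JRE where Swing is available (so it does not apply to the Java 1.1 PDA case raised elsewhere in the thread). The class name `SwingTextExtractor` and the method `extract` are hypothetical; only `handleText` is overridden, and text chunks are joined with a single space as an illustrative choice.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical callback that collects only the text content of a document.
public class SwingTextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data it encounters.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');   // separate chunks so words do not run together
        }
        text.append(data);
    }

    public static String extract(String html) {
        SwingTextExtractor callback = new SwingTextExtractor();
        try {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Other `ParserCallback` methods such as `handleStartTag` and `handleEndTag` can be overridden in the same way if tag boundaries matter, for example to skip the contents of particular elements.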

RE: which HTML parser is better?

2005-02-02 Thread Kauler, Leto S
- > > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > > Sent: Tuesday, February 01, 2005 1:15 AM > > > To: lucene-user@jakarta.apache.org > > > Subject: which HTML parser is better? > > > > > > Three HTML parsers(Lucene web

Re: which HTML parser is better?

2005-02-02 Thread Luke Shannon
" Sent: Wednesday, February 02, 2005 1:23 PM Subject: Re: which HTML parser is better? > Karl Koch wrote: > > >I am in control of the html, which means it is well formated HTML. I use > >only HTML files which I have transformed from XML. No external HTML (e.g. > >th

Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
> simple mapping of HTML files into Lucene Documents; it does not give you

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
ke it. It has been robust for me so far. Chuck -Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better?

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
> document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
AIL PROTECTED] > Sent: Tuesday, February 01, 2005 1:15 AM > To: lucene-user@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
and > >>error detection/correction. > >> > >>I use CyberNeko for a range of operations on HTML documents that go > beyond > >>indexing them in Lucene, and really like it. It has been robust for me > so > >>far. > >> > >>Chuck >

Re: which HTML parser is better?

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (preferably very short and simple (KISS)) that can remove all HTML tags from HTML content? HTML 3.2 would be enough...also no frames, CS

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
ser@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-creat

RE: which HTML parser is better?

2005-02-02 Thread Karl Koch
has been robust for me so > far. > > Chuck > > > -Original Message- > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, February 01, 2005 1:15 AM > > To: lucene-user@jakarta.apache.org > > Subject: which HTML parser is bet

RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
ser@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-creat

Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page) parser by far was TagSoup ( http://www.tagsoup.info ). It is actively maintained and improved and I have never had any problems with it. -Mike Jingkang Zhang

Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote: >Three HTML parsers(Lucene web application >demo,CyberNeko HTML Parser,JTidy) are mentioned in >Lucene FAQ >1.3.27.Which is the best?Can it filter tags that are >auto-created by MS-word 'Save As HTML files' function? > > maybe you can try this library... http://htmlparser

which HTML parser is better?

2005-02-01 Thread Jingkang Zhang
Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?