Re: Re: The problem of using Cyber Neko HTML Parser parse HTML files

2005-02-17 Thread Jingkang Zhang
ue > 0x00A0) > > > - Original Message - > From: "Jingkang Zhang" <[EMAIL PROTECTED]> > To: > Sent: Friday, February 18, 2005 5:12 PM > Subject: The problem of using Cyber Neko HTML Parser > parse HTML files > > > > When I was usin

Re: The problem of using Cyber Neko HTML Parser parse HTML files

2005-02-17 Thread Jason Polites
This is not an unknown character; it is a non-breaking space (Unicode value 0x00A0). - Original Message - From: "Jingkang Zhang" <[EMAIL PROTECTED]> To: Sent: Friday, February 18, 2005 5:12 PM Subject: The problem of using Cyber Neko HTML Parser parse HTML files
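
To illustrate the point about U+00A0, here is a minimal sketch of how the non-breaking space could be turned into an ordinary space once the parser has handed back the node value. The class and method names (NbspNormalizer, normalize) are made up for this example and are not part of NekoHTML or Lucene; whether the character should be replaced at all depends on the application.

```java
// Hypothetical helper, not part of NekoHTML or Lucene.
class NbspNormalizer {

    // Replaces every non-breaking space (U+00A0) with a plain space (U+0020).
    static String normalize(String nodeValue) {
        return nodeValue.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        String nodeValue = "Jan\u00A021\u00A016:12";
        System.out.println(normalize(nodeValue));
    }
}
```

Applying something like this to the text of each node before it is indexed would keep the "unknown" character out of the tokens.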

The problem of using Cyber Neko HTML Parser parse HTML files

2005-02-17 Thread Jingkang Zhang
When I was using the CyberNeko HTML Parser to parse HTML files (created by Microsoft Word), if the file contains HTML built-in entity references (for example: &nbsp;), the node value may contain an unknown character. Like this: source html: -rw-r--r--    1 root root   50 Jan 21 16:12 _1e.f6 after

Re: which HTML parser is better?

2005-02-04 Thread Ian Soboroff
Oops. It's in the Google cache and also the Internet Archive Wayback machine. I'll drop the original author a note to let him know that his links are stale. http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ Ian "Karl Koch"

Re: which HTML parser is better?

2005-02-04 Thread Karl Koch
The link does not work. > > One which we've been using can be found at: > http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ > > We absolutely need to be able to recover gracefully from malformed > HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there

Re: which HTML parser is better?

2005-02-03 Thread Ian Soboroff
One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The

Re: which HTML parser is better?

2005-02-03 Thread aurora
Callback and override the various methods that are called when different tags are encountered. On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags th

Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch
g. I have been successfully using > >>these classes to parse out the text of an HTML file. You just need to > >>extend HTMLEditorKit.ParserCallback and override the various methods > >>that are called when different tags are encountered. > >> > >> >

Re: which HTML parser is better?

2005-02-03 Thread Dawid Weiss
xtend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered. On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered. On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: Three HTML parsers(Lucene web application demo,CyberNeko HT
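
The advice quoted above (extend HTMLEditorKit.ParserCallback and override the methods called for each kind of content) could look roughly like the sketch below. It uses the JDK's Swing parser through javax.swing.text.html.parser.ParserDelegator. The class name HtmlTextExtractor and the choice to join text runs with a single space are assumptions made for this example, not something taken from the thread.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: collects the text runs reported by the Swing HTML parser.
class HtmlTextExtractor extends HTMLEditorKit.ParserCallback {

    private final StringBuilder text = new StringBuilder();

    // Called by the parser for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' '); // separator between runs; an arbitrary choice for this sketch
        }
        text.append(data);
    }

    // Parses the given HTML string and returns the collected text.
    static String extractText(String html) {
        HtmlTextExtractor callback = new HtmlTextExtractor();
        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

The returned string can then be put into a Lucene field. As noted elsewhere in the thread, the Swing parser is based on the HTML 3.2 DTD, so how well it copes with Word-generated markup would need to be tested.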

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
> I think that depends on what you want to do.

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
ML parsers(Lucene web application > > demo,CyberNeko HTML Parser,JTidy) are mentioned in > > Lucene FAQ > > 1.3.27.Which is the best?Can it filter tags that are > > auto-created by MS-word 'Save As HTML files' function? > -

Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
day, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-cr

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
> >>>>>> > > >>>>>> > > >>>>>> > > >>>>does > > >>>> > > >>>> > > >>>> > > >>>> > > >>>>>>simple mapping of HTML files into Lucene Docu

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
give > >>>>>> > >>>>>> > >>you > >> > >> > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>a > >>

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
AM > To: lucene-user@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-cr

Re: which HTML parser is better?

2005-02-02 Thread Bill Tschumy
when different tags are encountered. On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: Three HTML parsers(Lucene web application demo,CyberNeko HTML Parser,JTidy) are mentioned in Lucene FAQ 1.3.27.Which is the best?Can it filter tags that are auto-created by MS-word 'Save As HTML files'

RE: which HTML parser is better?

2005-02-02 Thread Kauler, Leto S
- > > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > > Sent: Tuesday, February 01, 2005 1:15 AM > > > To: lucene-user@jakarta.apache.org > > > Subject: which HTML parser is better? > > > > > > Three HTML parsers(Lucene web

Re: which HTML parser is better?

2005-02-02 Thread Luke Shannon
" Sent: Wednesday, February 02, 2005 1:23 PM Subject: Re: which HTML parser is better? > Karl Koch wrote: > > >I am in control of the html, which means it is well formated HTML. I use > >only HTML files which I have transformed from XML. No external HTML (e.g. > >th

Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
> simple mapping of HTML files into Lucene Documents; it does not give you

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
ke it. It has been robust for me so far. Chuck -Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better?

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
> document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
AIL PROTECTED] > Sent: Tuesday, February 01, 2005 1:15 AM > To: lucene-user@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can

Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
and > >>error detection/correction. > >> > >>I use CyberNeko for a range of operations on HTML documents that go > beyond > >>indexing them in Lucene, and really like it. It has been robust for me > so > >>far. > >> > >>Chuck >

Re: which HTML parser is better?

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of sourcecode (which is preferably very short and simple (KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2 would be enough...also no frames, CS

Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
ser@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-creat

RE: which HTML parser is better?

2005-02-02 Thread Karl Koch
has been robust for me so > far. > > Chuck > > > -Original Message- > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, February 01, 2005 1:15 AM > > To: lucene-user@jakarta.apache.org > > Subject: which HTML parser is bet

RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
ser@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-creat

Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
wrote: >Three HTML parsers(Lucene web application >demo,CyberNeko HTML Parser,JTidy) are mentioned in >Lucene FAQ >1.3.27.Which is the best?Can it filter tags that are >auto-created by MS-word 'Save As HTML files' function? > >

Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote: >Three HTML parsers(Lucene web application >demo,CyberNeko HTML Parser,JTidy) are mentioned in >Lucene FAQ >1.3.27.Which is the best?Can it filter tags that are >auto-created by MS-word 'Save As HTML files' function? > > maybe yo

which HTML parser is better?

2005-02-01 Thread Jingkang Zhang
Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML files' function?

Re: demo HTML parser question

2004-09-23 Thread roy-lucene-user
On Thu, 23 Sep 2004 10:53:26 -0700, Doug Cutting wrote > [EMAIL PROTECTED] wrote: > > We were originally attempting to use the demo html parser (Lucene 1.2), but as > > you know, it's for a demo. I think it's threaded to optimize on time, to allow > > the calling thread to

Re: demo HTML parser question

2004-09-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote: We were originally attempting to use the demo html parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire html document

Re: demo HTML parser question

2004-09-23 Thread roy-lucene-user
Hi Fred, We were originally attempting to use the demo html parser (Lucene 1.2), but as you know, it's for a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire html document. That's just a

demo HTML parser question

2004-09-22 Thread Fred Toth
Hi, I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded, and, more importantly, how to exit gracefully on errors. I've discovered if I throw an exception in the front-end static code (main(), etc.), the

Re: Best HTML Parser !!

2003-02-26 Thread Nestel, Frank IZ/HZA-IC4
arser? Thanks Frank > -Original Message- > From: petite_abeille [mailto:[EMAIL PROTECTED] > Sent: Tuesday, 25 February 2003 19:49 > To: Lucene Users List > Subject: Re: Best HTML Parser !! > > > > On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Luk

Re: Best HTML Parser !!

2003-02-25 Thread petite_abeille
On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote: I have some good experiences with JTidy. It works like DOM-XML parser and cleans HTML it by the way. I use jtidy also. Both for parsing and clean-up. Works pretty nicely. This is VERY useful, because EVERY HTML have at least

Re: Best HTML Parser !!

2003-02-25 Thread Lukas Zapletal
Pierre Lacchini wrote: Hello, i'm trying to index html file with Lucene. Do u know what's the best HTML Parser in Java ? The most Powerful ? I need to extract meta-tag, and many other differents text fields... Thx for ur help ;) I have some good experiences with JTidy. It work

AW: Best HTML Parser !!

2003-02-24 Thread Borkenhagen, Michael (ofd-ko zdfin)
I prefer JTidy http://lempinen.net/sami/jtidy/. Michael -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Monday, 24 February 2003 15:03 To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Best HTML Parser !! It's not possible to generalize like

Re: Best HTML Parser !!

2003-02-24 Thread Otis Gospodnetic
It's not possible to generalize like that. I like NekoHTML. Otis --- Pierre Lacchini <[EMAIL PROTECTED]> wrote: > Hello, > > i'm trying to index html file with Lucene. > Do u know what's the best HTML Parser in Java ? > The most Powerful ? > I n

Best HTML Parser !!

2003-02-24 Thread Pierre Lacchini
Hello, I'm trying to index HTML files with Lucene. Do you know what's the best HTML parser in Java? The most powerful? I need to extract meta tags and many other different text fields... Thanks for your help ;)

Demo provided HTML parser bug (was RE: Newbie quizzes further...)

2002-09-06 Thread Stone, Timothy
List Fellows: Lacking any knowledge of JavaCC, I solicited help in hacking the HTMLParser.jj included in the demo. I retreat from this solicitation, for two reasons: 1) I'm using other ideas gleaned from the list archives, 2) I'm not prepared to dive into the world of compiler compilers. The mere so

Re: problems with HTML Parser

2002-08-14 Thread Keith Gunn
ED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Wednesday, August 14, 2002 9:46 AM > Subject: problems with HTML Parser > > > > Has anyone noticed that the HTML Parser that comes with > > Lucene joins terms together when parsing a file. >

Re: problems with HTML Parser

2002-08-14 Thread Ben Litchfield
Maurits, You can get a PDF parser from http://www.pdfbox.org -Ben On Wed, 14 Aug 2002, Maurits van Wijland wrote: > Keith, > > I haven't noticed the problem with the Parser...but you trigger me > by saying that you have a PDFParser!!! > > Are you able to contribute this PDFParser?? > > Maurit

Re: problems with HTML Parser

2002-08-14 Thread Maurits van Wijland
[EMAIL PROTECTED]> Sent: Wednesday, August 14, 2002 9:46 AM Subject: problems with HTML Parser > Has anyone noticed that the HTML Parser that comes with > Lucene joins terms together when parsing a file. > I used to think it was my PDFParser but after fixing that > I found o

problems with HTML Parser

2002-08-14 Thread Keith Gunn
Has anyone noticed that the HTML Parser that comes with Lucene joins terms together when parsing a file? I used to think it was my PDFParser, but after fixing that I found out it was the HTMLParser. I managed to find a replacement parser that doesn't join terms. Just wondered if anyone had

Re: HTML parser

2002-04-20 Thread [EMAIL PROTECTED]
e following...here's a > >good example of link extraction. > > Try http://www.quiotix.com/opensource/html-parser > > It's easy to write a Visitor which extracts the links; should take about ten > lines of code. > > > > -- > Brian Goetz >
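
As a rough illustration of link extraction in about that many lines, the sketch below collects href values from anchor tags. Note the swap: it uses the JDK's Swing parser callback (mentioned elsewhere in this thread), not the Quiotix parser's Visitor API, which is not reproduced here. The class name LinkExtractor is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: gathers the href attribute of every <a> start tag.
class LinkExtractor extends HTMLEditorKit.ParserCallback {

    private final List<String> links = new ArrayList<>();

    // Called by the parser for each opening tag that has a matching end tag.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Parses the HTML string and returns the links in document order.
    static List<String> extractLinks(String html) {
        LinkExtractor callback = new LinkExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback.links;
    }

    public static void main(String[] args) {
        System.out.println(extractLinks(
            "<html><body><a href=\"http://example.org/a\">A</a></body></html>"));
    }
}
```

Anchors without an href (for example named anchors) are skipped. Relative URLs are returned as written, so resolving them against the page URL is left to the caller.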

Re: HTML parser

2002-04-19 Thread Brian Goetz
>While trying to research the same thing, I found the following...here's a >good example of link extraction. Try http://www.quiotix.com/opensource/html-parser Its easy to write a Visitor which extracts the links; should take about ten lines of code. -- Brian Goetz Quiotix

Re: HTML parser

2002-04-19 Thread Erik Hatcher
though. Erik - Original Message - From: "David Black" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, April 19, 2002 5:26 PM Subject: Re: HTML parser > While trying to research the same thing, I found the followin

Re: HTML parser

2002-04-19 Thread David Black
index.. > > or does one write a file capture class to seek out the url store the > file in > a directory, then index the local directory.. > > Ian > > > -Original Message- > From: Terence Parr [mailto:[EMAIL PROTECTED]] > Sent: Friday, April 19, 2002 1:38

RE: HTML parser

2002-04-19 Thread Otis Gospodnetic
sage- > From: Terence Parr [mailto:[EMAIL PROTECTED]] > Sent: Friday, April 19, 2002 1:38 AM > To: Lucene Users List > Subject: Re: HTML parser > > > > On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: > > :snip > > Hi Otis, > > I have an

RE: HTML parser

2002-04-19 Thread Ian Forsyth
al Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty st

Re: HTML parser

2002-04-19 Thread Terence Parr
> > In a future I may need something that has the ability to extract HREFs, > but I'll stick to one of the XP principles and just look for something > that meets current needs :) > > I looked for ANTLR-based HTML parser a few days ago, but must have > missed the one yo

RE: HTML parser

2002-04-19 Thread Mark Ayad
You can use the Swing HTML parser to do this, but it's only a 3.2 DTD based parser. I have written (attached) a total hack job for breaking up an HTML page into its component parts; the code gives you an idea ... If anyone wants to know how to use the Swing based parser, I can add some code.

Re: HTML parser

2002-04-18 Thread Otis Gospodnetic
doing that I, like you, need to be able to handle poorly formatted web pages. In a future I may need something that has the ability to extract HREFs, but I'll stick to one of the XP principles and just look for something that meets current needs :) I looked for ANTLR-based HTML parser a few day

Re: HTML parser

2002-04-18 Thread Terence Parr
On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: > Hello, > > I need to select an HTML parser for the application that I'm writing > and I'm not sure what to choose. > The HTML parser included with Lucene looks flimsy, JTidy looks like a > hack an

HTML parser

2002-04-18 Thread Otis Gospodnetic
Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I hav

RE: HTML Parser

2002-04-09 Thread Mark Ayad
Neal, that's because HTMLParser.jj is NOT a Java file; it contains the grammar for JavaCC. Have a look at http://www.quiotix.com/downloads/html-parser/ Regards Mark -Original Message- From: Neal Weinstein [mailto:[EMAIL PROTECTED]] Sent: 09 April 2002 16:21 To: '[

HTML Parser

2002-04-09 Thread Neal Weinstein
Hi, I am working with the lucene demo and would like to compile the demo so that I may eventually modify it for my own use. I am using the source from lucene-demos-1.2-rc4.jar.zip. However, the HTMLParser class had the filename HTMLParser.jj and won't compile. I changed the name to HTMLParser.ja

RE: HTML Parser

2001-12-18 Thread Karl Øie
Parser Hi, How should I integrate the HTML Parser (which is in the demo directory) in a new project? In particular with the HTMLParser.jj file. Do I need to compile it before trying to use it in my code? Any help would be appreciated! Thanks. - Christophe

HTML Parser

2001-12-17 Thread Christophe GOGUYER DESSAGNES
Hi, How should I integrate the HTML Parser (which is in the demo directory) in a new project? In particular with the HTMLParser.jj file. Do I need to compile it before trying to use it in my code? Any help would be appreciated! Thanks. - Christophe