Re: HTML parser

2002-04-20 Thread [EMAIL PROTECTED]
Hi all, I'm very interested in this thread. I also have to solve the problem of spidering web sites, creating an index (well, here there is the BIG problem that Lucene can't be integrated easily with a DB), extracting links from the page, and repeating the whole process. For extracting l

Re: HTML parser

2002-04-19 Thread Brian Goetz
>While trying to research the same thing, I found the following...here's a >good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; it should take about ten lines of code. -- Brian Goetz Quiotix Corporation [

Re: HTML parser

2002-04-19 Thread Erik Hatcher
though. Erik - Original Message - From: "David Black" <[EMAIL PROTECTED]> To: "Lucene Users List" <[EMAIL PROTECTED]> Sent: Friday, April 19, 2002 5:26 PM Subject: Re: HTML parser > While trying to research the same thing, I found the followin

Re: HTML parser

2002-04-19 Thread David Black
index.. > > or does one write a file capture class to seek out the url store the > file in > a directory, then index the local directory.. > > Ian > > > -Original Message- > From: Terence Parr [mailto:[EMAIL PROTECTED]] > Sent: Friday, April 19, 2002 1:38

RE: HTML parser

2002-04-19 Thread Otis Gospodnetic
sage- > From: Terence Parr [mailto:[EMAIL PROTECTED]] > Sent: Friday, April 19, 2002 1:38 AM > To: Lucene Users List > Subject: Re: HTML parser > > > > On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: > > :snip > > Hi Otis, > > I have an

RE: HTML parser

2002-04-19 Thread Ian Forsyth
al Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty st

Re: HTML parser

2002-04-19 Thread Terence Parr
Hi Otis, The idea behind stripHTML is pretty simple. It's just a hand-built lexer that looks like this: while more char if comment start, scarf til end comment if char is < then if SCRIPT tag scarf til end SCRIPT; [same with A, STYLE, HEA
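
The loop described in this message can be sketched roughly as below. This is not Terence Parr's stripHTML code, only a minimal illustration of the same hand-built-lexer idea. The class and method names (`StripHtmlSketch`, `strip`) are hypothetical, and only the comment, SCRIPT, and STYLE cases are handled, because the rest of the tag list is cut off in the archived snippet.

```java
public class StripHtmlSketch {

    /** Returns the text of an HTML string with comments, SCRIPT/STYLE bodies, and tags removed. */
    public static String strip(String html) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        int n = html.length();
        while (i < n) {                                   // "while more char"
            if (matchesAt(html, i, "<!--")) {             // comment start: skip to end of comment
                int end = html.indexOf("-->", i + 4);
                i = (end < 0) ? n : end + 3;
            } else if (html.charAt(i) == '<') {
                if (matchesAt(html, i, "<script")) {      // SCRIPT: skip the whole element
                    i = skipElement(html, i, "script");
                } else if (matchesAt(html, i, "<style")) { // STYLE: same treatment
                    i = skipElement(html, i, "style");
                } else {                                  // any other tag: drop the tag itself
                    int end = html.indexOf('>', i);
                    i = (end < 0) ? n : end + 1;
                    out.append(' ');                      // keep words on either side apart
                }
            } else {
                out.append(html.charAt(i));               // ordinary character: keep it
                i++;
            }
        }
        // Collapse runs of whitespace left behind by removed markup.
        return out.toString().replaceAll("\\s+", " ").trim();
    }

    /** Case-insensitive test for token at position i. */
    private static boolean matchesAt(String s, int i, String token) {
        return s.regionMatches(true, i, token, 0, token.length());
    }

    /** Returns the index just past the closing tag of the named element, or the end of input. */
    private static int skipElement(String s, int from, String name) {
        String closeTag = "</" + name;
        for (int k = from; k + closeTag.length() <= s.length(); k++) {
            if (matchesAt(s, k, closeTag)) {
                int gt = s.indexOf('>', k);
                return (gt < 0) ? s.length() : gt + 1;
            }
        }
        return s.length();
    }
}
```

A sketch like this does not decode entities such as `&amp;` and treats any tag whose name merely begins with `script` or `style` as that element, so it is a starting point for feeding body text to an indexer and not a full parser.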

RE: HTML parser

2002-04-19 Thread Mark Ayad
You can use the Swing HTML parser to do this, but it's only an HTML 3.2 DTD-based parser. I have written (attached) a total hack job for breaking up an HTML page into its component parts; the code gives you an idea ... If anyone wants to know how to use the Swing-based parser, I can add some code. Mark
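
For readers wondering what using the Swing parser for link extraction might look like, here is a minimal sketch. It is not the attachment from this message; the class and method names (`SwingLinkExtractor`, `extractLinks`) are hypothetical, while `ParserDelegator` and `HTMLEditorKit.ParserCallback` are the standard JDK classes in `javax.swing.text.html`.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingLinkExtractor {

    /** Collects the href value of every A tag in the given HTML string. */
    public static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser reports each start tag to this callback.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // 'true' tells the parser to ignore any charset declared in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // A StringReader should not fail, but parse() declares IOException.
            throw new UncheckedIOException(e);
        }
        return links;
    }
}
```

The returned values are the raw `href` strings, so a spider would still need to resolve relative links against the page URL before fetching them.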

Re: HTML parser

2002-04-18 Thread Otis Gospodnetic
Hello Terence, Ah, you got me. I guess I need a bit of both. I need to just strip HTML and get raw body text so that I can stick it in Lucene's index. I would also like something that can extract at least the ... stuff, so that I can stick that in a separate field in the Lucene index. While doing th

Re: HTML parser

2002-04-18 Thread Terence Parr
On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: > Hello, > > I need to select an HTML parser for the application that I'm writing > and I'm not sure what to choose. > The HTML parser included with Lucene looks flimsy, JTidy looks like a > hack and an overkill, using classes wr

RE: HTML Parser

2002-04-09 Thread Mark Ayad
Neal, that's because HTMLParser.jj is NOT a Java file; it contains the grammar for JavaCC. Have a look at http://www.quiotix.com/downloads/html-parser/ Regards, Mark -Original Message- From: Neal Weinstein [mailto:[EMAIL PROTECTED]] Sent: 09 April 2002 16:21 To: '[EMAIL PROTECTE

RE: HTML Parser

2001-12-18 Thread Karl Øie
*.jj files are compiled with JavaCC. There is a javacc.zip file in your lib directory, but you should download the full compiler set. Regards, Karl Øie -Original Message- From: Christophe GOGUYER DESSAGNES [mailto:[EMAIL PROTECTED]] Sent: 17 December 2001 17:32 To: [EMAIL PROTECTED] Subject: HTML