HttpUnit (which uses JTidy under the covers) makes child's play out of
pulling out links and navigating to them.

The only caveat (and this would be true for practically all tools, I
suspect) is that the HTML has to be relatively well-formed for it to work
well.  JTidy can be somewhat forgiving though.
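
For anyone who wants to try link extraction without adding a library, here
is a minimal sketch using only the HTML parser that ships with the JDK
(javax.swing.text.html), which is the general approach the Tech Tip quoted
below takes. This is not HttpUnit's API. The class name LinkSketch and the
methods extractLinks and extractText are made up for illustration, and the
same caveat about reasonably well-formed HTML applies.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of Lucene or HttpUnit.
public class LinkSketch {

    // Collects the href value of every <a> start tag, in document order.
    public static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    // Collects the text found between tags, with chunks separated by a space.
    public static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString().trim();
    }
}
```

The text returned by extractText could then be handed to Lucene as the
contents of a document field, and the links from extractLinks could feed
whatever code fetches the next pages.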

    Erik

----- Original Message -----
From: "David Black" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Friday, April 19, 2002 5:26 PM
Subject: Re: HTML parser


> While trying to research the same thing, I found the following...here's
> a good example of link extraction.....
>
> http://developer.java.sun.com/developer/TechTips/1999/tt0923.html
>
> It seems like I could use this to also get the text out from between the
> tags but haven't been able to do it yet.  It seems like it should be
> simple but geez...my head hurts.
>
> On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:
>
> >
> > Are there core classes in Lucene that allow one to feed Lucene links,
> > and 'it' will capture the contents of those URLs into the index?
> >
> > Or does one write a file-capture class to seek out the URL, store the
> > file in a directory, then index the local directory?
> >
> > Ian
> >
> >
> > -----Original Message-----
> > From: Terence Parr [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, April 19, 2002 1:38 AM
> > To: Lucene Users List
> > Subject: Re: HTML parser
> >
> >
> >
> > On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
> >
> > :snip
> >
> > Hi Otis,
> >
> > I have an HTML parser built for ANTLR, but it's pretty strict in what it
> > accepts.  Not sure how useful it will be for you, but here it is:
> >
> > http://www.antlr.org/grammars/HTML
> >
> > I am not sure what your goal is, but I personally have to scarf all
> > sorts of HTML from various websites to suck them into the jGuru search
> > engine.  I use a simple stripHTML() method I wrote to handle it.  Works
> > great.  Kills everything but the text.  Is that the kind of thing you
> > are looking for, or do you really want to parse, not filter?
> >
> > Terence
> > --
> > Co-founder, http://www.jguru.com
> > Creator, ANTLR Parser Generator: http://www.antlr.org
> >
> >
> > --
> > To unsubscribe, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> > For additional commands, e-mail:
> > <mailto:[EMAIL PROTECTED]>
> >

