While researching the same thing, I found a good example of link 
extraction:

http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

It seems like I could also use this to get the text out from between the 
tags, but I haven't been able to do it yet.  It seems like it should be 
simple, but geez... my head hurts.
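
For reference, here is a rough sketch of how both the links and the text 
between the tags might be collected in one pass. It assumes the JDK's 
Swing HTML parser (javax.swing.text.html), which I believe is what the 
TechTip builds on; the class name LinkAndTextExtractor and its extract() 
helper are made up for illustration and are not part of Lucene or the 
TechTip.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects href values and plain text from an HTML string.
class LinkAndTextExtractor extends HTMLEditorKit.ParserCallback {
    final List<String> links = new ArrayList<>();
    final StringBuilder text = new StringBuilder();

    // Called for each opening tag; record the href of every <a> element.
    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            Object href = attrs.getAttribute(HTML.Attribute.HREF);
            if (href != null) {
                links.add(href.toString());
            }
        }
    }

    // Called for each run of character data between tags.
    @Override
    public void handleText(char[] data, int pos) {
        if (text.length() > 0) {
            text.append(' ');   // keep separate runs from fusing together
        }
        text.append(data);
    }

    // Parse the given HTML and return the populated callback.
    static LinkAndTextExtractor extract(String html) {
        LinkAndTextExtractor callback = new LinkAndTextExtractor();
        try (Reader reader = new StringReader(html)) {
            // 'true' tells the parser to ignore any charset declared in the page.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return callback;
    }
}
```

The collected text could then be handed to whatever builds the index 
document, and the links could be queued for fetching. The parser is 
fairly forgiving of sloppy HTML, but exactly how it splits text runs and 
handles whitespace is something I would check against real pages.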






On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:

>
> Are there core classes in Lucene that allow one to feed Lucene links,
> so that 'it' will capture the contents of those URLs into the index?
>
> Or does one write a file capture class to seek out the URL, store the
> file in a directory, and then index the local directory?
>
> Ian
>
>
> -----Original Message-----
> From: Terence Parr [mailto:[EMAIL PROTECTED]]
> Sent: Friday, April 19, 2002 1:38 AM
> To: Lucene Users List
> Subject: Re: HTML parser
>
>
>
> On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
>
> :snip
>
> Hi Otis,
>
> I have an HTML parser built for ANTLR, but it's pretty strict in what it
> accepts.  Not sure how useful it will be for you, but here it is:
>
> http://www.antlr.org/grammars/HTML
>
> I am not sure what your goal is, but I personally have to scarf all
> sorts of HTML from various websites to suck them into the jGuru search
> engine.  I use a simple stripHTML() method I wrote to handle it.  Works
> great.  Kills everything but the text.  Is that the kind of thing you
> are looking for, or do you really want to parse, not filter?
>
> Terence
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
>
>
> --
> To unsubscribe, e-mail:
> <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail:
> <mailto:[EMAIL PROTECTED]>


