Such classes are not included with Lucene. This was _just_ mentioned on this list earlier today. Look at the archives and search for crawler, URL, lucene sandbox, etc.
Otis

--- Ian Forsyth <[EMAIL PROTECTED]> wrote:
>
> Are there core classes part of lucene that allow one to feed lucene links,
> and 'it' will capture the contents of those urls into the index..
>
> or does one write a file capture class to seek out the url store the file in
> a directory, then index the local directory..
>
> Ian
>
>
> -----Original Message-----
> From: Terence Parr [mailto:[EMAIL PROTECTED]]
> Sent: Friday, April 19, 2002 1:38 AM
> To: Lucene Users List
> Subject: Re: HTML parser
>
>
> On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote:
>
> :snip
>
> Hi Otis,
>
> I have an HTML parser built for ANTLR, but it's pretty strict in what it
> accepts. Not sure how useful it will be for you, but here it is:
>
> http://www.antlr.org/grammars/HTML
>
> I am not sure what your goal is, but I personally have to scarf all
> sorts of HTML from various websites to suck them into the jGuru search
> engine. I use a simple stripHTML() method I wrote to handle it. Works
> great. Kills everything but the text. Is that the kind of thing you
> are looking for or do you really want to parse not filter?
>
> Terence
> --
> Co-founder, http://www.jguru.com
> Creator, ANTLR Parser Generator: http://www.antlr.org
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
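
For anyone finding this thread in the archives: since Lucene itself does not fetch URLs, the "capture the page, filter out the markup, then index the text" route discussed above could look roughly like the sketch below. It is only an illustration under stated assumptions: the class name PageText and both method names are made up for this example, the regex filter is a crude stand-in in the spirit of the stripHTML() idea (it is not Terence's code), pages are assumed to be UTF-8, and the Lucene indexing call is deliberately left as a comment. It needs Java 9 or later for readAllBytes().

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class PageText {

    /** Crude filter, not a parser: removes script/style blocks, comments and tags. */
    static String stripHtml(String html) {
        // Drop <script>...</script> and <style>...</style> including their bodies.
        String s = html.replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ");
        // Drop HTML comments, then every remaining tag.
        s = s.replaceAll("(?s)<!--.*?-->", " ");
        s = s.replaceAll("(?s)<[^>]*>", " ");
        // Decode a handful of common entities; &amp; goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&amp;", "&");
        // Collapse runs of whitespace left behind by the removed markup.
        return s.replaceAll("\\s+", " ").trim();
    }

    /** Downloads one page as a string (assumes UTF-8, no redirects or error handling). */
    static String fetch(String address) throws IOException {
        try (InputStream in = URI.create(address).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String address : args) {
            String text = stripHtml(fetch(address));
            // Hand `text` (and the URL) to your own Lucene indexing code here.
            System.out.println(address + ": " + text.length() + " characters of text");
        }
    }
}
```

Run it with one or more URLs as arguments. A regex filter like this will stumble on malformed markup and on ">" inside attribute values, so a real parser is the safer choice if you need structure rather than just the text.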