Re: Text extraction from HTML

Patrick Kimber Fri, 29 Jul 2005 01:14:33 -0700

Hi Giovanni,
We are using the Neko HTML parser.  Some simple example code can be
found in the "Lucene in Action" book.


For more information:
http://www.manning.com/books/hatcher2
http://www.apache.org/~andyc/neko/doc/html/
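
To give a rough idea of what this kind of text extraction looks like, here
is a minimal sketch. Note the assumptions: it is not the code from the book,
and it deliberately uses the HTML parser bundled with the JDK
(javax.swing.text.html) instead of NekoHTML, so that it runs without any
extra jars. The class name HtmlTextExtractor and the method extractText are
made up for this illustration. The overall idea (parse the page, keep only
the text, hand the result to your indexer or categorizer) should carry over
to NekoHTML, but the API calls will differ.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration only; not part of Lucene, Nutch or NekoHTML.
final class HtmlTextExtractor {

    // Parses the HTML from the reader and returns the text content it found,
    // with runs of whitespace collapsed to single spaces.
    static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The JDK parser reports each run of character data through handleText.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);

        return text.toString().replaceAll("\\s+", " ").trim();
    }

    // Convenience overload for HTML that is already held in a String.
    static String extractText(String html) {
        try {
            return extractText(new StringReader(html));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Hello</h1><p>Lucene &amp; Nutch</p></body></html>";
        System.out.println(extractText(html));
    }
}
```

The returned string can then be stored in whatever field you index or feed
to your categorizer. The JDK parser is dated, so for pages found in the wild
a more tolerant parser such as NekoHTML is likely the better choice.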

Patrick

On 29/07/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote:
> Hello,
> I'm working on the development of multi-agent software that
> involves some information indexing, information retrieval and
> information categorization tasks. I want to build the training set for
> categorization using a set of HTML pages fetched from DMOZ RDF dumps.
> I have tried the HtmlParser that comes with Nutch, but I wasn't able to
> make it work without adjusting Nutch's global XML configuration;
> perhaps that is the only way to make such a plugin work? Does Lucene expose
> any good HTML parser in the contrib section to parse web pages found
> in the wild?
> 
> Best regards,
> Giovanni Novelli
> 
> P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

