Text extraction from HTML

Giovanni Novelli Fri, 29 Jul 2005 00:17:49 -0700

Hello,
I'm working on the development of multi-agent software that
involves information indexing, information retrieval, and
information categorization tasks. I want to build the training set for
categorization from a set of HTML pages fetched from DMOZ RDF dumps.
I have tried the HtmlParser that comes with Nutch, but I wasn't able to
make it work without adjusting Nutch's global XML configuration;
is that the only way to make this plugin work? Does Lucene provide
a good HTML parser in its contrib section for parsing web pages found
in the wild?


Best regards,
Giovanni Novelli

P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.

