Hello,

I'm working on the development of a multi-agent software system that involves information indexing, retrieval, and categorization tasks. I want to build the training set for categorization from a set of HTML pages fetched from the DMOZ RDF dumps. I have tried the HtmlParser that comes with Nutch, but I wasn't able to make it work without adjusting Nutch's global XML configuration; is that perhaps the only way to make the plugin work? Does Lucene provide any good HTML parser in its contrib section for parsing web pages found in the wild?

Best regards,
Giovanni Novelli

P.S.: This is a crosspost, as I'm relying on both Lucene and Nutch.