Hi Giovanni,

We are using the Neko HTML parser (NekoHTML). Some simple example code can be found in the book "Lucene in Action".
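As a rough illustration of the approach (not the code from the book): NekoHTML hands back a standard W3C DOM, so pulling out indexable text comes down to walking that tree and collecting the text nodes. In the sketch below, the class name DomTextExtractor and its method names are made up for this example. To keep it runnable with only the JDK, the demo parses a small well-formed snippet with the JDK's XML parser as a stand-in; for real-world pages the Document would come from NekoHTML instead, as noted in the comments.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks a W3C DOM and gathers the visible text.
public class DomTextExtractor {

    // Appends the text found under 'node' to 'out', skipping script and style.
    public static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // Compare case-insensitively: HTML DOMs often report upper-case names.
            String name = node.getNodeName();
            if (name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style")) {
                return;
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    public static String extractText(Node root) {
        StringBuilder out = new StringBuilder();
        collectText(root, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Demo</title><style>p{}</style></head>"
                + "<body><p>Hello <b>world</b></p><script>var x;</script></body></html>";

        // Stand-in parser (assumption for this demo): the JDK XML parser only
        // accepts well-formed markup. With NekoHTML on the classpath, the
        // Document would instead come from org.cyberneko.html.parsers.DOMParser
        // (parse(InputSource), then getDocument()), which tolerates messy HTML.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(page)));

        System.out.println(extractText(doc)); // prints "Demo Hello world"
    }
}
```

The returned string is what you would then pass to a Lucene field for indexing, or use as the raw text of a training example for categorization.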
For more information:
http://www.manning.com/books/hatcher2
http://www.apache.org/~andyc/neko/doc/html/

Patrick

On 29/07/05, Giovanni Novelli <[EMAIL PROTECTED]> wrote:
> Hello,
> I'm working on the development of multi-agent software that
> involves information indexing, information retrieval and
> information categorization tasks. I want to build the training set for
> categorization using a set of HTML pages fetched from DMOZ RDF dumps.
> I have tried the HtmlParser that comes with Nutch, but I wasn't able to
> make it work without adjusting Nutch's global XML configuration;
> perhaps that is the only way to make such a plugin work? Does Lucene
> expose any good HTML parser in the contrib section to parse web pages
> found in the wild?
>
> Best regards,
> Giovanni Novelli
>
> P.S.: This is a crosspost as I'm relying on both Lucene and Nutch.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]