Paul,

You don't have to reimplement all of HtmlParser; just write an HtmlParseFilter, which is much simpler. Otherwise you can of course modify HtmlParser directly so that it does what you need.
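To give an idea of how little logic such a filter needs, here is a rough, self-contained sketch of the "copy the meta tags into the metadata" step. It deliberately does not use the Nutch classes: the class MetaTagCopySketch, the method copyMetaTags, and the plain Map types standing in for the meta tags and the parse metadata are all hypothetical, and the real HtmlParseFilter method signature depends on your Nutch version, so check it against the source before adapting this.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT the Nutch API: plain Maps play the role of the
// meta tags handed to the filter and of the metadata stored with the parse.
class MetaTagCopySketch {

    /**
     * Copies the wanted meta tags into the parse metadata.
     * Missing or blank tags are skipped. Returns the number of tags copied.
     */
    static int copyMetaTags(Map<String, String> metaTags,
                            Map<String, String> parseMetadata,
                            List<String> wantedNames) {
        int copied = 0;
        for (String name : wantedNames) {
            String value = metaTags.get(name);
            if (value != null && !value.trim().isEmpty()) {
                // Store the trimmed value under the same name.
                parseMetadata.put(name, value.trim());
                copied++;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        // Made-up meta tags, as a parser might have extracted them from a page.
        Map<String, String> metaTags = new LinkedHashMap<>();
        metaTags.put("keywords", "nutch, crawler, lucene");
        metaTags.put("description", " A sample page ");
        metaTags.put("generator", "SomeCMS");

        Map<String, String> parseMetadata = new LinkedHashMap<>();
        int copied = copyMetaTags(metaTags, parseMetadata,
                Arrays.asList("keywords", "description"));

        System.out.println(copied + " tags copied: " + parseMetadata);
    }
}
```

In a real plugin this loop would sit inside the filter method of your HtmlParseFilter implementation, and a custom indexer would then read the stored values back out of the metadata.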
J.

2009/8/21 Paul Tomblin <[email protected]>

> On Fri, Aug 21, 2009 at 4:20 AM, Julien Nioche<[email protected]> wrote:
> > You'll need to write a custom parser implementing HtmlParseFilter and get it
> > to store the keywords found in the Metadata, then write a custom Indexer.
> >
> > By default the HTML parser does not do anything about meta tags.
>
> That's unfortunate, because org.apache.nutch.parse.html.HtmlParser
> actually extracts all the meta tags, and then takes a few and throws
> the rest away. It's mildly annoying that I'm going to have to
> re-implement all of HtmlParser just to add two lines to take that data
> out of "metaTags" and put it in "content.getMetaData()".
>
> --
> http://www.linkedin.com/in/paultomblin
>

--
DigitalPebble Ltd
http://www.digitalpebble.com
