Paul,

You don't have to reimplement all of HtmlParser; just write an
HtmlParseFilter, which is much simpler. Otherwise you can of course
modify HtmlParser directly so that it does what you need.
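
A rough sketch of the core of such a filter, using only the JDK so it
runs without Nutch on the classpath. The class and method names
(MetaTagCollector, collect, parseFragment) are made up for illustration
and are not Nutch API. In a real plugin, as far as I recall, the walk
would sit inside the filter method of your HtmlParseFilter
implementation, which is handed the parsed DOM as a DocumentFragment,
and the Map would be replaced by the Metadata of the parse.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, NOT part of Nutch: collects <meta name=... content=...>
// pairs from a DOM tree. The Map stands in for the parse Metadata.
final class MetaTagCollector {

    // Returns every named meta tag under root, keyed by lower-cased name.
    static Map<String, String> collect(Node root) {
        Map<String, String> found = new LinkedHashMap<>();
        walk(root, found);
        return found;
    }

    // Depth-first walk over the tree, recording each usable meta element.
    private static void walk(Node node, Map<String, String> found) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            Element meta = (Element) node;
            // getAttribute returns "" when the attribute is absent.
            String name = meta.getAttribute("name");
            String content = meta.getAttribute("content");
            if (!name.isEmpty() && !content.isEmpty()) {
                found.put(name.toLowerCase(Locale.ROOT), content);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, found);
        }
    }

    // Demo-only helper: turns well-formed markup into a DocumentFragment,
    // the same DOM type a parse filter would be given.
    static DocumentFragment parseFragment(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment fragment = doc.createDocumentFragment();
            fragment.appendChild(doc.getDocumentElement());
            return fragment;
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed: " + xhtml, e);
        }
    }
}
```

Calling MetaTagCollector.collect(fragment) gives you the name/content
pairs; the filter would then copy each pair into the Metadata so that a
custom indexer can pick them up later.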

J.


2009/8/21 Paul Tomblin <[email protected]>

> On Fri, Aug 21, 2009 at 4:20 AM, Julien
> Nioche<[email protected]> wrote:
> > You'll need to write a custom parser implementing HtmlParseFilter and get it
> > to store the keywords found in the Metadata, then write a custom Indexer.
> >
> > By default the HTML parser does not do anything about meta tags.
>
> That's unfortunate, because org.apache.nutch.parse.html.HtmlParser
> actually extracts all the meta tags, and then takes a few and throws
> the rest away.  It's mildly annoying that I'm going to have to
> re-implement all of HtmlParser just to add two lines to take that data
> out of "metaTags" and put it in "content.getMetaData()".
>
> --
> http://www.linkedin.com/in/paultomblin
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com
