Hi Albin,

You don't need a separate plugin for each HTML structure you want
to parse. A single plugin can contain multiple HtmlParseFilter implementations.

Having a generic extractor with the extraction logic configured in an
external file is definitely a good idea and would make a great contribution
to the project. In a nutshell, you haven't missed anything and that wheel
definitely needs inventing ;-)
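
To make the idea concrete, here is a rough sketch of the core of such a
generic extractor, using only the JDK so it runs outside Nutch. Everything in
it is an assumption for illustration: the class name XPathMetadataSketch, the
"field=xpath" properties format (Albin's plugin uses an XML file instead) and
the sample page are all made up. Inside a real HtmlParseFilter the same loop
would run against the DOM that Nutch hands to the filter, and element-name
case or namespaces may differ depending on the HTML parser in use.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: not part of Nutch, names are illustrative only.
public class XPathMetadataSketch {

    /** Evaluates each configured XPath against the document and returns field name -> text. */
    static Map<String, String> extract(Document doc, Properties rules) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new TreeMap<>();
        for (String field : rules.stringPropertyNames()) {
            // evaluate(String, Object) returns the string value of the first matching node
            String value = xpath.evaluate(rules.getProperty(field), doc).trim();
            if (!value.isEmpty()) {
                metadata.put(field, value);
            }
        }
        return metadata;
    }

    /** Parses well-formed markup into a DOM; stands in for the DOM a parser would provide. */
    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed external configuration: one "metadata name = XPath" rule per line.
        Properties rules = new Properties();
        rules.load(new StringReader(
                "author=//div[@id='author']\n" +
                "price=//td[@id='price']\n"));

        Document doc = parse(
                "<html><body><div id='author'>Jane</div>" +
                "<table><tr><td id='price'>42</td></tr></table></body></html>");

        System.out.println(extract(doc, rules));
    }
}
```

Supporting a new site then only means adding rules to the configuration, and
the extracted names are the ones index-metadata would be configured to index.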



On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:

> Hello everybody,
> I'm just wondering if it is possible to fetch specific metadata with
> an existing nutch plugin.
> Let's take an example.
> I want to extract some metadata from "div" or "td" tags from html
> pages that have specific ids and name them the way I like (this is
> done at parser time).
> Then, at indexer time, I would use index-metadata (a very good plugin)
> to add my custom metadata.
> Currently from what I've seen on the wiki and by quickly analyzing
> plugins I suppose I have to code my own plugin each time I've got a
> new site (with a new html structure). I've already done that by using
> a node walker in a custom htmlParseFilter but the extraction can be a
> little bit boring :)
> So on my side I've coded a little plugin that lets me specify
> XPaths in an XML file. But before adding more functionality I'm
> just wondering whether I have missed something.
> This work allowed me to explore some aspects of Nutch, but I don't want to
> reinvent the wheel or miss something.
> Albin


Open Source Solutions for Text Engineering

