Hi Albin,

You don't need a separate plugin for each HTML structure you want to parse. A single plugin can contain multiple HtmlParseFilter implementations.
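To make the idea of one plugin covering several HTML structures more concrete, here is a minimal sketch in plain Java. It deliberately uses no Nutch classes: the class name XPathMetadataSketch, the extract and parse helpers, and the rule names such as meta.author are all hypothetical and only for illustration. It assumes the rules (metadata name to XPath expression) have already been read from some external file, and that the page is available as a DOM, which is roughly what a parse filter receives from Nutch.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only: not part of Nutch and not an existing plugin.
class XPathMetadataSketch {

    /**
     * Evaluates each named XPath rule against the document and returns
     * metadata name -> trimmed text value. Rules that match nothing are skipped.
     */
    static Map<String, String> extract(Document doc, Map<String, String> rules) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            // evaluate(String, Object) returns the string value of the first match,
            // or an empty string when nothing matches.
            String value = xpath.evaluate(rule.getValue(), doc).trim();
            if (!value.isEmpty()) {
                metadata.put(rule.getKey(), value);
            }
        }
        return metadata;
    }

    /** Parses a well-formed XHTML string into a DOM document (stand-in for the crawler's parser). */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<div id=\"author\">Jane Doe</div>"
                + "<table><tr><td id=\"price\">42</td></tr></table>"
                + "</body></html>";

        // In a real setup these rules would be loaded from an external file.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("meta.author", "//div[@id='author']");
        rules.put("meta.price", "//td[@id='price']");
        rules.put("meta.missing", "//div[@id='nope']");

        // prints {meta.author=Jane Doe, meta.price=42}
        System.out.println(extract(parse(page), rules));
    }
}
```

Supporting a new site then means adding rules rather than writing a new node walker. Wiring this into a real filter and handing the values to index-metadata is left out of the sketch.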
Having a generic extractor with the extraction logic configured in an external file is definitely a good idea and would make a great contribution to the project. In a nutshell, you haven't missed anything and that wheel definitely needs inventing ;-)

Best

Julien

On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:

> Hello everybody,
>
> I'm just wondering if it is possible to fetch specific metadata with
> an existing Nutch plugin.
>
> Let's take an example.
> I want to extract some metadata from "div" or "td" tags from HTML
> pages that have specific ids and name them the way I like (this is
> done at parser time).
> Then, at indexer time, I would use index-metadata (a very good plugin)
> to add my custom metadata.
>
> Currently, from what I've seen on the wiki and by quickly analyzing
> plugins, I suppose I have to code my own plugin each time I've got a
> new site (with a new HTML structure). I've already done that by using
> a node walker in a custom HtmlParseFilter, but the extraction can be a
> little bit boring :)
>
> So on my side I've coded a little plugin that enables me to specify
> XPaths in an XML file. But before diving into more functionality I'm
> just wondering whether I have missed something.
> This work allowed me to explore some aspects of Nutch, but I don't want to
> reinvent the wheel or miss something.
>
> Albin
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble