Hi Nima,

Thanks for reminding me about this JIRA issue; it hasn't been commented on for some time and I'd forgotten about it. Judging by the discussion on NUTCH-978 <https://issues.apache.org/jira/browse/NUTCH-978>, things got stuck when Emmanuel tried to get in touch with Emir (who in the meantime seems to have stopped using Nutch; see http://www.atlantbh.com/book-review-web-crawling-and-data-mining-with-apache-nutch/ ).
It would indeed be a good thing to get in touch with him; alternatively, Albin's plugin could be a good starting point. There is clearly a need for such functionality, and quite a few people are keen to make it happen.

Thanks

Julien

On 25 September 2014 18:19, Nima Falaki <nfal...@popsugar.com> wrote:

> And the reason why I think this is because of this ticket (look at the
> conversation at the bottom between Emmanuel and Lewis John)
>
> https://issues.apache.org/jira/browse/NUTCH-978
>
> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nfal...@popsugar.com> wrote:
>
>> Hi Julien:
>>
>> I was under the impression that the nutch community was going to use a
>> generic xls parser? This one.
>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/ Is
>> the nutch community going to use this?
>>
>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
>>
>>> Hi Albin,
>>>
>>> You don't have to have a separate plugin for each html structure you
>>> want to parse. You can have a single plugin with multiple HTMLParseFilters.
>>>
>>> Having a generic extractor with the extraction logic configured in an
>>> external file is definitely a good idea and would make a great contribution
>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>> definitely needs inventing ;-)
>>>
>>> Best
>>>
>>> Julien
>>>
>>> On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:
>>>
>>>> Hello everybody,
>>>>
>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>> an existing nutch plugin.
>>>>
>>>> Let's take an example.
>>>> I want to extract some metadata from "div" or "td" tags from html
>>>> pages that have specific ids and name them the way I like (this is
>>>> done at parser time).
>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>> to add my custom metadata.
>>>>
>>>> Currently, from what I've seen on the wiki and by quickly analyzing
>>>> plugins, I suppose I have to code my own plugin each time I've got a
>>>> new site (with a new html structure). I've already done that by using
>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>> little bit boring :)
>>>>
>>>> So on my side I've coded a little plugin that enables me to specify
>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>> just wondering if I did not miss something.
>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>> reinvent the wheel or miss something.
>>>>
>>>> Albin
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>
>> --
>> Nima Falaki
>> Software Engineer
>> nfal...@popsugar.com
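
The approach discussed in this thread (XPath expressions kept in an external XML file, each mapped to a metadata name of your choosing) can be sketched roughly as below. This is only an illustration in standalone Python using the standard library, not Albin's plugin and not the Nutch API: the rules format, the field names and the helpers `load_rules` and `extract_metadata` are all hypothetical, and the sample page is assumed to be well-formed XML (real-world HTML would need a tolerant parser first).

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: one <field> per metadata value to extract.
# "name" is the metadata key, "xpath" locates the element by tag and id.
RULES_XML = """
<rules>
  <field name="product_title" xpath=".//div[@id='title']"/>
  <field name="product_price" xpath=".//td[@id='price']"/>
</rules>
"""

# Sample page, assumed to be well-formed so ElementTree can parse it.
PAGE = """
<html><body>
  <div id="title">Blue Widget</div>
  <table><tr><td id="price">19.99</td></tr></table>
</body></html>
"""


def load_rules(rules_xml):
    """Read the rules document into a {metadata name: xpath} mapping."""
    root = ET.fromstring(rules_xml)
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract_metadata(page_xml, rules):
    """Apply every configured XPath and collect the text of the first match."""
    doc = ET.fromstring(page_xml)
    metadata = {}
    for name, xpath in rules.items():
        node = doc.find(xpath)
        if node is not None:  # fields with no match are simply skipped
            metadata[name] = "".join(node.itertext()).strip()
    return metadata


if __name__ == "__main__":
    print(extract_metadata(PAGE, load_rules(RULES_XML)))
```

Supporting a new site layout then only means adding `<field>` entries to the rules file rather than writing another parse filter, which is the appeal of the generic extractor Julien describes.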