Great work!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Albinscode <albinsc...@gmail.com>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Sunday, October 5, 2014 at 1:09 PM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: Generic xsl parser plugin

>@Chris Thank you for your suggestion too.
>
>As requested I've created
>https://issues.apache.org/jira/browse/NUTCH-1870 and provided a patch.
>
>Feel free to give me feedback. I'll continue to work on my branch ;)
>
>2014-10-03 10:03 GMT+02:00 Albinscode <albinsc...@gmail.com>:
>> Hello Sebastian,
>>
>> Thank you for having taken a look at the global mechanism.
>> I've tried to make it as simple as possible, to focus on "what to extract?".
>>
>> Currently I've got lots of needs (and so ideas). The code will
>> naturally evolve (support of XSLT 2.0) and I would be happy to fully
>> give this code to the community.
>>
>> Of course, I'll create a JIRA and prepare a patch. I'll take the time
>> to make it as clean as possible.
>>
>> Thank you for your interest.
>>
>> 2014-10-03 6:59 GMT+02:00 Mattmann, Chris A (3980)
>> <chris.a.mattm...@jpl.nasa.gov>:
>>> Agree with Sebastian, if we could make this part of Nutch it
>>> would be great, as I think it would help us do page scraping
>>> a lot better!
>>>
>>> What do you think Albin?
>>>
>>> Cheers,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Sebastian Nagel <wastl.na...@googlemail.com>
>>> Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Date: Thursday, October 2, 2014 at 3:03 PM
>>> To: "dev@nutch.apache.org" <dev@nutch.apache.org>
>>> Subject: Re: Generic xsl parser plugin
>>>
>>>>Hi Albin,
>>>>
>>>>the plugin looks very nice!
>>>>I like the clean and extensible way in which
>>>>fields are filled by XPath statements.
>>>>To use XSLT functions to do the cleansing
>>>>of extracted text (you hardly ever can do without!)
>>>>is an excellent idea!
>>>>
>>>>I hope to find the time soon to look at it in more detail
>>>>and give it a trial.
>>>>
>>>>Even more I would like to see the plugin as part of Nutch.
>>>>Are you willing to open a Jira for it and provide a patch?
>>>>
>>>>Thanks a lot,
>>>>Sebastian
>>>>
>>>>On 10/02/2014 10:26 AM, Albinscode wrote:
>>>>> Hi all,
>>>>>
>>>>> I've created two posts on my blog to describe and use the xsl plugin:
>>>>>
>>>>> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>>>>>
>>>>> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>>>>>
>>>>> The source code is available on
>>>>> https://code.google.com/p/nutch-parse-xsl-plugin/.
>>>>> I'll update the google code wiki to gather information from my blog.
>>>>>
>>>>> If you have any comments, feel free.
>>>>> As I'm currently using it to crawl different web sites related to
>>>>> searching friends I'll have lots of examples to provide.
>>>>>
>>>>> Have a nice day!
>>>>>
>>>>> Albin
>>>>>
>>>>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <albinsc...@gmail.com>:
>>>>>
>>>>> Ok, perfect, so I didn't waste my time. I'm finishing my basic
>>>>> implementation for my own needs and I'll post it to google code or
>>>>> other repo if the community is interested.
>>>>> I'll work on a small doc too.
>>>>> Thank you for your answer.
>>>>>
>>>>> On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>>>>> <lists.digitalpeb...@gmail.com> wrote:
>>>>>
>>>>> Hi Albin,
>>>>>
>>>>> You don't have to have a separate plugin for each html structure you
>>>>> want to parse. You can have a single plugin with multiple
>>>>> HTMLParseFilters.
>>>>>
>>>>> Having a generic extractor with the extraction logic configured in
>>>>> an external file is definitely a good idea and would make a great
>>>>> contribution to the project. In a nutshell, you haven't missed
>>>>> anything and that wheel definitely needs inventing ;-)
>>>>>
>>>>> Best
>>>>>
>>>>> Julien
>>>>>
>>>>>
>>>>> On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:
>>>>>
>>>>> Hello everybody,
>>>>>
>>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>>> an existing nutch plugin.
>>>>>
>>>>> Let's take an example.
>>>>> I want to extract some metadata from "div" or "td" tags from html
>>>>> pages that have specific ids and name them the way I like (this is
>>>>> done at parser time).
>>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>>> to add my custom metadata.
>>>>>
>>>>> Currently from what I've seen on the wiki and by quickly analyzing
>>>>> plugins I suppose I have to code my own plugin each time I've got a
>>>>> new site (with a new html structure). I've already done that by using
>>>>> a node walker in a custom htmlParseFilter but the extraction can be a
>>>>> little bit boring :)
>>>>>
>>>>> So on my side I've coded a little plugin that enables me to specify
>>>>> xpaths in an xml file. But before diving into more functionalities I'm
>>>>> just wondering if I missed something.
>>>>> This work allowed me to explore some nutch aspects but I don't want to
>>>>> reinvent the wheel or miss something.
>>>>>
>>>>> Albin
>>>>>
>>>>> --
>>>>> Open Source Solutions for Text Engineering
>>>>>
>>>>> http://digitalpebble.blogspot.com/
>>>>> http://www.digitalpebble.com
>>>>> http://twitter.com/digitalpebble
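
For readers following the thread, here is a minimal sketch of the idea discussed above: XPath expressions kept in an external XML file and applied to a page at parse time to fill named metadata fields. It is not the plugin's code. It is written in Python rather than as a Nutch HTMLParseFilter in Java, and the rules format, the field names and the sample page are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules file: each <field> maps a metadata name to an XPath.
# The format is an assumption, not the one used by the plugin in this thread.
RULES_XML = """
<rules>
  <field name="author" xpath=".//td[@id='author']"/>
  <field name="price" xpath=".//div[@id='price']"/>
</rules>
"""

# Toy page; it is well-formed XML so that ElementTree can parse it directly.
PAGE_XHTML = """
<html><body>
  <table><tr><td id="author"> Jane Doe </td></tr></table>
  <div id="price">42 EUR</div>
  <div id="other">ignored</div>
</body></html>
"""


def load_rules(rules_xml):
    """Return a {field name: xpath} mapping read from the rules document."""
    root = ET.fromstring(rules_xml)
    return {f.get("name"): f.get("xpath") for f in root.findall("field")}


def extract_metadata(page_xhtml, rules):
    """Apply every rule to the page and collect the trimmed text it matches."""
    page = ET.fromstring(page_xhtml)
    metadata = {}
    for name, xpath in rules.items():
        node = page.find(xpath)  # first match only; unmatched rules are skipped
        if node is not None:
            metadata[name] = "".join(node.itertext()).strip()
    return metadata


metadata = extract_metadata(PAGE_XHTML, load_rules(RULES_XML))
print(metadata)
```

Supporting a new site then means adding `<field>` entries to the rules file rather than writing a new filter. Note that ElementTree implements only a subset of XPath and expects well-formed input, so real HTML would need a tolerant parser in front of it.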