Agree with Sebastian, if we could make this part of Nutch it would be great, as I think it would help us do page scraping a lot better!
What do you think, Albin?

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Sebastian Nagel <wastl.na...@googlemail.com>
Reply-To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Date: Thursday, October 2, 2014 at 3:03 PM
To: "dev@nutch.apache.org" <dev@nutch.apache.org>
Subject: Re: Generic xsl parser plugin

>Hi Albin,
>
>the plugin looks very nice!
>I like the clean and extensible way in which
>fields are filled by XPath statements.
>To use XSLT functions to do the cleansing
>of extracted text (you hardly ever can do without!)
>is an excellent idea!
>
>I hope to find the time soon to look at it in more detail
>and give it a trial.
>
>Even more I would like to see the plugin as part of Nutch.
>Are you willing to open a Jira for it and provide a patch?
>
>Thanks a lot,
>Sebastian
>
>On 10/02/2014 10:26 AM, Albinscode wrote:
>> Hi all,
>>
>> I've created two posts on my blog to describe and use the xsl plugin:
>>
>> http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/
>> http://albinscoding.wordpress.com/2014/09/17/fast-nutch-configuration/
>>
>> The source code is available on
>> https://code.google.com/p/nutch-parse-xsl-plugin/.
>> I'll update the google code wiki to gather information from my blog.
>>
>> If you have any comment feel free.
>> As I'm currently using it to crawl different web sites related to
>> searching friends I'll have lots of examples to provide.
>>
>> Have a nice day!
>>
>> Albin
>>
>> 2014-09-25 16:18 GMT+02:00 Albin Vigier <albinsc...@gmail.com>:
>>
>> Ok, perfect, so I didn't waste my time. I'm finishing my basic
>> implementation for my own needs and I'll post it to google code
>> or other repo if the community is interested.
>> I'll work on a small doc too.
>> Thank you for your answer.
>>
>> On Thu, Sep 25, 2014 at 2:49 PM, Julien Nioche
>> <lists.digitalpeb...@gmail.com> wrote:
>>
>> Hi Albin,
>>
>> You don't have to have a separate plugin for each html structure
>> you want to parse. You can have a single plugin with multiple
>> HTMLParseFilters.
>>
>> Having a generic extractor with the extraction logic configured
>> in an external file is definitely a good idea and would make a
>> great contribution to the project. In a nutshell, you haven't
>> missed anything and that wheel definitely needs inventing ;-)
>>
>> Best
>>
>> Julien
>>
>>
>> On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com> wrote:
>>
>> Hello everybody,
>>
>> I'm just wondering if it is possible to fetch specific metadata with
>> an existing nutch plugin.
>>
>> Let's take an example.
>> I want to extract some metadata from "div" or "td" tags from html
>> pages that have specific ids and name them the way I like (this is
>> done at parser time).
>> Then, at indexer time, I would use index-metadata (a very good plugin)
>> to add my custom metadata.
>>
>> Currently from what I've seen on the wiki and by quickly analyzing
>> plugins I suppose I have to code my own plugin each time I've got a
>> new site (with a new html structure). I've already done that by using
>> a node walker in a custom htmlParseFilter but the extraction can be a
>> little bit boring :)
>>
>> So on my side I've coded a little plugin that enables me to specify
>> xpaths in an xml file.
>> But before diving into more functionalities I'm
>> just wondering if I have missed something.
>> This work allowed me to explore some nutch aspects but I don't want to
>> reinvent the wheel or miss something.
>>
>> Albin
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>> http://twitter.com/digitalpebble
>>
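
For readers following the thread, here is a minimal sketch of the idea
discussed above: field names mapped to XPath expressions, evaluated against
a page, with an XPath function doing the text cleansing. It is not the
plugin's actual code. The class name XPathFieldSketch, the extract() helper,
the field names and the sample markup are all hypothetical, and it assumes
the page is already well-formed XHTML (inside Nutch the parser would supply
a DOM instead). Only the standard javax.xml APIs are used.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not taken from the nutch-parse-xsl-plugin sources.
class XPathFieldSketch {

    // Evaluates each XPath rule against a well-formed XHTML string
    // and returns a map of field name -> cleaned text value.
    static Map<String, String> extract(String xhtml, Map<String, String> fieldXPaths)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        XPath xpath = XPathFactory.newInstance().newXPath();

        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, String> rule : fieldXPaths.entrySet()) {
            // normalize-space() trims and collapses whitespace; it stands in
            // here for the cleansing step mentioned in the thread.
            String value = xpath.evaluate("normalize-space(" + rule.getValue() + ")", doc);
            fields.put(rule.getKey(), value);
        }
        return fields;
    }

    public static void main(String[] args) throws Exception {
        // Sample page with a "div" and a "td" carrying specific ids (assumed markup).
        String page = "<html><body>"
                + "<div id=\"name\">  Jane   Doe </div>"
                + "<table><tr><td id=\"city\">Paris</td></tr></table>"
                + "</body></html>";

        // In the approach described above these rules would live in an
        // external XML file; they are inlined here to keep the sketch short.
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("person.name", "//div[@id='name']");
        rules.put("person.city", "//td[@id='city']");

        System.out.println(extract(page, rules));
    }
}
```

Supporting a new site then means writing a new set of rules rather than a
new parse filter, which is the point Julien makes about keeping the
extraction logic in an external file.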