Thanks for the pointers, Chris.

On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) <chris.a.mattm...@jpl.nasa.gov> wrote:
> Hi Thiago,
>
> Welcome!
>
> First thing to check out:
>
> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>
> I would follow that by checking out info on how to use our
> source code repo:
>
> http://wiki.apache.org/nutch/UsingGit
>
> OK, now on to your specific questions:
>
> On 4/6/16, 8:48 AM, "Thiago Galery" <tgal...@gmail.com> wrote:
>
> >Dear list,
> >I'm a new Nutch developer and I have a few questions to ask you.
> >
> >1 - Are there any general guidelines for plugin development (in addition
> >to the ones specified in the wiki guide)?
> >I looked around GitHub and it seems that many plugins are developed as a
> >monolithic piece of code that is attached to / forked from the main Nutch
> >repo. I take it that, ideally, plugins should be developed as their own
> >separate repositories, so they can be versioned and tested against
> >different versions of Nutch. Is there a recommended way to do this? I'm
> >considering using git submodules to add plugin repos as Nutch dependencies
> >or else creating symlinks from the plugins folder to the right plugin
> >repositories.
>
> I would recommend plugin development to be done against the master branch of
> Nutch, which you can find a cloned copy of here:
>
> http://github.com/apache/nutch/tree/master
>
> You can follow this process to submit pull requests to add plugins:
>
> http://github.com/apache/nutch/#contributing
>
> >2 - As a specific use case for point (1), I have developed a plugin that
> >reads some Machine Learning models from a directory. Ideally, I'd like to
> >leave the files in the same repository as the plugin, and leave it in a
> >way so that it can be tested, versioned and developed as an independent repo.
>
> Use a Nutch property defined in either $NUTCH/conf/nutch-{default|site}.xml
> Then read the property in your plugin via
> NutchConfiguration.create().get("name")
>
> If the property references a model file, add a property that lists the
> file path (relatively), and then read the property, assuming that your
> Nutch *.job or jar code (depending on whether you are running on Hadoop
> or locally) has access to $NUTCH/conf
>
> >At the moment, I can just make it work by specifying the path to these
> >models in nutch-site.xml, but I wonder whether that directory could be
> >accessible by the plugin in some other way (either by some classes in the
> >Plugin system or by ivy/ant). Any thoughts?
>
> See above.
>
> >3 - Is there any tooling developed by the community to deploy and monitor
> >Nutch applications? At the moment, we have a script that deploys Nutch but
> >is not robust enough. I see that there's a Dockerfile. I'm just wondering
> >if it could be used (possibly together with some other tooling) to
> >provision a Hadoop cluster that the app runs on top of. Another tool to
> >run the crawling steps (fetch, parse, index) and provide some form of
> >monitoring would be great.
>
> We have been working on a project called Memex Explorer:
> http://github.com/memex-explorer/memex-explorer
>
> that provides these types of capabilities. Have a look.
>
> >I hear that this is somehow present in Nutch 2, but I was more
> >interested in Nutch 1 (since v2 is not production ready yet, is it?). I
> >was wondering if there are any community recipes for
> >Chef/Puppet/Ansible/Salt or some work using Kubernetes or Mesos. If anyone
> >has experience with this and could give me some pointers, I would greatly
> >appreciate it.
>
> FYI above.
>
> >4 - At the moment we collect some websites which we extract some metadata
> >from, but we don't need to make the results available in a search server
> >like Solr or ElasticSearch.
> >Is there any queue or streaming based plugin
> >for Nutch, so 'indexing' can be regarded as sending to a queue? I know
> >that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
> >Gora plug-in, but I'm mainly interested in something for Nutch 1 (or else
> >good reasons for moving to Nutch 2).
>
> Lots of people are interested in this and there is Storm Crawler
> that sort of does this, which involves some of the Nutch PMC and
> committers.
>
> Within Nutch there is also work done by my USC masters student and
> Nutch PMC member and committer Sujen Shah where he added a publisher
> using ActiveMQ Artemis that publishes Nutch events so we can display
> what’s up in D3 and JSON. You can see the work here, I intend to commit
> it soon:
>
> https://issues.apache.org/jira/browse/NUTCH-2132
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
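
For reference, the suggestion in the reply to question 2 (store a relative model path in a property, then resolve it against the conf directory) could look roughly like the sketch below. This is only an illustration under stated assumptions: the property name `myplugin.model.file`, the directory `/opt/nutch/conf`, and the class `ModelPathLookup` are all hypothetical. Inside a real plugin the lookup itself would be `NutchConfiguration.create().get("myplugin.model.file")`, as described in the thread; the small XML reader here is a JDK-only stand-in so that the example runs without Nutch or Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ModelPathLookup {

    /** Reads <property><name/><value/></property> entries (the nutch-site.xml layout) into a map. */
    static Map<String, String> loadProperties(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        NodeList props = doc.getElementsByTagName("property");
        Map<String, String> result = new HashMap<>();
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            Node name = p.getElementsByTagName("name").item(0);
            Node value = p.getElementsByTagName("value").item(0);
            if (name != null && value != null) {
                result.put(name.getTextContent().trim(), value.getTextContent().trim());
            }
        }
        return result;
    }

    /** Resolves a relative model path against the conf directory. */
    static Path resolveModel(Path confDir, String relativePath) {
        return confDir.resolve(relativePath).normalize();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical property, written the way it would appear in nutch-site.xml.
        String siteXml = "<configuration>"
            + "<property><name>myplugin.model.file</name>"
            + "<value>models/classifier.bin</value></property>"
            + "</configuration>";
        Map<String, String> conf = loadProperties(
            new ByteArrayInputStream(siteXml.getBytes(StandardCharsets.UTF_8)));

        // Hypothetical location of $NUTCH/conf; adjust for the actual install.
        Path confDir = Paths.get("/opt/nutch/conf");
        System.out.println(resolveModel(confDir, conf.get("myplugin.model.file")));
    }
}
```

Keeping the path relative means the same property value can work both for a local run and for a Hadoop job, provided the model files are shipped alongside the configuration as the reply assumes.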