Hi Thiago, Welcome!
First thing to check out:

http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer

I would follow that by checking out the info on how to use our source code repo:

http://wiki.apache.org/nutch/UsingGit

OK, now on to your specific questions:

On 4/6/16, 8:48 AM, "Thiago Galery" <tgal...@gmail.com> wrote:

>Dear list,
>I'm a new Nutch Developer and I have a few questions to ask you.
>
>1 - Are there any general guidelines for plugin development (in addition to
>the ones specified in the wiki guide).
>I looked around github and it seems that many plugins are developed as a
>monolithic piece of code that is attached to / forked from the main Nutch
>repo. I take it that, ideally, plugins should be developed as their own
>separate repositories, so they can be versioned and tested against
>different versions of Nutch. Is there a recommended way to do this ? I'm
>considering using git submodules to add plugin repos as Nutch dependencies
>or else creating symlinks from the plugins folder to the right plugin
>repositories.

I would recommend doing plugin development against the master branch of
Nutch, a cloned copy of which you can find here:

http://github.com/apache/nutch/tree/master

You can follow this process to submit pull requests that add plugins:

http://github.com/apache/nutch/#contributing

>
>2 - As a specific use case for point (1), I have developed a plugin that
>reads some Machine Learning models from a directory. Ideally, I'd like to
>leave the files in the same repository as the plugin, and leave it in a way
>so that it can be tested, versioned and developed as an independent repo.
Use a Nutch property defined in either
$NUTCH/conf/nutch-{default|site}.xml

Then read the property in your plugin via
NutchConfiguration.create().get("name")

If the property references a model file, add a property that gives the
file's relative path, and then read that property, assuming that your
Nutch *.job or jar code (depending on whether you are running on Hadoop
or locally) has access to $NUTCH/conf

>At the moment, I can just make it work by specifying the path to these
>models in nutch-site.xml, but I wonder whether that directory could be
>accessible by the plugin in some other way (either by some classes in the
>Plugin system or by ivy/ant). Any thoughts ?

See above.

>
>3 - Is there any tooling developed by the community to deploy and monitor
>Nutch applications ? At the moment, we have a script that deploys Nutch but
>is not robust enough. I see that there's a Dockerfile. I'm just wondering if
>it could be used (possibly together with some other tooling) to provision a
>hadoop cluster which the app runs on top. Another tool to run the crawling
>steps (fetch, parse, index) and provide some form of monitoring would be
>great.

We have been working on a project called Memex Explorer:

http://github.com/memex-explorer/memex-explorer

that provides these types of capabilities. Have a look.

>I hear that this is somehow present in Nutch 2, but I was more
>interested in Nutch 1 (since v2 is not production ready yet, is it?). I was
>wondering if there are any community recipes for Chef/Puppet/Ansible/Salt
>or some work using Kubernetes or Mesos. If anyone has experience with this
>and could give me some pointers, I would greatly appreciate it.

See the Memex Explorer pointer above.

>
>4 - At the moment we collect some websites which we extract some metadata
>from, but we don't need to make the results available in a search server
>like Solr or ElasticSearch. Is there any queue or streaming based plugin
>for Nutch, so 'indexing' can be regarded as sending to a queue ?
>I know
>that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
>gora plug-in, but I'm mainly interested in something for Nutch 1 (or else
>good reasons for moving to Nutch 2).

Lots of people are interested in this, and there is Storm Crawler, which
sort of does this and involves some of the Nutch PMC and committers.
Within Nutch there is also work by my USC master's student, Nutch PMC
member and committer Sujen Shah, who added a publisher using ActiveMQ
Artemis that publishes Nutch events so we can display what's up in D3
and JSON. You can see the work here; I intend to commit it soon:

https://issues.apache.org/jira/browse/NUTCH-2132

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
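
P.S. On question 2: the lookup described in my answer (read a property, then turn its value into a model file path) could be sketched roughly as below. This is only an illustration under some assumptions. java.util.Properties stands in for the Hadoop Configuration object that NutchConfiguration.create() returns, so the snippet runs without Nutch on the classpath. The property name myplugin.model.path, the class ModelPathResolver, and the choice to resolve relative values against the conf directory are all hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

// Stand-in sketch: in a real plugin, `conf` would be the Hadoop Configuration
// returned by NutchConfiguration.create(), not a java.util.Properties object.
final class ModelPathResolver {

    // Hypothetical property name; it would be declared in nutch-site.xml.
    static final String MODEL_PATH_KEY = "myplugin.model.path";

    // Look up the configured path and resolve it against the conf directory
    // (resolving against conf is an assumption made for this sketch).
    static Path resolve(Properties conf, String key, Path confDir) {
        String value = conf.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException("Missing property: " + key);
        }
        // Path.resolve returns an absolute argument unchanged, so both
        // relative and absolute settings are handled.
        return confDir.resolve(value.trim()).normalize();
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(MODEL_PATH_KEY, "models/classifier.bin");
        // Prints the resolved location of the (hypothetical) model file.
        System.out.println(resolve(conf, MODEL_PATH_KEY, Paths.get("/opt/nutch/conf")));
    }
}
```

In a real plugin you would call get(...) on the Configuration object instead of getProperty(...); the rest of the logic would stay the same. Failing fast on a missing property makes a misconfigured nutch-site.xml show up at startup rather than midway through a crawl.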