Thanks for the pointers, Chris.

On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Thiago,
>
> Welcome!
>
> First thing to check out:
>
> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>
>
> I would follow that by checking out info on how to use our
> Source Code repo:
>
> http://wiki.apache.org/nutch/UsingGit
>
>
> OK now on to your specific questions:
>
>
>
>
> On 4/6/16, 8:48 AM, "Thiago Galery" <tgal...@gmail.com> wrote:
>
> >Dear list,
> >I'm a new Nutch Developer and I have a few questions to ask you.
> >
> >1 - Are there any general guidelines for plugin development (in addition
> >to the ones specified in the wiki guide)?
> >I looked around github and it seems that many plugins are developed as a
> >monolithic piece of code that is attached to / forked from the main Nutch
> >repo. I take it that, ideally, plugins should be developed as their own
> >separate repositories, so they can be versioned and tested against
> >different versions of Nutch. Is there a recommended way to do this? I'm
> >considering using git submodules to add plugin repos as Nutch dependencies
> >or else creating symlinks from the plugins folder to the right plugin
> >repositories.
>
> I would recommend that plugin development be done against the master
> branch of Nutch, a clone of which you can find here:
>
> http://github.com/apache/nutch/tree/master
>
> You can follow this process to submit pull requests to add plugins:
>
> http://github.com/apache/nutch/#contributing
>
> >
> >2 - As a specific use case for point (1), I have developed a plugin that
> >reads some Machine Learning models from a directory. Ideally, I'd like to
> >leave the files in the same repository as the plugin, and leave it in a
> >way so that it can be tested, versioned and developed as an independent repo.
>
> Use a Nutch property defined in either $NUTCH/conf/nutch-{default|site}.xml.
> Then read the property in your plugin via
> NutchConfiguration.create().get("name")
>
> If the property references a model file, add a property that gives the
> file path as a relative path, and then read the property, assuming that
> your Nutch *.job or jar code (depending on whether you are running on
> Hadoop or locally) has access to $NUTCH/conf.
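
A minimal sketch of the lookup described above, not a definitive implementation. The property name myplugin.model.dir, the class name ModelPathResolver, and the /opt/nutch/conf directory are hypothetical, chosen only for illustration. The configuration is passed in as a plain lookup function so the example runs without Nutch or Hadoop on the classpath; in a real plugin the lookup would be the get method of the object returned by NutchConfiguration.create().

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class ModelPathResolver {

    // Hypothetical property name; it would be declared in nutch-site.xml.
    static final String MODEL_DIR_KEY = "myplugin.model.dir";

    /**
     * Reads the model directory property and, if the value is a relative
     * path, resolves it against confDir. Absolute values are returned as-is.
     */
    static Path resolveModelDir(Function<String, String> conf, Path confDir) {
        String value = conf.apply(MODEL_DIR_KEY);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalStateException(
                "Property " + MODEL_DIR_KEY + " is not set");
        }
        Path path = Paths.get(value.trim());
        return path.isAbsolute() ? path : confDir.resolve(path).normalize();
    }

    public static void main(String[] args) {
        // Stand-in for the real configuration. In a plugin this would be:
        //   Configuration conf = NutchConfiguration.create();
        //   Path dir = resolveModelDir(conf::get, confDir);
        Map<String, String> fakeConf = new HashMap<>();
        fakeConf.put(MODEL_DIR_KEY, "models/classifier");

        // Assumed location of $NUTCH/conf on this machine.
        Path confDir = Paths.get("/opt/nutch/conf");
        System.out.println(resolveModelDir(fakeConf::get, confDir));
    }
}
```

This only works when the conf directory is reachable as an ordinary filesystem path, which is the same assumption stated above for local versus Hadoop runs.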
>
> >At the moment, I can just make it work by specifying the path to these
> >models in nutch-site.xml, but I wonder whether that directory could be
> >accessible by the plugin in some other way (either by some classes in the
> >Plugin system or by ivy/ant). Any thoughts?
>
> See above.
>
> >
> >3 - Is there any tooling developed by the community to deploy and monitor
> >Nutch applications? At the moment, we have a script that deploys Nutch but
> >is not robust enough. I see that there's a Dockerfile. I'm just wondering
> >if it could be used (possibly together with some other tooling) to provision
> >a Hadoop cluster which the app runs on top of. Another tool to run the
> >crawling steps (fetch, parse, index) and provide some form of monitoring
> >would be great.
>
> We have been working on a project called Memex Explorer:
> http://github.com/memex-explorer/memex-explorer
>
> that provides these types of capabilities. Have a look.
>
> >I hear that this is somehow present in Nutch 2, but I was more
> >interested in Nutch 1 (since v2 is not production ready yet, is it?). I
> >was wondering if there are any community recipes for Chef/Puppet/Ansible/Salt
> >or some work using Kubernetes or Mesos. If anyone has experience with this
> >and could give me some pointers, I would greatly appreciate it.
>
> FYI above.
>
> >
> >4 - At the moment we collect some websites which we extract some metadata
> >from, but we don't need to make the results available in a search server
> >like Solr or ElasticSearch. Is there any queue or streaming based plugin
> >for Nutch, so 'indexing' can be regarded as sending to a queue? I know
> >that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
> >gora plug-in, but I'm mainly interested in something for Nutch 1 (or else
> >good reasons for moving to Nutch 2).
>
> Lots of people are interested in this and there is Storm Crawler
> that sort of does this, which involves some of the Nutch PMC and
> committers.
>
> Within Nutch there is also work done by my USC masters student and
> Nutch PMC member and committer Sujen Shah where he added a publisher
> using ActiveMQ Artemis that publishes Nutch events so we can display
> what’s up in D3 and JSON. You can see the work here; I intend to commit
> it soon:
>
> https://issues.apache.org/jira/browse/NUTCH-2132
>
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Director, Information Retrieval and Data Science Group (IRDS)
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> WWW: http://irds.usc.edu/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
