Hi Thiago,

Sorry for the top post:

1. Yes, you could do conf/models and/or an HDFS URL; either one works.
The conf directory is packaged up when you create a *.job file
for Hadoop by running "ant job". That said, if your job jar includes
model files in the 100 MB to 1 GB range, that's how big your *.job
will be. A better way would probably be to pre-stage the models on
HDFS and reference them via an HDFS URL.
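
For the archives, here is a rough sketch of what that plugin-side wiring
could look like. This is illustrative only: the property name
mymodels.dir, the paths, and the class name are made up, and real code
would live in your plugin with the Nutch and Hadoop jars on the classpath.

```java
// Illustrative sketch (not committed Nutch code): resolve a model
// location from Nutch configuration, whether it is a path relative to
// the packaged conf directory or a pre-staged HDFS URL.
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.util.NutchConfiguration;

public class ModelLoader {

  public static InputStream openModel() throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Hypothetical property, set in $NUTCH/conf/nutch-site.xml, e.g.:
    //   <property>
    //     <name>mymodels.dir</name>
    //     <value>hdfs://namenode:8020/models/model.bin</value>
    //   </property>
    // Pre-staged with: hadoop fs -put models/model.bin /models/model.bin
    String location = conf.get("mymodels.dir", "conf/models/model.bin");
    Path modelPath = new Path(location);
    // getFileSystem() picks HDFS for hdfs:// URLs and the local file
    // system for plain paths, so the same code covers both modes.
    FileSystem fs = modelPath.getFileSystem(conf);
    return fs.open(modelPath);
  }
}
```

In local mode the conf-relative default resolves against the working
directory; on a cluster the hdfs:// URL avoids bundling large models
into the *.job.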

2. Yes, Memex Explorer is currently on hiatus. It was a proof of
feasibility that we used in the DARPA MEMEX program, and it already
takes care of a lot of the stuff you were talking about with Salt,
Docker/Vagrant, and Nutch. That's why I pointed you there;
it's certainly something to build off of rather than reinvent.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 4/8/16, 2:25 AM, "Thiago Galery" <tgal...@gmail.com> wrote:

>Hi Chris, thanks for the response, here are some elaborations of my initial
>questions on the basis of your reply.
>
>On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Hi Thiago,
>>
>> Welcome!
>>
>> First thing to check out:
>>
>> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>>
>>
>> I would follow that by checking out info on how to use our
>> Source Code repo:
>>
>> http://wiki.apache.org/nutch/UsingGit
>>
>>
>> OK now on to your specific questions:
>>
>>
>>
>>
>> On 4/6/16, 8:48 AM, "Thiago Galery" <tgal...@gmail.com> wrote:
>>
>> >Dear list,
>> >I'm a new Nutch Developer and I have a few questions to ask you.
>> >
>> >1 - Are there any general guidelines for plugin development (in addition
>> to
>> >the ones specified in the wiki guide).
>> >I looked around github and it seems that many plugins are developed as a
>> >monolithic piece of code that is attached to / forked from the main Nutch
>> >repo. I take it that, ideally, plugins should be developed as their own
>> >separate repositories, so they can be versioned and tested against
>> >different versions of Nutch. Is there a recommended way to do this ? I'm
>> >considering using git submodules to add plugin repos as Nutch dependencies
>> >or else creating symlinks from the plugins folder to the right plugin
>> >repositories.
>>
>> I would recommend plugin development to be done against the master branch of
>> nutch, which you can find a cloned copy of here:
>>
>> http://github.com/apache/nutch/tree/master
>>
>> You can follow this process to submit pull requests to add plugins:
>>
>> http://github.com/apache/nutch/#contributing
>>
>> >
>> >2 - As a specific use case for point (1), I have developed a plugin that
>> >reads some Machine Learning models from a directory. Ideally, I'd like to
>> >leave the files in the same repository as the plugin, and leave it in a
>> way
>> >so that it can be tested, versioned and developed as an independent repo.
>>
>> Use a nutch property defined in either $NUTCH/conf/nutch-{default|site}.xml
>> Then read the property in your plugin via
>> NutchConfiguration.create().get("name")
>>
>> If the property references a model file, add a property that gives the
>> file path relatively, and then read the property, assuming that your
>> Nutch *.job (when running on Hadoop) or jar code (when running locally)
>> has access to $NUTCH/conf.
>>
>
>
>Could you elaborate on this a bit more? At the moment I'm specifying the
>full path to the models, which works well in local mode but might raise
>problems when running on a Hadoop cluster. I understand that the path
>should be specified relatively, but I'm not sure relative to what. That
>is, if the job file has access to the conf folder, should I put the
>models inside conf and just add the property models.folder = conf/models ?
>I imagine that another option is to use an HDFS url for the models
>location; would that work?
>
>
>
>> >At the moment, I can just make it work by specifying the path to these
>> >models in nutch-site.xml, but I wonder whether that directory could be
>> >accessible by the plugin in some other way (either by some classes in the
>> >Plugin system or by ivy/ant). Any thoughts ?
>>
>> See above.
>>
>> >
>> >3 - Is there any tooling developed by the community to deploy and monitor
>> >Nutch applications ? At the moment, we have a script that deploys Nutch
>> >but it is not robust enough. I see that there's a Dockerfile. I'm just
>> >wondering if it could be used (possibly together with some other tooling)
>> >to provision a Hadoop cluster on top of which the app runs. Another tool
>> >to run the crawling
>> >steps (fetch, parse, index) and provide some form of monitoring would be
>> >great.
>>
>> We have been working on a project called Memex Explorer:
>> http://github.com/memex-explorer/memex-explorer
>>
>
>
>Memex Explorer seems to be really interesting! However, I had some
>issues (tests not passing, Redis not running, some screens unavailable).
>On the GitHub page, it says that the project is not maintained. I'd be
>happy to fix bugs and contribute, but if the project is just going to be
>ditched, then I'd be less inclined to do so.
>Does anyone know what the plans for Memex Explorer are?
>
>
>> that provides these types of capabilities. Have a look.
>>
>> >I hear that this is somehow present in Nutch 2, but I was more
>> >interested in Nutch 1 (since v2 is not production ready yet, is it?). I
>> was
>> >wondering if there are any community recipes for Chef/Puppet/Ansible/Salt
>> >or some work using Kubernetes or Mesos. If anyone has experience with this
>> >and could give me some pointers, I would greatly appreciate it.
>>
>> FYI above.
>>
>> >
>> >4 - At the moment we collect some websites which we extract some metadata
>> >from, but we don't need to make the results available in a search server
>> >like Solr or ElasticSearch. Is there any queue or streaming based plugin
>> >for Nutch, so 'indexing' can be regarded as sending to a queue ? I know
>> >that Nutch 2 has Gora as an abstraction layer, so maybe this could be a
>> >gora plug-in, but I'm mainly interested in something for Nutch 1 (or else
>> >good reasons for moving to Nutch 2).
>>
>> Lots of people are interested in this, and there is StormCrawler,
>> which sort of does this and involves some of the Nutch PMC members
>> and committers.
>>
>> Within Nutch there is also work done by my USC masters student and
>> Nutch PMC member and committer Sujen Shah where he added a publisher
>> using ActiveMQ Artemis that publishes Nutch events so we can display
>> what’s up in D3 and JSON. You can see the work here, I intend to commit
>> it soon:
>>
>> https://issues.apache.org/jira/browse/NUTCH-2132
>>
>>
>> Cheers,
>> Chris
>>
>>
>>
>>
