Hi Thiago, sorry for the top post:

1. Yes, you could do conf/models and/or an HDFS URL, either one. The conf
directory is packaged up when you create a *.job file for Hadoop by running
"ant job". That said, if your job jar includes 100 MB-1 GB of model files,
that's how big your *.job will be. A better way would probably be to
pre-stage the models on HDFS, and reference them via an HDFS URL.

2. Yes, MEMEX Explorer is on hiatus right now. It was a proof of
feasibility that we used in the DARPA MEMEX program, and it already takes
care of a lot of the stuff you were talking about with Salt and, e.g.,
Docker/Vagrant and Nutch. That's why I pointed you there; it's certainly
something to build off of rather than re-invent.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

On 4/8/16, 2:25 AM, "Thiago Galery" <tgal...@gmail.com> wrote:

>Hi Chris, thanks for the response, here are some elaborations of my initial
>questions on the basis of your reply.
>
>On Wed, Apr 6, 2016 at 2:12 PM, Mattmann, Chris A (3980) <
>chris.a.mattm...@jpl.nasa.gov> wrote:
>
>> Hi Thiago,
>>
>> Welcome!
>>
>> First thing to check out:
>>
>> http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer
>>
>> I would follow that by checking out info on how to use our
>> Source Code repo:
>>
>> http://wiki.apache.org/nutch/UsingGit
>>
>> OK, now on to your specific questions:
>>
>> On 4/6/16, 8:48 AM, "Thiago Galery" <tgal...@gmail.com> wrote:
>>
>> >Dear list,
>> >I'm a new Nutch developer and I have a few questions to ask you.
>> >
>> >1 - Are there any general guidelines for plugin development (in addition
>> >to the ones specified in the wiki guide)?
>> >I looked around GitHub and it seems that many plugins are developed as a
>> >monolithic piece of code that is attached to / forked from the main Nutch
>> >repo. I take it that, ideally, plugins should be developed as their own
>> >separate repositories, so they can be versioned and tested against
>> >different versions of Nutch. Is there a recommended way to do this? I'm
>> >considering using git submodules to add plugin repos as Nutch
>> >dependencies, or else creating symlinks from the plugins folder to the
>> >right plugin repositories.
>>
>> I would recommend plugin development to be done against the master branch
>> of Nutch, which you can find a cloned copy of here:
>>
>> http://github.com/apache/nutch/tree/master
>>
>> You can follow this process to submit pull requests to add plugins:
>>
>> http://github.com/apache/nutch/#contributing
>>
>> >
>> >2 - As a specific use case for point (1), I have developed a plugin that
>> >reads some machine learning models from a directory. Ideally, I'd like
>> >to leave the files in the same repository as the plugin, and leave it in
>> >a way so that it can be tested, versioned and developed as an
>> >independent repo.
>>
>> Use a Nutch property defined in either $NUTCH/conf/nutch-default.xml or
>> $NUTCH/conf/nutch-site.xml. Then read the property in your plugin via
>> NutchConfiguration.create().get("name")
>>
>> If the property references a model file, add a property that lists the
>> file path (relatively), and then read the property, assuming that your
>> Nutch *.job or jar code (depending on whether you are running on Hadoop
>> or locally) has access to $NUTCH/conf.
>
>Could you elaborate on this a bit more? At the moment I'm specifying the
>full path of the models; this works well in local mode, but might raise
>problems when running on a Hadoop cluster.
>I understand that the path should be specified relatively, but I'm not
>sure relative to what. That is, if the job file has access to the conf
>folder, should I put the models inside conf and just add the property
>models.folder = conf/models? I imagine that another option is to use an
>HDFS URL for the models location; would that work?
>
>> >At the moment, I can just make it work by specifying the path to these
>> >models in nutch-site.xml, but I wonder whether that directory could be
>> >accessible by the plugin in some other way (either by some classes in
>> >the plugin system or by ivy/ant). Any thoughts?
>>
>> See above.
>>
>> >
>> >3 - Is there any tooling developed by the community to deploy and
>> >monitor Nutch applications? At the moment, we have a script that deploys
>> >Nutch but is not robust enough. I see that there's a Dockerfile. I'm
>> >just wondering if it could be used (possibly together with some other
>> >tooling) to provision a Hadoop cluster which the app runs on top of.
>> >Another tool to run the crawling steps (fetch, parse, index) and provide
>> >some form of monitoring would be great.
>>
>> We have been working on a project called Memex Explorer:
>> http://github.com/memex-explorer/memex-explorer
>
>Memex Explorer seems to be really interesting!
>However, I had some issues
>(tests not passing, Redis not running, some screens unavailable).
>On the GitHub page, it says that the project is not maintained. I'd be
>happy to fix bugs and contribute, but if the project is just going to be
>ditched, then I'd be less inclined to do so.
>Does anyone know what the plans for Memex Explorer are?
>
>> that provides these types of capabilities. Have a look.
>>
>> >I hear that this is somehow present in Nutch 2, but I was more
>> >interested in Nutch 1 (since v2 is not production ready yet, is it?). I
>> >was wondering if there are any community recipes for
>> >Chef/Puppet/Ansible/Salt, or some work using Kubernetes or Mesos. If
>> >anyone has experience with this and could give me some pointers, I
>> >would greatly appreciate it.
>>
>> FYI above.
>>
>> >
>> >4 - At the moment we collect some websites which we extract some
>> >metadata from, but we don't need to make the results available in a
>> >search server like Solr or Elasticsearch. Is there any queue- or
>> >streaming-based plugin for Nutch, so 'indexing' can be regarded as
>> >sending to a queue? I know that Nutch 2 has Gora as an abstraction
>> >layer, so maybe this could be a Gora plug-in, but I'm mainly interested
>> >in something for Nutch 1 (or else good reasons for moving to Nutch 2).
>>
>> Lots of people are interested in this, and there is StormCrawler, which
>> sort of does this and involves some of the Nutch PMC and committers.
>>
>> Within Nutch there is also work done by my USC master's student and
>> Nutch PMC member and committer Sujen Shah, where he added a publisher
>> using ActiveMQ Artemis that publishes Nutch events so we can display
>> what's up in D3 and JSON. You can see the work here; I intend to commit
>> it soon:
>>
>> https://issues.apache.org/jira/browse/NUTCH-2132
>>
>> Cheers,
>> Chris
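Pulling answer 1 and the property advice in the thread together, here is a minimal, hedged sketch of how a plugin might tell a local conf/models path apart from models pre-staged on HDFS. The property name models.dir, the property values, and the namenode address are illustrative assumptions, not part of Nutch; only the NutchConfiguration.create().get(...) call comes from the thread above.

```java
// Sketch of the model-location handling discussed in the thread.
// Everything named here is illustrative: the property "models.dir" and
// the namenode address are hypothetical, not from Nutch itself.
//
// The corresponding (hypothetical) entry in $NUTCH/conf/nutch-site.xml,
// pointing at models pre-staged on HDFS, might look like:
//
//   <property>
//     <name>models.dir</name>
//     <value>hdfs://namenode:8020/user/nutch/models</value>
//   </property>
//
// Inside the plugin the value would be read as Chris describes, e.g.:
//   String location = NutchConfiguration.create().get("models.dir");
// (that call needs the Nutch/Hadoop jars, so it is shown only as a comment).

import java.net.URI;

public class ModelLocation {

  /** True if the configured location points at HDFS rather than a local path. */
  public static boolean isHdfs(String location) {
    // A relative path such as "conf/models" has no URI scheme; an HDFS URL does.
    return "hdfs".equals(URI.create(location).getScheme());
  }

  public static void main(String[] args) {
    // Relative path: resolved against $NUTCH / the unpacked *.job contents.
    System.out.println(isHdfs("conf/models"));                            // false
    // Pre-staged models referenced by URL: read via Hadoop's FileSystem API.
    System.out.println(isHdfs("hdfs://namenode:8020/user/nutch/models")); // true
  }
}
```

In deploy mode the HDFS branch would then be opened with Hadoop's FileSystem/Path API rather than java.io.File; in local mode the relative path can be resolved against $NUTCH, which keeps the large model files out of the *.job as suggested above.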