Re: Apache Nutch being used at National Snow and Ice Data Center: ESIP Federation

Mattmann, Chris A (388J) Wed, 18 Jul 2012 21:49:07 -0700

Hi Ian,

On Jul 18, 2012, at 10:01 AM, Ian Truslove wrote:


> Chris: message received - I signed up :)

Thanks for doing this!

> 
> As part of Ruth's Libre project (http://nsidc.org/libre/) we are using
> Nutch to find various types of XML data.  We're targeting our search at
> geospatial data, and more specifically cryospheric data, but the tools
> will remain more broadly applicable.  Specifically we are looking for ESIP
> data casts, collection casts, service casts, and ESIP Discovery OpenSearch
> services (all the specs are in
> http://wiki.esipfed.org/index.php/Discovery_Cluster).  These XML documents
> and services are characterizable through fairly simple means such as XML
> namespaces.
> 
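
A minimal sketch of the namespace-based detection described above, written in Python rather than as a Nutch plugin. The function names and labels are hypothetical. Only the Atom and OpenSearch 1.1 namespace URIs are listed; the ESIP-specific namespace URIs would need to be taken from the Discovery Cluster specs.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URI -> label. The labels are illustrative; ESIP-specific
# namespace URIs (from the Discovery Cluster specs) would be added here.
KNOWN_NAMESPACES = {
    "http://a9.com/-/spec/opensearch/1.1/": "opensearch",
    "http://www.w3.org/2005/Atom": "atom",
}


def declared_namespaces(xml_text):
    """Return the set of namespace URIs declared anywhere in the document."""
    uris = set()
    source = io.BytesIO(xml_text.encode("utf-8"))
    try:
        # "start-ns" events carry a (prefix, uri) pair for each declaration.
        for _event, (_prefix, uri) in ET.iterparse(source, events=("start-ns",)):
            uris.add(uri)
    except ET.ParseError:
        # Not well-formed XML, so treat it as having no namespaces.
        return set()
    return uris


def classify(xml_text):
    """Return the sorted labels of the known namespaces the document declares."""
    found = declared_namespaces(xml_text)
    return sorted(label for uri, label in KNOWN_NAMESPACES.items() if uri in found)
```

Checking declared namespaces instead of matching on raw text keeps the test cheap while avoiding false hits on pages that merely mention a namespace URI in their body.
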
> We are currently developing against the Nutch 1.4 tarball distribution
> (SVN HEAD was moving quicker than our configuration could keep up with)
> and plugging into a standalone Solr instance.
> 
> What we have done to date is do some basic configuration work, set the
> code up to play nice(-ish) with Eclipse, our internal SVN, and our
> CI/deployment system, and write some plugins to help us find our various
> XML docs.  We wrote a pair to extract and index the full raw XML content
> of the source document, extending the HtmlParseFilter and IndexingFilter
> respectively.  XML (and of course HTML too) are just wrapped within a
> CDATA section (and CDATA sections within the document are just removed),
> and indexed as a big text blob in Solr.  We can do naive text matching and
> are having success extracting the URLs of the data feeds we're after.
> 
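
The wrapping step might look roughly like the following. This is a sketch in Python, not the plugin code: `wrap_in_cdata` is a hypothetical name, and it assumes that "removed" means the nested CDATA delimiters are stripped while the text inside them is kept.

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def wrap_in_cdata(raw_document):
    """Wrap a raw XML/HTML document in a single CDATA section.

    Nested CDATA delimiters are stripped first, because a "]]>" inside the
    document would otherwise end the outer section early.
    """
    stripped = raw_document.replace(CDATA_OPEN, "").replace(CDATA_CLOSE, "")
    return CDATA_OPEN + stripped + CDATA_CLOSE
```
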
> We also wrote a pair of plugins to keep track of the original index date
> of a document (the overarching use case is to determine documents that are
> newly found).  We used the ScoringFilter and IndexingFilter for those.
> 
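
The "original index date" idea can be illustrated with a small Python sketch. It is independent of the ScoringFilter and IndexingFilter plugins mentioned above; `record_sighting` and the in-memory dictionary are hypothetical stand-ins for whatever store the real plugins use.

```python
from datetime import date


def record_sighting(first_indexed, url, crawl_date):
    """Remember the date a URL was first indexed; return True if it is new."""
    is_new = url not in first_indexed
    # setdefault leaves the original date untouched on every later crawl.
    first_indexed.setdefault(url, crawl_date)
    return is_new
```

Because the stored date never changes after the first sighting, "newly found" documents are the ones whose stored date falls within the most recent crawl.
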
> Planned work includes extracting data from the XML before indexing and
> using Solr fields more effectively, indexing GCMD keywords, simple spatial
> subsetting, and tweaking the ranking algorithms to do a broad search to
> identify good sites for deep data searches.
> 
> Thanks for the interest - it's been a fun project to work on so far, and
> I'm sure we'd be happy to talk more or provide more details.

Super awesome! 

Well, if you get around to it, feel free to:

1. File JIRA issues at our issue tracker,
https://issues.apache.org/jira/browse/NUTCH, describing your changes. Keep
each change as small, incremental, and easy to revert as possible.
2. Create patch files and attach them to the issues that you created in #1.
3. Work with a committer here in Nutch to get your patches contributed. Unit
tests and code that conforms to the rest of the Nutch style (e.g., no tabs)
both help. Doug Cutting used to say that if he could apply a few of your
patches without modification, you were well on track to getting your code
included in the project.

Thanks much! If you have any questions, let me or any of the other Nutch devs
who hang out here know.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
