(...apologies for the cross posting...)

The Apache Nutch project is pleased to announce the release of Apache Nutch 1.4. The release contents have been pushed out to the main Apache release site, so the release should be available as soon as the mirrors sync.
Apache Nutch is an extensible framework for building out large-scale web-based search. Layered on top of fellow Apache projects Hadoop, Lucene/Solr, and Tika, Nutch provides an out-of-the-box platform for fetching web pages, PDF files, Word documents, and more. Nutch parses the content and its relevant information, indexes its metadata, and makes it available for efficient query and retrieval over modern Internet protocols.

Apache Nutch 1.4 contains a number of improvements and bug fixes. Details can be found in the changes file:

http://www.apache.org/dist/nutch/CHANGES-1.4.txt

Apache Nutch is available in source and binary form from the following download page:

http://www.apache.org/dyn/closer.cgi/nutch/

Nutch is also available as a Jar dependency from the Central repository:

http://repo2.maven.org/maven2/org/apache/nutch/

In the initial 48 hours, the release may not be available on all mirrors. When downloading from a mirror site, please remember to verify the downloads using signatures found on the Apache site:

http://www.apache.org/dist/nutch/KEYS

For more information on Apache Nutch, visit the project home page:

http://nutch.apache.org

-- Chris Mattmann (on behalf of the Apache Nutch community)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++