[Apologies for cross-posting]
The initial release of crawler-commons is available from:
http://code.google.com/p/crawler-commons/downloads/list
The purpose of this project is to develop a set of reusable Java components
that implement functionality common to any web crawler. These components
http://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java
Look for the static class RobotsData, around line 299.
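To make the discussion concrete, here is a minimal sketch of what a robots.txt rules holder of that kind might look like. The class and method names (SimpleRobotsData, parse, isAllowed) are hypothetical illustrations, not the actual ManifoldCF RobotsData code, and the parsing deliberately handles only the simplest single-group case:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a robots.txt rules holder, in the spirit of the
// RobotsData class referenced above. Names and structure are assumptions,
// not ManifoldCF's actual implementation.
class SimpleRobotsData {
    private final List<String> disallowPrefixes = new ArrayList<String>();

    // Parse only "Disallow:" lines in the "User-agent: *" group.
    // (Real parsers must also handle named agents, Allow rules, etc.)
    static SimpleRobotsData parse(String robotsTxt) {
        SimpleRobotsData data = new SimpleRobotsData();
        boolean appliesToUs = false;
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            int hash = trimmed.indexOf('#');          // strip comments
            if (hash >= 0) trimmed = trimmed.substring(0, hash).trim();
            if (trimmed.isEmpty()) continue;
            String lower = trimmed.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                appliesToUs = trimmed.substring(11).trim().equals("*");
            } else if (appliesToUs && lower.startsWith("disallow:")) {
                String path = trimmed.substring(9).trim();
                // An empty Disallow means "allow everything"; record nothing.
                if (!path.isEmpty()) data.disallowPrefixes.add(path);
            }
        }
        return data;
    }

    // A path is allowed unless it starts with a recorded disallow prefix.
    boolean isAllowed(String path) {
        for (String prefix : disallowPrefixes) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleRobotsData data = SimpleRobotsData.parse(
            "User-agent: *\nDisallow: /private/\nDisallow: /tmp\n");
        System.out.println(data.isAllowed("/index.html"));
        System.out.println(data.isAllowed("/private/page.html"));
    }
}
```

The point of sharing such code through crawler-commons is exactly that these edge cases (named agent groups, Allow precedence, malformed lines) only need to be handled correctly once.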
Karl
On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote:
Hi guys,
I'd just like to mention Crawler Commons, which is an effort among the
committers of various crawl-related projects (Nutch, Bixo, or Heritrix) to
put some basic functionality in common. We currently have mostly a top-level
domain finder and a sitemap parser, but are definitely planning
If there is common functionality available, I
think we'd all prefer to use that rather than home-grown code.
How would you prefer that we proceed?
Karl
On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
lists.digitalpeb...@gmail.com wrote: