[ANN] Release crawler-commons 0.1

2011-07-06 Thread Julien Nioche
[Apologies for cross-posting] The initial release of crawler-commons is available from: http://code.google.com/p/crawler-commons/downloads/list The purpose of this project is to develop a set of reusable Java components that implement functionality common to any web crawler. These components

Re: CrawlerCommons ManifoldCF

2011-06-03 Thread Julien Nioche
://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java Look for the static class RobotsData, around line 299. Karl On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche lists.digitalpeb

CrawlerCommons ManifoldCF

2011-06-02 Thread Julien Nioche
Hi guys, I'd just like to mention Crawler Commons, which is an effort between the committers of various crawl-related projects (Nutch, Bixo or Heritrix) to put some basic functionalities in common. We currently have mostly a top level domain finder and a sitemap parser, but are definitely planning

Re: CrawlerCommons ManifoldCF

2011-06-02 Thread Julien Nioche
there is common functionality available I think we'd all prefer to use that rather than home-grown code. How would you prefer that we proceed? Karl On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche lists.digitalpeb...@gmail.com wrote: Hi guys, I'd just like to mention Crawler Commons which