Hi Karl,

Maybe a good start would be to identify which parts of your crawler
could be shared and made generic without too much effort. I haven't
looked at the crawler's code in great detail, but do you think the
robots parser would be a good candidate?
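To make that concrete, here is a very rough sketch of what a shared
robots.txt API could look like. All of the names here (RobotRules,
RobotRulesSketch, parse) are made up purely for illustration and don't
exist in Crawler Commons or anywhere else. It also only honours
User-agent and Disallow lines, and it applies every matching group
rather than just the most specific one, as a real implementation
should:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a crawler-agnostic robots.txt API; the
    // names are illustrative, not an existing Crawler Commons class.
    public class RobotRulesSketch {

        /** Parsed rules for one user-agent: "may I fetch this path?" */
        public interface RobotRules {
            boolean isAllowed(String path);
        }

        /** Minimal parser: honours User-agent and Disallow lines only. */
        public static RobotRules parse(String robotsTxt, String userAgent) {
            final List<String> disallowed = new ArrayList<String>();
            boolean applies = false;
            for (String line : robotsTxt.split("\n")) {
                // Strip comments and surrounding whitespace.
                int hash = line.indexOf('#');
                if (hash >= 0) line = line.substring(0, hash);
                line = line.trim();
                if (line.isEmpty()) continue;
                int colon = line.indexOf(':');
                if (colon < 0) continue;
                String field = line.substring(0, colon).trim().toLowerCase();
                String value = line.substring(colon + 1).trim();
                if (field.equals("user-agent")) {
                    // A group applies if it names us or uses the wildcard.
                    applies = value.equals("*")
                            || userAgent.toLowerCase().contains(value.toLowerCase());
                } else if (applies && field.equals("disallow") && value.length() > 0) {
                    disallowed.add(value);
                }
            }
            return new RobotRules() {
                public boolean isAllowed(String path) {
                    // Disallow rules are plain path prefixes in this sketch.
                    for (String prefix : disallowed) {
                        if (path.startsWith(prefix)) return false;
                    }
                    return true;
                }
            };
        }

        public static void main(String[] args) {
            RobotRules rules = parse(
                    "User-agent: *\nDisallow: /private/\n", "my-crawler");
            System.out.println(rules.isAllowed("/index.html")); // true
            System.out.println(rules.isAllowed("/private/x"));  // false
        }
    }

A real version would obviously need Allow, Crawl-delay, wildcard
matching and so on, but an interface along these lines would let each
crawler keep its own fetching and caching while sharing the parsing
logic.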
Julien

On 2 June 2011 16:23, Karl Wright <daddy...@gmail.com> wrote:
> Absolutely!
> We're a bit thin on active committers at the moment, which will
> probably limit our ability to take any highly active roles in your
> development process. But we do have a pile of code which you might be
> able to leverage, and once there is common functionality available I
> think we'd all prefer to use that rather than home-grown code.
>
> How would you prefer that we proceed?
>
> Karl
>
>
> On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > Hi guys,
> >
> > I'd just like to mention Crawler Commons, which is an effort between the
> > committers of various crawl-related projects (Nutch, Bixo, Heritrix) to
> > share some basic functionality. We currently have mostly a top-level
> > domain finder and a sitemap parser, but are definitely planning to add
> > other things as well, e.g. a robots.txt parser, protocol handlers, etc.
> >
> > Would you like to get involved? There are quite a few things that the
> > crawler in Manifold could reuse or contribute to.
> >
> > Best,
> >
> > Julien
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com