Hi,

We could indeed reuse RobotsData and refactor it a bit.

Ken, you said you'd be keen to contribute your code for robots parsing as
well - do you think it would be quicker than refactoring Manifold's code? Or
does it support additional features? What about Droids?

Julien

PS: Anyone attending BerlinBuzzwords next week?


On 2 June 2011 17:57, Karl Wright <daddy...@gmail.com> wrote:

> I don't think it would be hard to peel out the robots parser, although
> obviously it would need refactoring to live in a more standard library
> environment.  If you want to look at it, it is in:
>
>
> https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java
>
> Look for the static class "RobotsData", around line 299.
>
> Karl
>
>
>
> On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > Hi Karl,
> >
> > Maybe a good start would be to identify which parts of your crawler could be
> > shared and would not take too much effort to make generic. I haven't
> > looked at the crawler's code in great detail, but do you think the
> > robots parser would be a good candidate?
> >
> > Julien
> >
> > On 2 June 2011 16:23, Karl Wright <daddy...@gmail.com> wrote:
> >
> >> Absolutely!
> >> We're a bit thin on active committers at the moment, which will
> >> probably limit our ability to take any highly active roles in your
> >> development process.  But we do have a pile of code which you might be
> >> able to leverage, and once there is common functionality available I
> >> think we'd all prefer to use that rather than home-grown code.
> >>
> >> How would you prefer that we proceed?
> >>
> >> Karl
> >>
> >>
> >> On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
> >> <lists.digitalpeb...@gmail.com> wrote:
> >> > Hi guys,
> >> >
> >> > I'd just like to mention Crawler Commons, which is an effort between the
> >> > committers of various crawl-related projects (Nutch, Bixo or Heritrix) to
> >> > put some basic functionality in common. We currently have mostly a top
> >> > level domain finder and a sitemap parser, but are definitely planning to
> >> > have other things there as well, e.g. a robots.txt parser, protocol
> >> > handlers, etc.
> >> >
> >> > Would you like to get involved? There are quite a few things that the
> >> > crawler in Manifold could reuse or contribute to.
> >> >
> >> > Best,
> >> >
> >> > Julien
> >> >
> >> > --
> >> > *
> >> > *Open Source Solutions for Text Engineering
> >> >
> >> > http://digitalpebble.blogspot.com/
> >> > http://www.digitalpebble.com
> >> >
> >>
> >
> >
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> >
>



-- 
*
*Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
