Hi,

We could indeed reuse RobotsData and refactor it a bit.
Ken, you said you'd be keen to contribute your code for robots parsing as well. Do you think using it would be quicker than refactoring Manifold's code? Or does it support additional features? What about Droids?

Julien

PS: Is anyone attending BerlinBuzzwords next week?

On 2 June 2011 17:57, Karl Wright <daddy...@gmail.com> wrote:

> I don't think it would be hard to peel out the robots parser, although
> obviously it would need refactoring to live in a more standard library
> environment. If you want to look at it, it is in:
>
> https://svn.apache.org/repos/asf/incubator/lcf/trunk/connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/RobotsManager.java
>
> Look for the static class "RobotsData", around line 299.
>
> Karl
>
> On Thu, Jun 2, 2011 at 11:35 AM, Julien Nioche
> <lists.digitalpeb...@gmail.com> wrote:
> > Hi Karl,
> >
> > Maybe a good start would be to identify which parts of your crawler
> > could be shared and would not take too much effort to make generic.
> > I haven't looked at the crawler's code in great detail, but do you
> > think the robots parser would be a good candidate?
> >
> > Julien
> >
> > On 2 June 2011 16:23, Karl Wright <daddy...@gmail.com> wrote:
> >
> >> Absolutely!
> >> We're a bit thin on active committers at the moment, which will
> >> probably limit our ability to take any highly active roles in your
> >> development process. But we do have a pile of code which you might be
> >> able to leverage, and once there is common functionality available I
> >> think we'd all prefer to use that rather than home-grown code.
> >>
> >> How would you prefer that we proceed?
> >>
> >> Karl
> >>
> >> On Thu, Jun 2, 2011 at 11:11 AM, Julien Nioche
> >> <lists.digitalpeb...@gmail.com> wrote:
> >> > Hi guys,
> >> >
> >> > I'd just like to mention Crawler Commons, which is an effort between
> >> > the committers of various crawl-related projects (Nutch, Bixo or
> >> > Heritrix) to put some basic functionality in common. We currently
> >> > have mostly a top-level domain finder and a sitemap parser, but are
> >> > definitely planning to have other things there as well, e.g. a
> >> > robots.txt parser, a protocol handler, etc.
> >> >
> >> > Would you like to get involved? There are quite a few things that the
> >> > crawler in Manifold could reuse or contribute to.
> >> >
> >> > Best,
> >> >
> >> > Julien

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com