Re: [Nutch-dev] Apache Droids - standalone crawl framework

rubdabadub Tue, 20 Feb 2007 14:05:24 -0800

On 2/20/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
> Hi Thorsten,
>
> I have quickly looked at the Droid code, and was wondering why you don't
> want to completely reuse the Nutch plugin API in Droid. This way, you
> could reuse the Nutch parse-* plugins without modifications. Just trying
> to understand...



Hmm, interesting. I am not fully on board with Nutch, but what would the
end output of such a crawl look like? For example:

HTML file 1 crawl --> parse-html --> parsing rule, say "pick up <a
href=> tags" --> dump it to a predefined text file (e.g. one "a href"
tag per line, or whatever, based on a template or something)... or?
I ask because the current Nutch saves it in a binary format, so I am
trying to understand here as well.
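
To make the question concrete, here is a rough sketch of the kind of output I mean. It is plain Java and uses neither the Nutch nor the Droids plugin API; the class name LinkDump and its two methods are made up for illustration, and the regex is a simplification that only handles quoted href values (a real parse-html plugin would use a proper HTML parser).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of Nutch or Droids. */
final class LinkDump {

    // Simplified assumption: href values are wrapped in double or single quotes.
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*(\"([^\"]*)\"|'([^']*)')",
            Pattern.CASE_INSENSITIVE);

    /** Returns the href value of every <a> tag, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = A_HREF.matcher(html);
        while (m.find()) {
            // Group 2 holds a double-quoted value, group 3 a single-quoted one.
            links.add(m.group(2) != null ? m.group(2) : m.group(3));
        }
        return links;
    }

    /** Formats the links as plain text, one link per line. */
    static String toLines(List<String> links) {
        StringBuilder sb = new StringBuilder();
        for (String link : links) {
            sb.append(link).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://example.org/a\">A</a> <a href='b.html'>B</a></p>";
        // Prints to stdout here; the same text could be written to a predefined file.
        System.out.print(toLines(extractLinks(html)));
    }
}
```

The point is only the shape of the result: a flat text file with one link per line that any other tool can read, as opposed to a binary store.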

Regards


> > I have finished a first version of an Apache Labs project called Apache
> > Droids and checked it in. All Apache committers have write access there,
> > so feel free to enhance the code. Everybody is welcome to join the
> > effort.
> >
> > Please see http://svn.apache.org/repos/asf/labs/droids/README.TXT to get
> > started.
> >
> > What is this?
> >  -------------
> >  Droids aims to be an intelligent standalone crawl framework that
> >  automatically seeks out relevant online information based on the user's
> >  specifications. The core is a simple crawler which can be
> >  easily extended by plugins. So if a project/app needs special
> > processing for a crawled url one can easily write a plugin to implement
> > the functionality.
> >
> > Why was it created?
> >  -------------------
> >  Mainly because of personal curiosity:
> >  The background of this work is that Cocoon trunk does not provide a
> >  crawler anymore and Forrest is based on it, meaning we cannot update
> >  until we find a crawler replacement. Getting more involved in
> >  Solr and Nutch, I see requests for a generic standalone crawler.
> >
> > What does the first implementation, crawler-x-m02y07, look like?
> >  ----------------------------------------------------------------
> >  I took Nutch, ripped out and modified the awesome plugin/extension
> > framework to create the droid core.
> >  Now I can implement all functionality in plugins. Droids should be
> > very easy to extend.
> >  I wrote some proof of concept plugins that make up crawler-x-m02y07 to
> >  - crawl a URL  (CrawlerImpl)
> >  - extract links (only <a/> ATM) via a parse-html plugin
> >  - merge them with the queue
> >  - save or print out the crawled pages.
> >
> > Why crawler-x-m02y07?
> >  ---------------------
> >  Droids tries to be a framework for different droids.
> >  The first implementation is a "crawler" with the name "x"
> >  first archived in the second "m"onth of the "y"ear 20"07"
> >
> > Next steps
> > ----------
> > I still need to write a droids factory, so that one can write
> > another implementation than Xm02y07 as a crawler plugin and invoke it via
> > the CLI. Another todo is to implement a dependency system like Apache
> > Ivy instead of copying the Nutch approach.
> >
> > Open questions for nutch
> > ------------------------
> > Is there interest in a Nutch crawler plugin that uses native Nutch
> > plugins and imitates the Nutch crawler in Droids? Is Nutch interested in
> > such a plugin? Does it make sense?
> >
> > Please test and report feedback to [EMAIL PROTECTED] I will happily
> > answer all mails there.
> >
> > salu2
> >
>
>
> --
> Renaud Richardet                                      +1 617 230 9112
> my email is my first name at apache.org      http://www.oslutions.com
>
>

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
