On 2/20/07, Renaud Richardet <[EMAIL PROTECTED]> wrote:
> Hi Thorsten,
>
> I have quickly looked at the Droid code, and was wondering why you don't
> want to completely reuse the Nutch plugin API in Droid. This way, you
> could reuse the Nutch parse-* plugins without modifications. Just trying
> to understand...

Hmm, interesting. I am not fully on board with Nutch. But what would the
end output of such a crawl look like? For example:

  HTML file 1 crawl --> parse-html --> parsing rule (say, "pick up
  <a href=> tags") --> dump to a predefined text file (e.g. one "a href"
  tag per line, or whatever a template specifies)

Or something else? The current Nutch saves its output in a binary format,
so I am trying to understand here as well.

Regards

> > I have finished a first version of an Apache Labs project called Apache
> > Droids and checked it in. All Apache committers have write access there,
> > so feel free to enhance the code. As said, all committers have write
> > access on droids and everybody is welcome to join the effort.
> >
> > Please see http://svn.apache.org/repos/asf/labs/droids/README.TXT to get
> > started.
> >
> > What is this?
> > -------------
> > Droids aims to be an intelligent standalone crawl framework that
> > automatically seeks out relevant online information based on the user's
> > specifications. The core is a simple crawler which can be
> > easily extended by plugins. So if a project/app needs special
> > processing for a crawled url one can easily write a plugin to implement
> > the functionality.
> >
> > Why was it created?
> > -------------------
> > Mainly because of personal curiosity:
> > The background of this work is that Cocoon trunk does not provide a
> > crawler anymore and Forrest is based on it, meaning we cannot update
> > anymore until we find a crawler replacement. Getting more involved in
> > Solr and Nutch, I see requests for a generic standalone crawler.
> >
> > What does the first implementation crawler-x-m02y07 look like?
> > --------------------------------------------------------------
> > I took nutch, ripped out and modified the awesome plugin/extension
> > framework to create the droid core.
> > Now I could implement all functionality in plugins. Droids should make
> > it very easy to extend it.
> > I wrote some proof of concept plugins that make up crawler-x-m02y07 to
> > - crawl a url (CrawlerImpl)
> > - extract links (only <a/> ATM) via a parse-html plugin
> > - merge them with the queue
> > - save or print out the crawled pages.
> >
> > Why crawler-x-m02y07?
> > ---------------------
> > Droids tries to be a framework for different droids.
> > The first implementation is a "crawler" with the name "x"
> > first archived in the second "m"onth of the "y"ear 20"07"
> >
> > Next steps
> > ----------
> > I still need to write a droids factory, so that one can write
> > an implementation other than Xm02y07 as a crawler plugin and invoke it
> > via the Cli. Another todo is to implement a dependency system like
> > Apache Ivy instead of copying the nutch approach.
> >
> > Open questions for nutch
> > ------------------------
> > Is there interest in a nutch crawler plugin to utilize native nutch
> > plugins and imitate the nutch crawler in Droids? Is nutch interested in
> > such a plugin? Does it make sense?
> >
> > Please test and report feedback to [EMAIL PROTECTED] I will happily
> > answer all mails there.
> >
> > salu2
> >
>
> --
> Renaud Richardet +1 617 230 9112
> my email is my first name at apache.org http://www.oslutions.com
>

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
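
To make the pipeline asked about in this thread concrete (parse an HTML
page, pick up the <a href=> tags, dump them one per line to a text file),
here is a minimal sketch. It is only an illustration and not Droids or
Nutch code: it uses Python's standard html.parser rather than the Java
plugin framework, and the names AnchorCollector, extract_links and
dump_links are hypothetical. Fetching the page and merging links into a
crawl queue are left out.

```python
import io
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


def extract_links(html):
    """Return the href values of all <a> tags, in document order."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def dump_links(links, out):
    """Write one link per line to any file-like object."""
    for link in links:
        out.write(link + "\n")


# Example input: two anchors with an href, one without.
page = (
    '<html><body><a href="http://example.org/a">A</a> '
    '<a name="x">no href</a> <A HREF="/b">B</A></body></html>'
)
buf = io.StringIO()  # stands in for the "predefined text file"
dump_links(extract_links(page), buf)
```

Passing an open text file instead of the StringIO buffer would give the
plain "one link per line" output, as opposed to a binary store.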
