Re: [Nutch-general] 0.8.x Crawler compared to 0.7.2 Crawler

Andrzej Bialecki Wed, 28 Mar 2007 12:43:32 -0800

Gaurav Agarwal wrote:
> Hi Andrzej,
> 
> Thanks a lot for pointing out the features to me. I greatly appreciate the
> help. Things look a lot better now :)
> 
> Just one more thing: Can you point me to any document/email/discussion
> (internal or published) which can give me some insights about the
> architecture of Nutch 0.8.x, and maybe information on the kind of data
> that goes in every directory?


If the Wiki doesn't already contain this info (I didn't check), then the
mailing lists are probably the only place to find it. Most of the stuff
is the same, though: the basic work cycle hasn't changed. Data formats
differ, e.g. webdb was split into two parts, outlinks are stored in
crawl_parse (and in parse_data), and there are those funky part-xxxx
subdirectories, which are a side-effect of using Hadoop. Other than
that, not much changed in the data layout.
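
To see where the part-xxxx entries sit in a given crawl directory, a small script along these lines may help. It is only a sketch: the function name find_part_dirs is made up for illustration, it assumes the crawl data is on the local filesystem, and it assumes the entries are named "part-" followed by digits. It walks whatever directory it is given and does not rely on any particular Nutch directory names.

```python
import os
import re
from collections import defaultdict

# Assumption: Hadoop output entries are named "part-" followed by digits.
PART_RE = re.compile(r"^part-\d+$")


def find_part_dirs(root):
    """Map each directory under root (relative path) to the sorted
    part-NNNNN entries (files or subdirectories) found directly in it."""
    found = defaultdict(list)
    for dirpath, dirnames, filenames in os.walk(root):
        # A part entry may be a plain file or a directory, so check both.
        for name in dirnames + filenames:
            if PART_RE.match(name):
                found[os.path.relpath(dirpath, root)].append(name)
    return {parent: sorted(names) for parent, names in found.items()}


if __name__ == "__main__":
    import sys

    # Usage: python find_parts.py <crawl-dir>
    for parent, parts in sorted(find_part_dirs(sys.argv[1]).items()):
        print(parent, parts)
```

Run against a crawl directory, this prints each directory that holds part-xxxx entries, which gives a quick picture of which pieces of the layout come from Hadoop.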

When it comes to the architecture, it was completely rewritten - I don't 
think there's any detailed documentation on this, though...


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


