Gaurav Agarwal wrote:
> Hi Andrzej,
>
> Thanks a lot for pointing out the features to me. I greatly appreciate the
> help. Things look a lot better now :)
>
> Just one more thing: Can you point me to any document/email/discussion
> (internal or published) which can give me some insights about the
> architecture of Nutch 0.8.x and maybe information on the kind of data
> that goes in every directory.

If the Wiki doesn't already contain this info (I didn't check), then only the mailing lists may contain it ... though most of the stuff is unchanged: the basic work cycle is still the same.

Data formats differ, e.g. webdb was split into two parts, outlinks are stored in crawl_parse (and in parse_data), and there are those funky part-xxxx subdirectories, which are a side-effect of using Hadoop. Other than that, not much changed in the data layout.

When it comes to the architecture, it was completely rewritten - I don't think there's any detailed documentation on this, though...

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
