Steve Severance wrote:
> Let me actually refine that question: why do some directories like the linkdb
> have a current, and why do others like parse_data not? Is there a convention
> on this?

First, to answer your original question: you should use the MapFileOutputFormat class for reading such output. It handles these part-xxxx subdirectories automatically.

Second, the "current" subdirectory is there in order to properly handle DB updates - or actually replacements - see e.g. the CrawlDb.install() method for details. This is not needed in the case of segments, which are created once and never updated.

Third, although you didn't ask about it ;) the latest version of Hadoop contains a handy facility called Counters. If you use the power method to compute PageRank (PR), you need to collect the PR from dangling nodes in order to redistribute it later. You can use Counters for this, and save on a separate aggregation step.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
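
To illustrate why an updatable DB keeps a "current" subdirectory while a write-once segment does not, here is a minimal sketch of the replacement step in plain java.nio. It is not the Nutch code: the class name CurrentDirInstall, the "old" directory name, and the helper deleteTree are assumptions made for illustration, and the real CrawlDb.install() works against the Hadoop FileSystem API rather than local paths. The idea is that a new version is written to a separate directory first and is swapped in as "current" only once it is complete.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of the "current" swap; not taken from Nutch.
class CurrentDirInstall {

    // Recursively delete a directory tree, if it exists (children before parents).
    static void deleteTree(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Make freshDir the new db/current. The previous version is parked in
    // db/old during the swap and removed once the new one is in place.
    static void install(Path db, Path freshDir) throws IOException {
        Path current = db.resolve("current");
        Path old = db.resolve("old");
        Files.createDirectories(db);
        if (Files.exists(current)) {
            deleteTree(old);            // clear leftovers from an earlier run
            Files.move(current, old);   // park the previous version
        }
        Files.move(freshDir, current);  // the new version becomes "current"
        deleteTree(old);                // previous version is no longer needed
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("install-demo");
        Path db = root.resolve("linkdb");

        // Simulate a job that wrote its output to a temporary directory.
        Path fresh = Files.createDirectories(root.resolve("linkdb-tmp"));
        Files.write(fresh.resolve("part-00000"), "new data".getBytes());

        install(db, fresh);
        System.out.println(Files.exists(db.resolve("current").resolve("part-00000")));
    }
}
```

Readers always open db/current, so they never see a half-written update. A segment is written once and then left alone, so it has no need for this indirection.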
