Re: [Nutch-dev] Sequence File Question

Andrzej Bialecki Wed, 28 Mar 2007 12:35:23 -0800

Steve Severance wrote:
> Let me actually refine that question: why do some directories like the
> linkdb have a current, and why do others like parse_data not? Is there a
> convention on this?


First, to answer your original question: you should use the 
MapFileOutputFormat class for reading such output. Its static 
getReaders() method opens the part-xxxx subdirectories for you, so you 
don't have to handle them yourself.

Second, the "current" subdirectory is there in order to properly handle 
DB updates - or actually replacements - see e.g. the CrawlDb.install() 
method for details. This is not needed in the case of segments, which 
are created once and never updated.
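
To illustrate the replacement idea, here is a minimal sketch of such a 
swap on a local filesystem using plain java.nio.file. It is not the 
actual Nutch code: the class name DbInstallSketch, the "old" directory 
name and the helper methods are assumptions made for the example, and a 
real job would go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of a "current" directory swap; not Nutch code.
class DbInstallSketch {

    // Delete a directory tree if it exists (children before parents).
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    // Replace db/current with freshOutput, keeping the previous version
    // around as db/old until the new one is in place.
    static void install(Path db, Path freshOutput) throws IOException {
        Path current = db.resolve("current");
        Path old = db.resolve("old"); // assumed name for the backup copy
        Files.createDirectories(db);
        if (Files.exists(current)) {
            deleteRecursively(old);       // drop a stale backup, if any
            Files.move(current, old);     // current -> old
        }
        Files.move(freshOutput, current); // new output -> current
        deleteRecursively(old);           // swap succeeded, drop the backup
    }
}
```

Readers always look at db/current, so a job can write its output 
somewhere else and only swap it in once it has finished. Segments never 
go through this step, which is why they have no "current" level.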

Third, although you didn't ask about it ;) the latest version of 
Hadoop contains a handy facility called Counters. If you use the power 
method to compute PR, you need to collect the PR from dangling nodes in 
order to redistribute it later. You can use Counters for this, and save 
a separate aggregation step.
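
As a rough single-process sketch of that idea (assuming PR here means 
PageRank): during the normal pass over the nodes, the rank of every 
dangling node is added to one accumulator, and the total is spread 
evenly over all nodes at the end of the iteration. The class name, the 
SCALE factor and the adjacency-array graph format are assumptions made 
for the example. Hadoop counters hold long values, so the sketch stores 
the fractional rank as a scaled long to mimic that; in a real job the 
accumulator would be a job counter rather than a local variable.

```java
import java.util.Arrays;

// Hypothetical illustration; not Nutch or Hadoop code.
class DanglingPageRankSketch {

    // Assumed fixed-point scale, since counters are integer-valued.
    static final long SCALE = 1_000_000_000L;

    // One power-method iteration over a graph given as out-link arrays.
    static double[] iterate(int[][] outLinks, double[] rank, double damping) {
        int n = rank.length;
        double[] next = new double[n];
        long danglingCounter = 0L; // stand-in for a job-wide counter

        for (int u = 0; u < n; u++) {
            if (outLinks[u].length == 0) {
                // Dangling node: nothing to send, so record its rank.
                danglingCounter += Math.round(rank[u] * SCALE);
            } else {
                double share = rank[u] / outLinks[u].length;
                for (int v : outLinks[u]) {
                    next[v] += share;
                }
            }
        }

        // Redistribute the collected dangling rank evenly over all nodes.
        double danglingShare = ((double) danglingCounter / SCALE) / n;
        for (int v = 0; v < n; v++) {
            next[v] = (1.0 - damping) / n + damping * (next[v] + danglingShare);
        }
        return next;
    }

    // Run a fixed number of iterations from a uniform start vector.
    static double[] run(int[][] outLinks, int iterations, double damping) {
        double[] rank = new double[outLinks.length];
        Arrays.fill(rank, 1.0 / outLinks.length);
        for (int i = 0; i < iterations; i++) {
            rank = iterate(outLinks, rank, damping);
        }
        return rank;
    }
}
```

Because the dangling total is gathered in the same pass that 
distributes rank along the links, no extra pass over the data is needed 
just to sum it up.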


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


