ravi jagan wrote:
Cluster Summary

I am running a crawl on about 1 million web domains. After 30% of the map phase is done I
see the following usage.
The Non DFS Used figure seems very high, around 31 GB. Does this mean Nutch is creating too
many temporary files local to the nodes? Is this correct? Hoping someone will answer this
post with at least an OK/not OK.
This is the first crawl on this Hadoop cluster. No other jobs are running. DFS held about
10 GB of data before this job started.

314 files and directories, 460 blocks = 774 total. Heap Size is 14.82 MB / 966.69 MB (1%)
 Configured Capacity :   377.91 GB
 DFS Used        :       60.31 GB
 Non DFS Used    :       31.58 GB
 DFS Remaining   :       286.02 GB
 DFS Used%       :       15.96 %
 DFS Remaining%  :       75.69 %
 Live Nodes      :       8
 Dead Nodes      :       0



It depends on how much content you are downloading. The 31 GB that you see at roughly 300,000 fetched pages works out to about 100 kB per page, which is a relatively large number. Is this with parsing turned on or off? If you are crawling with an unlimited content size, then it's possible that large documents (such as PDFs and Office documents) inflate this number.
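
If that turns out to be the cause, one common way to bound the per-page download size is to set a content limit in nutch-site.xml. A minimal sketch, assuming the stock http.content.limit property (the value below is illustrative, in bytes; -1 means unlimited):

  <!-- nutch-site.xml: cap how many bytes the http protocol plugin keeps per page -->
  <property>
    <name>http.content.limit</name>
    <!-- 65536 bytes is the shipped default; -1 would disable the limit -->
    <value>65536</value>
  </property>

Keep in mind that truncated documents may not parse cleanly, so whether this is acceptable depends on how much you care about the content of large PDFs and Office files.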

The ratio of non-DFS to DFS usage that you observe is normal. Intermediate map output is stored on the local disks of each node, so only during the reduce phase will you see DFS usage climb and non-DFS usage shrink progressively as reduce tasks complete and write their output to HDFS.
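
If the local disks holding that intermediate data are short on space, you can point the map/reduce scratch directories at larger partitions. A sketch, assuming a Hadoop 0.x-style hadoop-site.xml; the /disk1 and /disk2 paths are placeholders for your own mount points:

  <!-- hadoop-site.xml: where intermediate map output and other scratch data are written -->
  <property>
    <name>mapred.local.dir</name>
    <!-- comma-separated list; spread across local disks with free space -->
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/disk1/hadoop-tmp</value>
  </property>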


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
