ravi jagan wrote:
Cluster Summary

I am running a crawl on about 1 million web domains. After 30% of the map phase is done I
see the following usage.
The Non DFS Used figure seems very high, around 31 GB. Does this mean Nutch is creating too
many temporary files local to the nodes? Is this correct? Hoping someone will answer this
post with at least an OK/not OK.
This is the first crawl on this Hadoop cluster. No other jobs are running. DFS held about
10 GB of data before this job started.

314 files and directories, 460 blocks = 774 total. Heap Size is 14.82 MB / 966.69 MB (1%)
 Configured Capacity :   377.91 GB
 DFS Used        :       60.31 GB
 Non DFS Used    :       31.58 GB
 DFS Remaining   :       286.02 GB
 DFS Used%       :       15.96 %
 DFS Remaining%  :       75.69 %
 Live Nodes      :       8
 Dead Nodes      :       0



It depends on how much content you are downloading. The 31 GB that you see at roughly 300,000 fetched pages works out to about 100 kB per page, which is a relatively large number. Is this with parsing turned on or off? If you are crawling with an unlimited content size, then it's possible that large documents (such as PDFs and Office documents) inflate this number.
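
If that turns out to be the cause, one common way to bound the per-page download size is to set a content limit in nutch-site.xml. A minimal sketch, assuming the stock http.content.limit property (the value below is illustrative, in bytes; -1 means unlimited):

  <!-- nutch-site.xml: cap how many bytes the http protocol plugin keeps per page -->
  <property>
    <name>http.content.limit</name>
    <!-- 65536 bytes is the shipped default; -1 would disable the limit -->
    <value>65536</value>
  </property>

Keep in mind that truncated documents may not parse cleanly, so whether this is acceptable depends on how much you care about the content of large PDFs and Office files.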

The ratio of non-DFS to DFS usage that you observe is normal. Intermediate map output is stored on the local disks of each node, so only during the reduce phase will you see DFS usage climb and non-DFS usage shrink progressively as reduce tasks complete and write their output to HDFS.
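
If the local disks holding that intermediate data are short on space, you can point the map/reduce scratch directories at larger partitions. A sketch, assuming a Hadoop 0.x-style hadoop-site.xml; the /disk1 and /disk2 paths are placeholders for your own mount points:

  <!-- hadoop-site.xml: where intermediate map output and other scratch data are written -->
  <property>
    <name>mapred.local.dir</name>
    <!-- comma-separated list; spread across local disks with free space -->
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/disk1/hadoop-tmp</value>
  </property>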


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
