Hi Manish,

>- why 15+ GBs?  Do we allocate all memory to the NameNode? or  
>just allocate some number using -Xmx and leave the rest available so  
>the machine doesn't start swapping?


We allocated memory using -Xmx. The NameNode stores the HDFS namespace in
memory, so the bigger your namespace, the bigger your heap needs to be. My
guess is that if you have more than 15 million files with 20 million blocks,
you might need such a big system. But again, it's best to watch how your
NameNode is performing and how much memory it is consuming.
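
A rough back-of-the-envelope sketch of that sizing, in Python. The 150 bytes
per namespace object is only an often-quoted rule of thumb that I am assuming
here, and the headroom factor and function name are made up for illustration.
Measure your own NameNode before trusting any of these numbers.

```python
# Rough NameNode heap estimate from namespace size.
# ASSUMPTION: ~150 bytes of heap per file or block object (rule of thumb,
# not a measured value). The headroom factor is also hypothetical.
BYTES_PER_OBJECT = 150


def estimate_namenode_heap_gb(num_files, num_blocks, headroom=2.0):
    """Return an approximate heap size in GB for the given namespace."""
    raw_bytes = (num_files + num_blocks) * BYTES_PER_OBJECT
    return raw_bytes * headroom / (1024 ** 3)


if __name__ == "__main__":
    gb = estimate_namenode_heap_gb(15_000_000, 20_000_000)
    print(f"Estimated heap: {gb:.1f} GB (pass something like this to -Xmx)")
```

Whatever number comes out, leave enough RAM outside the heap for the OS so
the machine does not start swapping.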

>  - why RAID5?
> - If running RAID 5, why is this necessary?
Not absolutely necessary. The NameNode index, or metadata, is a critical piece
of data; you cannot afford to lose or corrupt it. That is the reason we have
the option of specifying multiple directories, so that copies are kept in
parallel. You can put those directories wherever you like: multiple drives,
NFS, and so on.
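
As a small sketch of that setup: dfs.name.dir takes a comma-separated list of
directories, and the copies only protect you if the directories sit on
different devices. The Python below builds the value and flags directories
that share a device. The helper names and the example paths are hypothetical,
and the device check (st_dev) is just one simple way to approximate
"separate disks".

```python
import os


def name_dir_value(dirs):
    """Join directories into the comma-separated form used by dfs.name.dir."""
    return ",".join(dirs)


def dirs_sharing_a_device(dirs):
    """Return groups of directories that live on the same device.

    Every directory must already exist, otherwise os.stat raises an error.
    """
    by_device = {}
    for d in dirs:
        by_device.setdefault(os.stat(d).st_dev, []).append(d)
    return [group for group in by_device.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical paths: one local disk and one NFS mount.
    candidates = ["/data/1/dfs/name", "/mnt/nfs/dfs/name"]
    print("dfs.name.dir =", name_dir_value(candidates))
    existing = [d for d in candidates if os.path.isdir(d)]
    for group in dirs_sharing_a_device(existing):
        print("Warning: same device:", ", ".join(group))
```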

>- Configure the name node to store one set of transaction logs on a  
>separate disk from the index.
> why?
This feature is not yet supported, but it would be a good one to have. Right
now both the transaction logs and the index (I am assuming this means the
image) are in the same directory and cannot be configured to be placed in
separate directories. We should correct the wiki.

> - Configure the name node to store another set of transaction logs to  
> a network mounted disk.
>      - why?
As explained above, this is to have multiple copies of your metadata 
(dfs.name.dir in particular).

>- Do not host DataNode, JobTracker or TaskTracker services on the  
>same system.
Typically, DataNode and TaskTracker run on all nodes, while JobTracker runs on
a dedicated node, as the NameNode (and SecondaryNameNode) do.
Sometimes a TaskTracker might crash and bring down a node, and you do not want
your JobTracker or NameNode to be on that system.

PS: Could you point to the wiki you are referring to? We might need to make 
some corrections.

Thanks,
Lohit

----- Original Message ----
From: Manish Shah <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, August 12, 2008 11:24:45 AM
Subject: NameNode hardware specs

Can someone help explain in a little more detail some of the reasons  
for the hardware specs that were recently added to the wiki for the  
NameNode.  I guess I'm interested in learning how others have settled  
on these specs?  Is it by observed behavior, or just recommended by  
other hadoop users?

- Use a good server with lots (15GB+) of RAM.
      - why 15+ GBs?  Do we allocate all memory to the NameNode? or  
just allocate some number using -Xmx and leave the rest available so  
the machine doesn't start swapping?

- Consider using fast RAID5 storage for keeping the index.
      - why RAID5?

- List more than one name node directory in the configuration, so  
that multiple copies of the indices will be stored. As long as the  
directories are on separate disks, a single full disk will not  
corrupt the index.
      - If running RAID 5, why is this necessary?

- Configure the name node to store one set of transaction logs on a  
separate disk from the index.
      - why?

- Configure the name node to store another set of transaction logs to  
a network mounted disk.
      - why?

- Do not host DataNode, JobTracker or TaskTracker services on the  
same system.
      - how much memory would the job tracker need?  Does it use a  
lot of CPU? In general, what are good specs for a job tracker machine  
and can the machine be shared with other services?

Thanks so much for the help.  I think it would be hugely helpful for  
the community to start describing their respective setups for hadoop  
clusters in more detail than just the config for datanodes and  
cluster size.  I think we all want to be confident that we are  
spending money on the right machines to grow our cluster the right way.


Most appreciated,

- Manish
Co-Founder Rapleaf.com

We're looking for a product manager, sys admin, and software  
engineers...$10K referral award
