Hi Manish,

> - why 15+ GBs? Do we allocate all memory to the NameNode? or
> just allocate some number using -Xmx and leave the rest available so
> the machine doesnt start swapping?

We allocated memory using -Xmx. The NameNode stores the HDFS namespace in memory, so the bigger your namespace, the bigger your heap needs to be. My guess is that if you have more than 15 million files with 20 million blocks, you might need such a big system. But again, it's best to see how your NameNode is performing and how much memory it is consuming.

> - why RAID5?
> - If running RAID 5, why is this necessary?

Not absolutely necessary. The NameNode index (metadata) is a critical piece of data; you cannot afford to lose or corrupt it. That is why we have the option of specifying multiple directories, so that separate copies are written in parallel. You can configure the directories to be whatever you like: multiple drives, NFS, ...

> - Configure the name node to store one set of transaction logs on a
> separate disk from the index.
> why?

This feature is not yet supported, but it would be a good one to have. Right now both the transaction logs and the index (I am assuming this means the image) are in the same directory and cannot be configured to be placed in separate directories. We should correct the wiki.

> - Configure the name node to store another set of transaction logs to
> a network mounted disk.
> - why?

As explained above, this is to have multiple copies of your metadata (dfs.name.dir in particular).

> - Do not host DataNode, JobTracker or TaskTracker services on the
> same system.

Typically the DataNode and TaskTracker run on all nodes, while the JobTracker runs on a dedicated node, like the NameNode (SecondaryNameNode). Sometimes a TaskTracker might crash and bring down a node, and you do not want your JobTracker or NameNode to be on that system.

PS: Could you point to the wiki you are referring to? We might need to make some corrections.

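For illustration, keeping multiple copies of the metadata through dfs.name.dir might look roughly like the fragment below in hadoop-site.xml. This is only a sketch: both paths are made-up examples (one local disk, one NFS mount) and should be replaced with whatever storage you actually have.

```xml
<!-- hadoop-site.xml (sketch; both paths below are hypothetical examples) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- Comma-separated list of directories. The NameNode keeps a copy
         of its metadata in each one, so put them on separate devices. -->
    <value>/data/1/dfs/name,/mnt/nfs/dfs/name</value>
  </property>
</configuration>
```

With directories on separate devices, losing or filling one disk still leaves an intact copy in the other.
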
Thanks,
Lohit

----- Original Message ----
From: Manish Shah <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Tuesday, August 12, 2008 11:24:45 AM
Subject: NameNode hardware specs

Can someone help explain in a little more detail some of the reasons for the hardware specs that were recently added to the wiki for the NameNode. I guess i'm interested in learning how others have settled on these specs? Is it by observed behavior, or just recommended by other hadoop users?

- Use a good server with lots (15GB+) of RAM.
  - why 15+ GBs? Do we allocate all memory to the NameNode? or just allocate some number using -Xmx and leave the rest available so the machine doesnt start swapping?
- Consider using fast RAID5 storage for keeping the index.
  - why RAID5?
- List more than one name node directory in the configuration, so that multiple copies of the indices will be stored. As long as the directories are on separate disks, a single full disk will not corrupt the index.
  - If running RAID 5, why is this necessary?
- Configure the name node to store one set of transaction logs on a separate disk from the index.
  - why?
- Configure the name node to store another set of transaction logs to a network mounted disk.
  - why?
- Do not host DataNode, JobTracker or TaskTracker services on the same system.
  - how much memory would the job tracker need? Does it use a lot of CPU? In general, what are good specs for a job tracker machine and can the machine be shared with other services?

Thanks so much for the help. I think it would be hugely helpful for the community to start describing their respective setups for hadoop clusters in more detail than just the config for datanodes and cluster size. I think we all want to be confident that we are spending money on the right machines to grow our cluster the right way.

Most appreciated,

- Manish
Co-Founder
Rapleaf.com

We're looking for a product manager, sys admin, and software engineers...$10K referral award