Hi folks,

At the moment our architecture looks like this:
The cluster has 4 servers:
- server1: namenode, secondary namenode, jobtracker, hbase master
- server2: datanode, tasktracker, hbase regionserver, zookeeper1
- server3: datanode, tasktracker, hbase regionserver, zookeeper2
- server4: datanode, tasktracker, hbase regionserver, zookeeper3

Versions:
- Linux 2.6.26-2-amd64 (Debian 2.6.26-25lenny1)
- hadoop-0.20.2-CDH3B4
- hbase-0.90.1-CDH3B4
- zookeeper-3.3.2-CDH3B4

Hardware (per server):
- CPU: 2x AMD Opteron 250 (2.4 GHz)
- Disk: 500 GB, software RAID 1 (2x WDC WD5000AAKB-00H8A0, ATA)
- Memory: 2 GB
- Network: 1 Gbps Ethernet

As you can see, every server is currently shared between HDFS, MapReduce, HBase, and ZooKeeper. I can't imagine that this is best practice. It is not a problem to add further servers, but what would a proper architecture look like to get the best performance?

- Is it OK for the datanodes (HDFS) and the tasktrackers (MapReduce) to run on the same servers?
- Should the ZooKeeper ensemble run on extra, dedicated servers? What about the region servers?
- How much memory should each server have for HDFS, MapReduce, HBase, and ZooKeeper?

The MR jobs on our HBase table are running far too slowly: RowCounter takes about 13 minutes for 3,249,727 rows (roughly 4,000 rows per second), which is just unacceptable.

Regards,
Andre
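PS: For reference, here is a minimal sketch of a RowCounter-style MapReduce job against our table. The table name "mytable" and the caching value of 1000 are placeholders, not our real settings; scanner caching defaults to 1 in hbase-0.90 (one RPC per row), so I am showing explicitly where it gets set:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class SimpleRowCounter {

  // Map-only job: one counter increment per row, no map output.
  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    public enum Counters { ROWS }

    @Override
    protected void map(ImmutableBytesWritable key, Result value,
        Context context) throws IOException, InterruptedException {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "simple-rowcounter");
    job.setJarByClass(SimpleRowCounter.class);

    Scan scan = new Scan();
    scan.setCaching(1000);      // rows fetched per RPC; the default of 1 means one round trip per row
    scan.setCacheBlocks(false); // a full scan should not churn the region server block cache

    TableMapReduceUtil.initTableMapperJob("mytable", scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the default caching of 1, a full scan of our ~3.2M rows pays ~3.2M round trips to the region servers, which by itself could explain a lot of the runtime.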
