hi folks,

at the moment our architecture looks like this:

the cluster has 4 servers:
 - server1: namenode, secondary namenode, jobtracker, hbase master
 - server2: datanode, tasktracker, hbase regionserver, zookeeper1
 - server3: datanode, tasktracker, hbase regionserver, zookeeper2
 - server4: datanode, tasktracker, hbase regionserver, zookeeper3
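for reference, this layout corresponds roughly to the following config entries (sketched from the topology above; the property names are the standard hadoop/hbase ones, the port is just the CDH default, not something i have double-checked here):

    <!-- core-site.xml: namenode on server1 -->
    <property>
      <name>fs.default.name</name>
      <value>hdfs://server1:8020</value>
    </property>

    <!-- hbase-site.xml: zookeeper ensemble on server2-4 -->
    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>server2,server3,server4</value>
    </property>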

versions:
 - Linux version 2.6.26-2-amd64 (Debian 2.6.26-25lenny1)
 - hadoop-0.20.2-CDH3B4
 - hbase-0.90.1-CDH3B4
 - zookeeper-3.3.2-CDH3B4

hardware:
 - CPU: 2x AMD Opteron(tm) Processor 250 (2.4GHz)
 - disk: 500 GB, software RAID 1 (2x WDC WD5000AAKB-00H8A0, ATA disks)
 - memory: 2 GB
 - network: 1 Gbps Ethernet

as you can see, the servers are currently shared between hdfs / mapreduce / 
hbase / zookeeper.
i can't imagine that this is best practice...

adding further servers is not a problem, but what is a proper architecture 
to achieve the best performance?
is it ok for the datanodes (hdfs) and the tasktrackers (mapreduce) to run on 
the same servers?

should the zookeeper ensemble run on extra dedicated servers? and what about 
the region servers?
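for context, the ensemble is currently spread over server2-4, roughly as in the 
zoo.cfg sketch below (the timing values shown are just the stock ones from the 
zookeeper docs, not necessarily tuned):

    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=server2:2888:3888
    server.2=server3:2888:3888
    server.3=server4:2888:3888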

what about memory?
how much memory should each server have (hdfs, mapred, hbase, zookeeper)?
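to make the question concrete: with 2 GB per box and four daemons each, the heap 
knobs in question are roughly these (the values shown are just the 1000 MB 
defaults, not a recommendation):

    # conf/hadoop-env.sh - heap for each hadoop daemon (datanode, tasktracker, ...)
    export HADOOP_HEAPSIZE=1000

    # conf/hbase-env.sh - heap for the hbase daemons (master, regionserver)
    export HBASE_HEAPSIZE=1000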

the MR jobs on our HBase table are running far too slowly... RowCounter takes 
about 13 minutes for 3,249,727 rows (roughly 4,200 rows/second), which is just 
unacceptable.
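for reference, this is roughly how the benchmark is invoked (the stock RowCounter 
MR job that ships in the hbase jar, as per the hbase book; 'mytable' is a 
placeholder and the jar path just assumes the CDH layout):

    export HADOOP_CLASSPATH=`hbase classpath`
    hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter mytable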

regards
andre
