Hello Arv,
It is possible to run multiple datanodes on a single machine, and this can be 
useful for small-scale test scenarios.  You also mentioned in your previous 
message that you have a Hadoop deployment with only one physical datanode 
server and want to replicate within it, between spindles.  That also makes 
sense, and will work.  Keep in mind, though, that with only two datanodes 
running you will get a replication factor of at most 2, not 3, even if 
dfs.replication has been set to 3.
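
If you settle on two datanodes, you may also want to set the replication 
factor to 2 explicitly, so the namenode does not flag every block as 
under-replicated.  A minimal sketch for hdfs-site.xml (the value 2 here simply 
matches the two-datanode setup):

    <!-- keep the requested replication factor in line with the number of
         datanodes actually available -->
    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>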

I will describe the config in a moment, but I would first like to point out 
that in clusters with even a few datanode servers, one is better off with 
cross-server replication.  Without it, losing the system disk will make ALL 
data volumes on that machine unavailable.  And of course, multiple datanodes 
running on one server will compete for cores, NICs, bus, and memory access, 
even if not for spindles.

A previous responder suggested running two namenodes also, but it wasn't clear 
whether he meant two primaries or one primary and one secondary/checkpoint 
namenode.  The latter is fine, but running two primary namenodes is 
definitely not the thing to do!

Anyway, here's how you set it up.  I have done this recently with v0.21.0, with 
two datanode processes on a single box (with the namenode sharing the same 
box), and it did replicate correctly between the two.  I haven't tried it with 
> 2 datanodes, and I don't know what the impact on process efficiency would be, 
but that would probably work too.

1. In your HADOOP_HOME directory, copy the "conf" directory to, say, "conf2".
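
For example, assuming a standard 0.21.0 layout where the configuration lives 
directly under $HADOOP_HOME:

    cd $HADOOP_HOME
    cp -r conf conf2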

2. In the conf2 directory, edit as follows:

  a) In hadoop-env.sh, provide a unique, non-default HADOOP_IDENT_STRING, e.g. 
${USER}_02
  b) In hdfs-site.xml, change dfs.data.dir to show the desired targets/volumes 
for datanode#2, and of course make sure the corresponding target directories 
exist.  Also remove these targets from the dfs.data.dir target list for 
datanode#1 in conf/hdfs-site.xml.
  c) In hdfs-site.xml, set the following four "address:port" strings to 
something that does not conflict with the other datanode or with other 
processes running on this box:
    - dfs.datanode.address  (default 0.0.0.0:50010)
    - dfs.datanode.ipc.address  (default 0.0.0.0:50020)
    - dfs.datanode.http.address  (default 0.0.0.0:50075)
    - dfs.datanode.https.address  (default 0.0.0.0:50475)
Note: the defaults above are what datanode#1 is probably running on.  I added 2 
to each port number for datanode#2 and it seemed to work okay (see the sketch 
of conf2/hdfs-site.xml after this list).  You might also 
wish to note the default ports associated with the namenode and job/task 
tracker processes, in case they are running on the same box:
    - fs.default.name  0.0.0.0:9000
    - dfs.http.address  0.0.0.0:50070
    - dfs.https.address  0.0.0.0:50470
    - dfs.secondary.http.address  0.0.0.0:50090
    - mapred.job.tracker.http.address  0.0.0.0:50030
    - mapred.task.tracker.report.address  127.0.0.1:0
    - mapred.task.tracker.http.address  0.0.0.0:50060
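
Putting 2b and 2c together, conf2/hdfs-site.xml might look roughly like the 
sketch below.  The /data3 and /data4 directories are just placeholders for 
whatever volumes you give datanode#2, and the ports are the defaults plus 2:

    <configuration>
      <!-- volumes reserved for datanode#2; remove these from conf/hdfs-site.xml -->
      <property>
        <name>dfs.data.dir</name>
        <value>/data3/dfs/data,/data4/dfs/data</value>
      </property>
      <!-- shift each default port by +2 so datanode#2 does not collide with datanode#1 -->
      <property>
        <name>dfs.datanode.address</name>
        <value>0.0.0.0:50012</value>
      </property>
      <property>
        <name>dfs.datanode.ipc.address</name>
        <value>0.0.0.0:50022</value>
      </property>
      <property>
        <name>dfs.datanode.http.address</name>
        <value>0.0.0.0:50077</value>
      </property>
      <property>
        <name>dfs.datanode.https.address</name>
        <value>0.0.0.0:50477</value>
      </property>
    </configuration>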

3. At this point, launching with:
    bin/hdfs --config $HADOOP_HOME/conf2 datanode
will work.  To make it convenient to launch as a service, you can add a couple 
of lines to the end of the bin/start-dfs.sh script, like:
    HADOOP_CONF_DIR2=$HADOOP_HOME/conf2
    "$HADOOP_COMMON_HOME"/bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR2 
--script "$bin"/hdfs start datanode $dataStartOpt
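
Once both are running, you can confirm they came up and registered with the 
namenode.  Assuming jps (from the JDK) is on your PATH:

    jps | grep DataNode         # should list two DataNode processes
    bin/hdfs dfsadmin -report   # should report 2 live datanodes

The namenode web UI on the dfs.http.address port (50070 by default) also lists 
the live datanodes.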

Hope this helps,
--Matt

On Sep 15, 2010, at 8:50 AM, Arv Mistry wrote:

Hi,

Is it possible to run multiple data nodes on a single machine? I
currently have a machine with multiple disks and enough disk capacity
for replication across them. I don't need redundancy at the machine
level but would like to be able to handle a single disk failure.

So I was thinking if I can run multiple DataNodes on a single machine
each assigned a separate disk that would give me the protection I need
against disk failure.

Can anyone give me any insights into how I would set up multiple
DataNodes to run on a single machine? Thanks in advance,

Cheers Arv
