To me, the point of replication is to provide a failsafe in case some nodes go down, so running two datanodes on a single host does not serve that basic purpose. As already mentioned, you can run two datanodes using two instances of Hadoop with different ports. If you do not want to replicate at all, you can simply set the replication factor to 1.
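(For reference, the replication factor is a single property in hdfs-site.xml. A minimal sketch, using the 0.20/0.21-era property name:

    <!-- hdfs-site.xml: keep one copy of each block, i.e. no replication -->
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
)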
On Thu, Sep 16, 2010 at 6:15 AM, Matthew Foley <ma...@yahoo-inc.com> wrote:

> Hello Arv,
> It is possible to run multiple datanodes on a single machine, and this
> can be useful for small-scale test scenarios. Also, you mentioned in your
> previous message that you have a Hadoop installation with only one
> physical datanode server and want to replicate within it, between
> spindles. This also makes sense, and will work. Of course, if you have
> two datanodes running you will get only order-2 replication, not order-3,
> even if the replication factor has been set to 3.
>
> I will describe the config in a moment, but I would first like to point
> out that in clusters with even a few datanode servers, one is better off
> with cross-server replication. Without cross-server replication, losing
> the system disk will make ALL data volumes unavailable. And of course,
> multiple datanodes running on one server will compete for cores, NICs,
> bus, and memory access, even if not for spindles.
>
> A previous responder suggested running two namenodes as well, but it
> wasn't clear whether he meant two primaries or one primary and one
> secondary/checkpoint namenode. The latter is fine, but running two
> primary namenodes is definitely not the thing to do!
>
> Anyway, here's how you set it up. I have done this recently with v0.21.0,
> with two datanode processes in a single box (along with a namenode
> sharing the same box), and it did replicate correctly between the two. I
> haven't tried it with more than two datanodes, and I don't know what the
> impact on process efficiency would be, but that would probably work too.
>
> 1. In your HADOOP_HOME directory, copy the "conf" directory to, say,
> "conf2".
>
> 2. In the conf2 directory, edit as follows:
>
> a) In hadoop-env.sh, provide a unique non-default HADOOP_IDENT_STRING,
> e.g. ${USER}_02.
> b) In hdfs-site.xml, change dfs.data.dir to list the desired
> targets/volumes for datanode#2, and of course make sure the corresponding
> target directories exist. Also remove these targets from the dfs.data.dir
> target list for datanode#1 in conf/hdfs-site.xml.
> c) In hdfs-site.xml, set the following four "address:port" strings to
> something that does not conflict with the other datanode or with other
> processes running on this box:
>    - dfs.datanode.address (default 0.0.0.0:50010)
>    - dfs.datanode.ipc.address (default 0.0.0.0:50020)
>    - dfs.datanode.http.address (default 0.0.0.0:50075)
>    - dfs.datanode.https.address (default 0.0.0.0:50475)
> Note: the defaults above are what datanode#1 is probably running on. I
> added 2 to each port number for datanode#2 and it seemed to work okay.
> You might also wish to note the default ports associated with the
> namenode and job/task tracker processes, in case they are running on the
> same box:
>    - fs.default.name (0.0.0.0:9000)
>    - dfs.http.address (0.0.0.0:50070)
>    - dfs.https.address (0.0.0.0:50470)
>    - dfs.secondary.http.address (0.0.0.0:50090)
>    - mapred.job.tracker.http.address (0.0.0.0:50030)
>    - mapred.task.tracker.report.address (127.0.0.1:0)
>    - mapred.task.tracker.http.address (0.0.0.0:50060)
>
> 3. At this point, launching with:
>        bin/hdfs --config $HADOOP_HOME/conf2 datanode
> will work.
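Pulling steps 2(b) and 2(c) together, datanode#2's conf2/hdfs-site.xml would carry entries along these lines. This is only a sketch: the /data2/hdfs path is a placeholder for whichever volume you dedicate to the second datanode, and the ports simply follow Matt's add-2 convention:

    <!-- conf2/hdfs-site.xml fragment for datanode#2
         (sketch; paths and ports are examples, not prescriptions) -->
    <property>
      <name>dfs.data.dir</name>
      <!-- volume reserved for datanode#2; remove it from conf/hdfs-site.xml -->
      <value>/data2/hdfs</value>
    </property>
    <property>
      <name>dfs.datanode.address</name>
      <value>0.0.0.0:50012</value>  <!-- default 50010 + 2 -->
    </property>
    <property>
      <name>dfs.datanode.ipc.address</name>
      <value>0.0.0.0:50022</value>  <!-- default 50020 + 2 -->
    </property>
    <property>
      <name>dfs.datanode.http.address</name>
      <value>0.0.0.0:50077</value>  <!-- default 50075 + 2 -->
    </property>
    <property>
      <name>dfs.datanode.https.address</name>
      <value>0.0.0.0:50477</value>  <!-- default 50475 + 2 -->
    </property>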
> To make it convenient to launch as a service, you can add a couple of
> lines to the end of the bin/start-dfs.sh script, like:
>
>        HADOOP_CONF_DIR2=$HADOOP_HOME/conf2
>        "$HADOOP_COMMON_HOME"/bin/hadoop-daemons.sh --config $HADOOP_CONF_DIR2 \
>            --script "$bin"/hdfs start datanode $dataStartOpt
>
> Hope this helps,
> --Matt
>
> On Sep 15, 2010, at 8:50 AM, Arv Mistry wrote:
>
>> Hi,
>>
>> Is it possible to run multiple data nodes on a single machine? I
>> currently have a machine with multiple disks and enough disk capacity
>> for replication across them. I don't need redundancy at the machine
>> level, but would like to be able to handle a single disk failure.
>>
>> So I was thinking that if I ran multiple DataNodes on a single machine,
>> each assigned a separate disk, that would give me the protection I need
>> against disk failure.
>>
>> Can anyone give me any insights into how I would set up multiple
>> DataNodes to run on a single machine? Thanks in advance,
>>
>> Cheers, Arv
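Once both daemons are up, it is worth confirming that the namenode actually sees two live datanodes. A quick check, assuming the 0.21-style bin/hdfs script Matt uses above (older releases spell the second command bin/hadoop dfsadmin -report):

    # start datanode#2 against the second config directory
    bin/hdfs --config $HADOOP_HOME/conf2 datanode &

    # ask the namenode for a cluster report; it should list two live
    # datanodes, one per configured address/storage directory
    bin/hdfs dfsadmin -report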