1. Yes, those properties will ensure that the file is written to the available nodes.
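
For reference, these are client-side settings; a minimal sketch of setting them programmatically with the Hadoop Java client (the class name is illustrative; in practice they are usually loaded from the client's hdfs-site.xml):

    import org.apache.hadoop.conf.Configuration;

    public class PipelineRecoveryConf {
        public static void main(String[] args) {
            // Client-side write-pipeline recovery settings: they must be
            // visible to the process that writes the data (the HDFS client),
            // not only to the NameNode.
            Configuration conf = new Configuration();
            conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.enable", true);
            conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "ALWAYS");
            conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
        }
    }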

2.

BlockManager: defaultReplication = 2

This is the default block replication that you configured on the server
(NameNode). The actual number of replicas can be specified when the file is
created; the default is used only if no replication is specified at create time.
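
As an illustration, a minimal sketch with the Hadoop Java client showing both behaviours (the paths, class name, and values are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Replication passed at create time overrides the default.
            short replication = 2;
            long blockSize = fs.getDefaultBlockSize(new Path("/"));
            FSDataOutputStream out =
                fs.create(new Path("/tmp/explicit"), true, 4096, replication, blockSize);
            out.close();

            // No replication argument: the client's dfs.replication is used.
            fs.create(new Path("/tmp/defaulted")).close();
        }
    }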



3. "dfs.replication" is client(in your case confluent kafka) side property.May 
be,you can cross check this configuration in kafka.
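
One way to cross-check what the client actually resolves (a sketch, assuming the connector's core-site.xml/hdfs-site.xml are on its classpath or in the directory its hadoop.conf.dir setting points at; the class name is illustrative). Note that the hard-coded fallback for dfs.replication is 3, which would explain factor-3 files whenever the writing client never sees your setting:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckClientReplication {
        public static void main(String[] args) throws Exception {
            // Picks up the *client's* core-site.xml/hdfs-site.xml; this is
            // the configuration the writer (Kafka Connect) actually sees.
            Configuration conf = new Configuration();
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));

            FileSystem fs = FileSystem.get(conf);
            System.out.println("default replication = " + fs.getDefaultReplication(new Path("/")));
        }
    }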



-Brahma Reddy Battula
________________________________
From: Nishant Verma <nishant.verma0...@gmail.com>
Sent: Friday, June 30, 2017 7:50 PM
To: common-u...@hadoop.apache.org
Subject: Ensure High Availability of Datanodes in a HDFS cluster


Hi

I have an HDFS cluster with two masters and three datanodes. They are AWS EC2
instances.

I have to test High Availability of the datanodes, i.e., if a datanode dies
during a load run while data is being written to HDFS, there should be no data
loss. The two remaining datanodes that are still alive should take over the
data writes.

I have set the below properties in hdfs-site.xml. dfs.replication = 2 (because
if any one datanode dies, there is no problem meeting the replication factor):

dfs.client.block.write.replace-datanode-on-failure.policy = ALWAYS
dfs.client.block.write.replace-datanode-on-failure.enable = true
dfs.client.block.write.replace-datanode-on-failure.best-effort = true


My questions are:

1 - Does setting the above properties suffice for my datanode High
Availability, or is something else needed?

2 - On DFS service startup, I do see the below INFO messages in the namenode
logs:

2017-06-27 10:51:52,546 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: defaultReplication = 2
2017-06-27 10:51:52,546 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplication = 512
2017-06-27 10:51:52,546 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: minReplication = 1
2017-06-27 10:51:52,546 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: maxReplicationStreams = 2


But I still see that the files being created on HDFS have a replication factor
of 3. Why is that so? This would hurt the High Availability of my datanodes.

-rw-r--r--   3 hadoopuser supergroup     247373 2017-06-29 09:36 
/topics/testTopic/year=2017/month=06/day=29/hour=14/testTopic+210+0001557358+0001557452
-rw-r--r--   3 hadoopuser supergroup       1344 2017-06-29 08:33 
/topics/testTopic/year=2017/month=06/day=29/hour=14/testTopic+228+0001432839+0001432850
-rw-r--r--   3 hadoopuser supergroup       3472 2017-06-29 09:03 
/topics/testTopic/year=2017/month=06/day=29/hour=14/testTopic+228+0001432851+0001432881
-rw-r--r--   3 hadoopuser supergroup       2576 2017-06-29 08:33 
/topics/testTopic/year=2017/month=06/day=29/hour=14/testTopic+23+0001236477+0001236499
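
(For reference, replication on files that were already written can be lowered after the fact; a minimal sketch with the Hadoop Java API, using one of the paths above — the class name is illustrative:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LowerReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Per-file equivalent of "hdfs dfs -setrep 2 <path>"; the
            // NameNode then schedules removal of the excess replicas.
            boolean accepted = fs.setReplication(
                new Path("/topics/testTopic/year=2017/month=06/day=29/hour=14/"
                    + "testTopic+210+0001557358+0001557452"),
                (short) 2);
            System.out.println("setReplication accepted: " + accepted);
        }
    }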


P.S. - My records are written on HDFS by Confluent Kafka Connect HDFS Sink 
Connector.


Thanks

Nishant
