Lars Francke created HDFS-7575:
----------------------------------

             Summary: NameNode not handling heartbeats properly after HDFS-2832
                 Key: HDFS-7575
                 URL: https://issues.apache.org/jira/browse/HDFS-7575
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Lars Francke


Before HDFS-2832 each DataNode would have a unique storageId which included its 
IP address. Since HDFS-2832 the DataNodes have a unique storageId per storage 
directory which is just a random UUID.

They send reports per storage directory in their heartbeats. This heartbeat is 
processed on the NameNode in the {{DatanodeDescriptor#updateHeartbeatState}} 
method. Pre HDFS-2832 this would just store the information per Datanode. After 
the patch though each DataNode can have multiple different storages so it's 
stored in a map keyed by the storage Id.

This works fine for all clusters that have been installed post HDFS-2832 as 
they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
different keys. On each Heartbeat the Map is searched and updated 
({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):

{code:title=DatanodeStorageInfo}
  void updateState(StorageReport r) {
    capacity = r.getCapacity();
    dfsUsed = r.getDfsUsed();
    remaining = r.getRemaining();
    blockPoolUsed = r.getBlockPoolUsed();
  }
{code}

On clusters that were upgraded from a pre HDFS-2832 version though the storage 
Id has not been rewritten (at least not on the four clusters I checked) so each 
directory will have the exact same storageId. That means there'll be only a 
single entry in the {{storageMap}} and it'll be overwritten by a random 
{{StorageReport}} from the DataNode. This can be seen in the {{updateState}} 
method above. This just assigns the capacity from the received report, instead 
it should probably sum it up per received heartbeat.

The Balancer seems to be one of the only things that actually uses this 
information so it now considers the utilization of a random drive per DataNode 
for balancing purposes.

Things get even worse when a drive has been added or replaced as this will now 
get a new storage Id so there'll be two entries in the storageMap. As new 
drives are usually empty it skewes the balancers decision in a way that this 
node will never be considered over-utilized.

Another problem is that old StorageReports are never removed from the 
storageMap. So if I replace a drive and it gets a new storage Id the old one 
will still be in place and used for all calculations by the Balancer until a 
restart of the NameNode.

I can try providing a patch that does the following:

* Instead of using a Map I could just store the array we receive or instead of 
storing an array sum up the values for reports with the same Id
* On each heartbeat clear the map (so we know we have up to date information)

Does that sound sensible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to