[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286527#comment-14286527 ]

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

bq. So, you claim that the current format is a layout, where some storage IDs 
could be the same?... It is clearly specified in the LV -49 that the IDs must 
be distinct.

What's important is what was implemented, not what was written in the comment 
about the layout version.  And what was implemented does allow duplicate 
storage IDs.

bq. Is \[two volumes that map to the same directory\] a practical error? Have 
you seen it in practice?

Yes.  Recently we had a cluster where two datanodes were accidentally connected 
to the same shared storage.  I guess you could argue that lock files should prevent
problems here.  However, I do not like the idea of datanodes modifying VERSION 
on startup at all.  If one of the DNs had terminated before the other one tried 
to lock the directory, the lock attempt would have succeeded.  And with the "retry failed
volume" stuff, we probably have a wide window for this to happen.

bq. We only change a storage ID when it is invalid but not changing the storage 
ID arbitrarily. Valid storage IDs are permanent.

Again, if I accidentally duplicate a directory on a datanode, then the storage 
ID morphs for one of the directories.  That doesn't sound permanent to me.

bq. The code is for validating storage IDs (but not for de-duplication) and is 
very simple. It is good to keep.

I agree that it is good to validate that the storage IDs are unique.  But this is
the same as when we validate that the cluster ID is correct, or the layout 
version is correct.  We don't change incorrect values to "fix" them.  If 
they're wrong then we need to find out why, not sweep the problem under the rug.
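
To be concrete, the kind of handling I have in mind is a fail-fast check rather 
than a rewrite.  A hypothetical sketch (not the actual DataNode code; the class 
and method names are made up):

{code:title=StorageIdValidator}
import java.util.HashSet;
import java.util.Set;

public class StorageIdValidator {
  // Refuse to start if two storage directories carry the same ID,
  // mirroring how a cluster ID or layout version mismatch is treated.
  static void validateDistinct(Iterable<String> storageIds) {
    Set<String> seen = new HashSet<>();
    for (String id : storageIds) {
      if (!seen.add(id)) {
        throw new IllegalStateException("Duplicate storage ID " + id
            + "; refusing to start until the cause is understood");
      }
    }
  }
}
{code}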

Are there any practical arguments in favor of not doing a layout version 
change?  The main argument I see in favor of not changing the layout here is 
basically that this isn't a "big enough" change to merit a new LV.  But that 
seems irrelevant to me; the question is which approach is better for error 
handling and more maintainable.

> NameNode not handling heartbeats properly after HDFS-2832
> ---------------------------------------------------------
>
>                 Key: HDFS-7575
>                 URL: https://issues.apache.org/jira/browse/HDFS-7575
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.4.0, 2.5.0, 2.6.0
>            Reporter: Lars Francke
>            Assignee: Arpit Agarwal
>            Priority: Critical
>         Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, 
> HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, 
> HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, 
> testUpgrade22via24GeneratesStorageIDs.tgz, 
> testUpgradeFrom22GeneratesStorageIDs.tgz, 
> testUpgradeFrom24PreservesStorageId.tgz
>
>
> Before HDFS-2832, each DataNode had a unique storageId which included 
> its IP address. Since HDFS-2832, DataNodes have a unique storageId per 
> storage directory, which is just a random UUID.
> They send reports per storage directory in their heartbeats. This heartbeat 
> is processed on the NameNode in the 
> {{DatanodeDescriptor#updateHeartbeatState}} method. Pre-HDFS-2832, this would 
> just store the information per Datanode. After the patch, though, each DataNode 
> can have multiple different storages, so it's stored in a map keyed by the 
> storage Id.
> This works fine for all clusters that were installed post-HDFS-2832, as 
> they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 
> different keys. On each Heartbeat the Map is searched and updated 
> ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):
> {code:title=DatanodeStorageInfo}
>   void updateState(StorageReport r) {
>     capacity = r.getCapacity();
>     dfsUsed = r.getDfsUsed();
>     remaining = r.getRemaining();
>     blockPoolUsed = r.getBlockPoolUsed();
>   }
> {code}
> On clusters that were upgraded from a pre-HDFS-2832 version, though, the 
> storage Id has not been rewritten (at least not on the four clusters I 
> checked), so each directory will have the exact same storageId. That means 
> there'll be only a single entry in the {{storageMap}} and it'll be 
> overwritten by a random {{StorageReport}} from the DataNode. This can be seen 
> in the {{updateState}} method above. It just assigns the capacity from the 
> received report; instead, it should probably sum the values per received heartbeat.
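> For illustration, here is a minimal, self-contained sketch (class name and 
> values are made up) of how a single shared storage Id collapses the map, in 
> the spirit of the NameNode-side {{storageMap}}:
> {code:title=StorageMapOverwriteDemo}
> import java.util.HashMap;
> import java.util.Map;
>
> public class StorageMapOverwriteDemo {
>   public static void main(String[] args) {
>     // storageId -> reported capacity, mimicking the NameNode-side map
>     Map<String, Long> storageMap = new HashMap<>();
>     // Upgraded DN: all three directories report the same legacy Id.
>     String legacyId = "DS-10.0.0.1-50010";
>     long[] capacities = {100L, 200L, 300L};
>     for (long capacity : capacities) {
>       storageMap.put(legacyId, capacity); // each put overwrites the last
>     }
>     // One entry with capacity 300 survives, not three totalling 600.
>     System.out.println(storageMap); // {DS-10.0.0.1-50010=300}
>   }
> }
> {code}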
> The Balancer seems to be one of the only things that actually uses this 
> information so it now considers the utilization of a random drive per 
> DataNode for balancing purposes.
> Things get even worse when a drive has been added or replaced, as it will 
> then get a new storage Id, so there'll be two entries in the storageMap. As 
> new drives are usually empty, this skews the balancer's decision such that 
> this node will never be considered over-utilized.
> Another problem is that old StorageReports are never removed from the 
> storageMap. So if I replace a drive and it gets a new storage Id, the old one 
> will still be in place and used for all calculations by the Balancer until 
> the NameNode is restarted.
> I can try providing a patch that does the following:
> * Instead of using a Map, just store the array we receive, or sum up the 
> values for reports with the same Id (a rough sketch follows below)
> * On each heartbeat, clear the map (so we know we have up-to-date information)
> Does that sound sensible?
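> For the summing variant, a rough, hypothetical sketch (names simplified, 
> capacity only for brevity) could look like this:
> {code:title=HeartbeatAggregationSketch}
> import java.util.HashMap;
> import java.util.Map;
>
> public class HeartbeatAggregationSketch {
>   // Minimal stand-in for a StorageReport: (storageId, capacity).
>   static class Report {
>     final String storageId;
>     final long capacity;
>     Report(String storageId, long capacity) {
>       this.storageId = storageId;
>       this.capacity = capacity;
>     }
>   }
>
>   // Rebuild the map from scratch on every heartbeat, summing capacities of
>   // reports that share a storage Id; stale entries disappear automatically.
>   static Map<String, Long> aggregate(Report[] reports) {
>     Map<String, Long> fresh = new HashMap<>();
>     for (Report r : reports) {
>       Long prev = fresh.get(r.storageId);
>       fresh.put(r.storageId, (prev == null ? 0L : prev) + r.capacity);
>     }
>     return fresh;
>   }
> }
> {code}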


