[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288285#comment-14288285 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:
-------------------------------------------

Yes. It is great that we already have the log.

+1 on HDFS-7575.05.patch

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, testUpgrade22via24GeneratesStorageIDs.tgz, testUpgradeFrom22GeneratesStorageIDs.tgz, testUpgradeFrom24PreservesStorageId.tgz

Before HDFS-2832 each DataNode would have a unique storageId which included its IP address. Since HDFS-2832 the DataNodes have a unique storageId per storage directory, which is just a random UUID. They send reports per storage directory in their heartbeats. This heartbeat is processed on the NameNode in the {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would just store the information per DataNode. After the patch, though, each DataNode can have multiple different storages, so it's stored in a map keyed by the storage Id.

This works fine for all clusters that have been installed post HDFS-2832, as they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 different keys. On each heartbeat the map is searched and updated ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}):

{code:title=DatanodeStorageInfo}
void updateState(StorageReport r) {
  capacity = r.getCapacity();
  dfsUsed = r.getDfsUsed();
  remaining = r.getRemaining();
  blockPoolUsed = r.getBlockPoolUsed();
}
{code}

On clusters that were upgraded from a pre HDFS-2832 version, though, the storage Id has not been rewritten (at least not on the four clusters I checked), so each directory will have the exact same storageId. That means there'll be only a single entry in the {{storageMap}} and it'll be overwritten by a random {{StorageReport}} from the DataNode. This can be seen in the {{updateState}} method above. This just assigns the capacity from the received report; instead it should probably sum it up per received heartbeat.

The Balancer seems to be one of the only things that actually uses this information, so it now considers the utilization of a random drive per DataNode for balancing purposes.

Things get even worse when a drive has been added or replaced, as this will now get a new storage Id, so there'll be two entries in the storageMap. As new drives are usually empty, it skews the Balancer's decision in a way that this node will never be considered over-utilized.

Another problem is that old StorageReports are never removed from the storageMap. So if I replace a drive and it gets a new storage Id, the old one will still be in place and used for all calculations by the Balancer until a restart of the NameNode.

I can try providing a patch that does the following:
* Instead of using a Map I could just store the array we receive, or instead of storing an array sum up the values for reports with the same Id
* On each heartbeat clear the map (so we know we have up to date information)

Does that sound sensible?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
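To make the overwrite described in the issue concrete, here is a minimal Java sketch. It is not the HDFS code: `StorageMapDemo`, `Report`, `overwrite`, `sumPerId` and `totalCapacity` are hypothetical names, the report carries only two of the four fields, and the storage Id `DS-legacy` is made up. It assumes a DataNode with two directories that still share one storage Id after an upgrade, and contrasts plain replacement in the map with the "sum up the values for reports with the same Id" idea from the proposal.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StorageMapDemo {

    /** Hypothetical, simplified stand-in for a per-directory storage report. */
    static final class Report {
        final String storageId;
        final long capacity;
        final long dfsUsed;

        Report(String storageId, long capacity, long dfsUsed) {
            this.storageId = storageId;
            this.capacity = capacity;
            this.dfsUsed = dfsUsed;
        }
    }

    /** Mimics the reported behaviour: a later report replaces an earlier one with the same Id. */
    static Map<String, Report> overwrite(List<Report> heartbeat) {
        Map<String, Report> storageMap = new HashMap<>();
        for (Report r : heartbeat) {
            storageMap.put(r.storageId, r);
        }
        return storageMap;
    }

    /** Proposed alternative: build a fresh map per heartbeat and sum reports sharing an Id. */
    static Map<String, Report> sumPerId(List<Report> heartbeat) {
        Map<String, Report> storageMap = new HashMap<>();
        for (Report r : heartbeat) {
            storageMap.merge(r.storageId, r, (a, b) ->
                new Report(a.storageId, a.capacity + b.capacity, a.dfsUsed + b.dfsUsed));
        }
        return storageMap;
    }

    /** Total capacity the NameNode side would see for this DataNode. */
    static long totalCapacity(Map<String, Report> storageMap) {
        long total = 0;
        for (Report r : storageMap.values()) {
            total += r.capacity;
        }
        return total;
    }

    public static void main(String[] args) {
        // Two drives of 100 units each that share one (made-up) legacy storage Id.
        List<Report> heartbeat = Arrays.asList(
            new Report("DS-legacy", 100L, 90L),
            new Report("DS-legacy", 100L, 10L));
        System.out.println("overwrite: " + totalCapacity(overwrite(heartbeat)));
        System.out.println("sumPerId:  " + totalCapacity(sumPerId(heartbeat)));
    }
}
```

With the overwriting map only one drive's 100 units are counted; with summing, both drives contribute and the total is 200. Rebuilding the map on every heartbeat also drops entries for drives that no longer report, which is the second point of the proposal.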
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288260#comment-14288260 ]

Arpit Agarwal commented on HDFS-7575:
-------------------------------------

Thanks Colin.

bq. We should log the old (invalid) storage id.

Hi Nicholas, we are already doing so in the v05 patch. In {{createStorageID}}:

{code}
LOG.info("Generated new storageID " + sd.getStorageUuid() +
    " for directory " + sd.getRoot() +
    (oldStorageID == null ? "" : (" to replace " + oldStorageID)));
{code}

Is this what you were looking for?
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14287999#comment-14287999 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:
-------------------------------------------

bq. I don't think it's productive to argue about whether this represents a true layout version change or whether it is layout version changey enough. Clearly we both agree that doing an LV change here would work and solve the problem. At the end of the day, we have to make the decision based on which way is more maintainable.

You seem to be suggesting that even if there is no layout format change, it is good to update the layout version because of the bug. Is that correct?

bq. ... It helps by not harming ...

I guess you mean it actually does not help at all. It only shows that the cause of the duplication problem is not from this bug.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288071#comment-14288071 ]

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

I looked at patch 005 more carefully, and now I can see that it only ever modifies storage IDs when the ID can't be parsed as a UUID. So this should really only have an effect with storageIDs generated by pre-upgraded clusters.

The other nice thing about patch 005 is that it can easily be backported to 2.6.1, and it will be quicker to upgrade because it doesn't involve a LV change.

So on reconsideration, I am +1 for patch 005 (the latest version).
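The rule Colin describes, "only modify a storage ID when it can't be parsed as a UUID", can be sketched roughly as follows. This is an illustration under assumptions, not the code from patch 005: `StorageIdCheck`, `isValidStorageId` and `validateOrRegenerate` are hypothetical names, and the `DS-` prefix and the legacy ID used in `main` are assumed for the example because the thread does not spell out the on-disk format.

```java
import java.util.UUID;

class StorageIdCheck {

    // Assumed prefix, for illustration only.
    static final String PREFIX = "DS-";

    /** Returns true if the ID is the assumed prefix followed by something that parses as a UUID. */
    static boolean isValidStorageId(String id) {
        if (id == null || !id.startsWith(PREFIX)) {
            return false;
        }
        try {
            // UUID.fromString throws IllegalArgumentException (including its
            // subclass NumberFormatException) when the string is not a UUID.
            UUID.fromString(id.substring(PREFIX.length()));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Keeps a valid ID untouched; generates a fresh random one only for an invalid ID. */
    static String validateOrRegenerate(String oldId) {
        return isValidStorageId(oldId) ? oldId : PREFIX + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String legacy = "DS-1234567-10.0.0.1-50010-1234567890"; // made-up legacy-style ID
        String modern = PREFIX + UUID.randomUUID();
        System.out.println(legacy + " valid? " + isValidStorageId(legacy));
        System.out.println(modern + " valid? " + isValidStorageId(modern));
    }
}
```

Because a valid ID is returned unchanged, running the check on every startup has no effect once all directories carry UUID-based IDs. Note that `UUID.fromString` is somewhat lenient about the lengths of the individual hex groups, so a stricter check might be needed in practice.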
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286169#comment-14286169 ]

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

This patch does change the layout format. It changes it from one where storage ID may or may not be unique to one where it definitely is.

Can you respond to the practical points I made above? I made a few points that nobody has responded to yet.
* Changing the storage ID during startup basically changes storage ID from being a permanent identifier to a temporary one... makes persisting this later impossible. It commits us to an architecture where block locations can't be persisted.
* With approach #1, we have to carry the burden of the dedupe code forever.
* Approach #1 degrades error handling. If you somehow end up with two volumes that map to the same directory, the code silently does the wrong thing.

I would appreciate a response to these.

thanks
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286051#comment-14286051 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:
-------------------------------------------

bq. ... we have bumped the layout version in the past even when the old software could handle the new layout. ...

For HDFS-6482, it does change layout format. So bumping layout version makes sense. However, the patch here does not change layout format. Disagree?
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286230#comment-14286230 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:
-------------------------------------------

bq. Can you respond to the practical points I made above?

If there is no layout format change, the practical points seem irrelevant. Anyway, let me comment on them.

bq. Changing the storage ID during startup basically changes storage ID from being a permanent identifier to a temporary one...

We only change a storage ID when it is invalid; we do not change storage IDs arbitrarily. Valid storage IDs are permanent.

bq. With approach #1, we have to carry the burden of the dedupe code forever.

The code is for validating storage IDs (but not for de-duplication) and is very simple. It is good to keep.

bq. ... If you somehow end up with two volumes that map to the same directory, the code silently does the wrong thing.

Is this a practical error? Have you seen it in practice?
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286217#comment-14286217 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:
-------------------------------------------

bq. This patch does change the layout format. It changes it from one where storage ID may or may not be unique to one where it definitely is.

So, you claim that the current format is a layout where some storage IDs could be the same?

{code}
ADD_DATANODE_AND_STORAGE_UUIDS(-49, "Replace StorageID with DatanodeUuid."
    + " Use distinct StorageUuid per storage directory."),
{code}

It is clearly specified in LV -49 that the IDs must be distinct.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286583#comment-14286583 ] Tsz Wo Nicholas Sze commented on HDFS-7575: --- For the so called practical points you made (say, Again, if I accidentally duplicate a directory on a datanode, ...) , how could updating layout version help? NameNode not handling heartbeats properly after HDFS-2832 - Key: HDFS-7575 URL: https://issues.apache.org/jira/browse/HDFS-7575 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.4.0, 2.5.0, 2.6.0 Reporter: Lars Francke Assignee: Arpit Agarwal Priority: Critical Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch, HDFS-7575.03.binary.patch, HDFS-7575.03.patch, HDFS-7575.04.binary.patch, HDFS-7575.04.patch, HDFS-7575.05.binary.patch, HDFS-7575.05.patch, testUpgrade22via24GeneratesStorageIDs.tgz, testUpgradeFrom22GeneratesStorageIDs.tgz, testUpgradeFrom24PreservesStorageId.tgz Before HDFS-2832 each DataNode would have a unique storageId which included its IP address. Since HDFS-2832 the DataNodes have a unique storageId per storage directory which is just a random UUID. They send reports per storage directory in their heartbeats. This heartbeat is processed on the NameNode in the {{DatanodeDescriptor#updateHeartbeatState}} method. Pre HDFS-2832 this would just store the information per Datanode. After the patch though each DataNode can have multiple different storages so it's stored in a map keyed by the storage Id. This works fine for all clusters that have been installed post HDFS-2832 as they get a UUID for their storage Id. So a DN with 8 drives has a map with 8 different keys. 
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286527#comment-14286527 ]

Colin Patrick McCabe commented on HDFS-7575:

bq. So, you claim that the current format is a layout, where some storage IDs could be the same?... It is clearly specified in the LV -49 that the IDs must be distinct.

What's important is what was implemented, not what was written in the comment about the layout version. And what was implemented does allow duplicate storage IDs.

bq. Is \[two volumes that map to the same directory\] a practical error? Have you seen it in practice?

Yes. Recently we had a cluster with two datanodes accidentally connected to the same shared storage. I guess you could argue that lock files should prevent problems here. However, I do not like the idea of datanodes modifying VERSION on startup at all. If one of the DNs had terminated before the other one tried to lock the directory, the lock would have succeeded. And with the retry-failed-volume stuff, we probably have a wide window for this to happen.

bq. We only change a storage ID when it is invalid but not changing the storage ID arbitrarily. Valid storage IDs are permanent.

Again, if I accidentally duplicate a directory on a datanode, then the storage ID morphs for one of the directories. That doesn't sound permanent to me.

bq. The code is for validating storage IDs (but not for de-duplication) and is very simple. It is good to keep.

I agree that it is good to validate that the storage IDs are unique. But this is the same as when we validate that the cluster ID is correct, or the layout version is correct. We don't change incorrect values to fix them. If they're wrong then we need to find out why, not sweep the problem under the rug.

Are there any practical arguments in favor of not doing a layout version change?
The main argument I see in favor of not changing the layout here is basically that this isn't a big enough change to merit a new LV. But that seems irrelevant to me -- the question is which approach is better for error handling and more maintainable.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286574#comment-14286574 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:

bq. What's important is what was implemented, not what was written in the comment about the layout version. And what was implemented does allow duplicate storage IDs.

I disagree. The implementation is a bug -- it is supposed to change the old IDs (in the old ID format) to use the new UUID format. The entire heterogeneous storage design requires storage IDs to be unique. Which implementation works correctly with duplicate storage IDs?
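The distinction between old-format and new-format storage IDs can be checked mechanically. The following is only a sketch of such a check, not the code in any of the attached patches: the class name {{StorageIdFormat}}, the {{DS-}} prefix, and the sample old-style ID are assumptions made for illustration. The issue description only states that new IDs are random UUIDs and that old IDs included the DataNode's IP address.

```java
import java.util.UUID;

/** Sketch only: tells a UUID-based storage ID apart from anything else. */
final class StorageIdFormat {
    // Assumption for illustration: a new-style ID is this prefix followed by a UUID.
    static final String PREFIX = "DS-";

    /** Generates a fresh ID in the assumed new format. */
    static String generate() {
        return PREFIX + UUID.randomUUID();
    }

    /** Returns true only if the ID is PREFIX followed by a 36-character textual UUID. */
    static boolean isValid(String storageId) {
        if (storageId == null || !storageId.startsWith(PREFIX)) {
            return false;
        }
        String rest = storageId.substring(PREFIX.length());
        if (rest.length() != 36) { // length of the canonical textual UUID form
            return false;
        }
        try {
            UUID.fromString(rest);
            return true;
        } catch (IllegalArgumentException e) {
            return false; // not parseable as a UUID
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid(generate()));
        // Hypothetical old-style ID that embeds an IP address and port.
        System.out.println(isValid("DS-1234567890-10.0.0.1-50010-1400000000000"));
    }
}
```

A check of this kind only classifies an ID. Whether an invalid ID is then rewritten or merely reported is exactly the policy question under discussion here.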
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286637#comment-14286637 ]

Colin Patrick McCabe commented on HDFS-7575:

bq. I disagree. The implementation is a bug – it supposes to change the old ids (in old id format) to use the new uuid format. The entire heterogeneous storage design requires storage ID to be unique. Which implementation works correctly with the duplicate storage IDs?

I don't think it's productive to argue about whether this represents a true layout version change or whether it is "layout version changey" enough. Clearly we both agree that doing an LV change here would work and solve the problem. At the end of the day, we have to make the decision based on which way is more maintainable.

This does bring up a practical point, though. It will be easier to backport the "silently modify the VERSION file" patch to 2.6.1 than the LV change. In view of this, I think it's fine to backport the "silently change VERSION" fix to 2.6.1. I just don't want to have to support it forever in 3.0 and onward.

bq. For the so called practical points you made (say, Again, if I accidentally duplicate a directory on a datanode, ...) , how could updating layout version help?

If we check for directories with duplicate storage IDs and exclude them, then the system administrator becomes aware that there is a problem. It helps by not harming -- by not changing the VERSION file when we don't know for sure why the VERSION file is wrong.
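To make the "check and exclude" idea concrete, here is a minimal sketch under stated assumptions. It is not taken from any attached patch; the class {{DuplicateStorageIdCheck}}, its {{Result}} type, and the input map (volume directory to the storage ID read from that directory's VERSION file) are hypothetical names chosen for illustration. Nothing is rewritten: volumes sharing an ID are set aside so they can be reported.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: separates volumes with a unique storage ID from volumes that share one. */
final class DuplicateStorageIdCheck {

    /** Outcome of the check; nothing on disk is modified. */
    static final class Result {
        /** Volume directories whose storage ID is unique. */
        final List<String> usable = new ArrayList<>();
        /** Storage ID -> all volume directories that claim it (two or more). */
        final Map<String, List<String>> excludedById = new LinkedHashMap<>();
    }

    /** @param idByVolume hypothetical input: volume directory -> storage ID from its VERSION file */
    static Result check(Map<String, String> idByVolume) {
        // Invert the mapping so that all volumes claiming the same ID end up together.
        Map<String, List<String>> volumesById = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : idByVolume.entrySet()) {
            volumesById.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        Result result = new Result();
        for (Map.Entry<String, List<String>> e : volumesById.entrySet()) {
            if (e.getValue().size() == 1) {
                result.usable.add(e.getValue().get(0));
            } else {
                result.excludedById.put(e.getKey(), e.getValue()); // report, do not repair
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> ids = new LinkedHashMap<>();
        ids.put("/data/1", "DS-a");
        ids.put("/data/2", "DS-a");
        ids.put("/data/3", "DS-b");
        Result r = check(ids);
        System.out.println("usable:   " + r.usable);
        System.out.println("excluded: " + r.excludedById);
    }
}
```

The design choice illustrated is that a duplicate is treated as a configuration error to surface, in the same way a wrong cluster ID would be, rather than as something to correct silently.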
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284681#comment-14284681 ]

Colin Patrick McCabe commented on HDFS-7575:

So there are two approaches here:
1. silently (i.e., without user intervention) dedupe duplicate storage IDs when starting up the DataNode
2. create a new DataNode layout version and dedupe duplicate storage IDs during the upgrade.

Arguments in favor of approach #1:
* Collisions might happen that we need to dedupe repeatedly. This argument seems specious since the probability is effectively less than the chance of cosmic rays causing errors (as Nicholas pointed out). I think the probabilities outlined here make this argument a non-starter: https://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates. Also, approach #1 only dedupes on a single datanode, but there can be many datanodes in the cluster.
* As Suresh pointed out, the old software can easily handle cases where the Storage IDs are unique. So using a new layout version is not required to flip back and forth between old and new software. While this is true, we have bumped the layout version in the past even when the old software could handle the new layout. For example, HDFS-6482 added a new DN layout version even though the old software could use the new blockid-based layout. So this argument is basically just saying approach #1 is viable. But it doesn't tell us whether approach #1 is a good idea.
* Nobody has made this argument yet, but you could argue that the upgrade process will be faster with approach #1 than approach #2. However, we've done datanode layout version upgrades on production clusters in the past and time hasn't been an issue. The JNI hardlink code (and soon, the Java7 hardlink code) eliminated the long delays that resulted from spawning shell commands. So I don't think this argument is persuasive.
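For a rough sense of scale on the collision argument, the standard birthday approximation can be applied. This is a back-of-the-envelope estimate, assuming the storage IDs are version-4 random UUIDs with 122 random bits; the cluster size n = 10^6 storage directories is an illustrative assumption, not a figure from this issue.

```latex
% n storage IDs, each drawn uniformly from 2^{122} possible values
P(\text{collision}) \;=\; 1 - \prod_{k=0}^{n-1}\left(1 - \frac{k}{2^{122}}\right)
\;\approx\; \frac{n(n-1)}{2 \cdot 2^{122}} \;\approx\; \frac{n^{2}}{2^{123}}

% Illustrative assumption: n = 10^{6}
P \;\approx\; \frac{10^{12}}{1.06 \times 10^{37}} \;\approx\; 9.4 \times 10^{-26}
```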
Arguments in favor of approach #2:
* Changing the storage ID during startup basically changes the storage ID from being a permanent identifier to a temporary one. This seems like a small change, but I would argue that it's really a big one, architecturally. For example, suppose we wanted to persist this information at some point. We couldn't really do that if it's changing all the time.
* With approach #1, we have to carry the burden of the dedupe code forever. We can't ever stop deduping, even in Hadoop 3.0, because for all we know, the user has just upgraded and was previously running 2.6 (a version with the bug), which we will have to correct. The extra run time isn't an issue, but the complexity is. What if our write to VERSION fails on one of the volume directories? What do we do then? And then if volume failures are tolerated, this directory could later come back and be an issue. The purpose of layout versions is so that we don't have to think about these kinds of mix-and-match issues.
* Approach #1 leaves us open to some weird scenarios. For example, what if I have /storage1 -> /foo and /storage2 -> /foo? In other words, you have what appears to be two volume root directories, but it's really the same directory. Approach #2 will complain, but approach #1 will happily rename the storageID of the /foo directory and continue with the corrupt configuration. This is what happens when you fudge error checking.

So in conclusion I would argue for approach #2. Thoughts?
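The "two volume roots, one directory" scenario can be detected before any storage ID is read. The sketch below is an illustration only and makes several assumptions: the class {{VolumeAliasCheck}} and its method are hypothetical, the configured roots are assumed to exist, and resolving them with {{Path#toRealPath}} is assumed to be an acceptable definition of "the same directory" (it follows symlinks and removes ".." segments).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: finds configured volume roots that resolve to the same directory. */
final class VolumeAliasCheck {

    /**
     * Groups the configured roots by their resolved real path and returns only
     * the groups that contain more than one configured root.
     */
    static Map<Path, List<Path>> findAliases(List<Path> configuredRoots) {
        Map<Path, List<Path>> byRealPath = new LinkedHashMap<>();
        for (Path root : configuredRoots) {
            Path real;
            try {
                real = root.toRealPath(); // follows symlinks, resolves ".." segments
            } catch (IOException e) {
                throw new UncheckedIOException("cannot resolve " + root, e);
            }
            byRealPath.computeIfAbsent(real, k -> new ArrayList<>()).add(root);
        }
        // Keep only real directories that are reachable through two or more roots.
        byRealPath.values().removeIf(roots -> roots.size() < 2);
        return byRealPath;
    }

    public static void main(String[] args) {
        // Usage: pass volume root directories as arguments.
        List<Path> roots = new ArrayList<>();
        for (String arg : args) {
            roots.add(Paths.get(arg));
        }
        findAliases(roots).forEach((real, aliases) ->
            System.out.println(real + " is configured more than once: " + aliases));
    }
}
```

A non-empty result would be a reason to refuse to start or to exclude the affected volumes, rather than to assign one of the aliases a new storage ID.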
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285048#comment-14285048 ]

Colin Patrick McCabe commented on HDFS-7575:

bq. Layout version defines layout format but not the software (don't confuse it with the software version). The question here is whether there is a layout format change here. Are we changing from a layout, where some storage IDs could be the same, to a new layout, where all storage IDs have to be distinct? I think the answer is no since the same storage ID does not work even using the old software.

Nicholas, I already addressed that in my comment. I wrote: "using a new layout version is not required to flip back and forth between old and new software. While this is true, we have bumped the layout version in the past even when the old software could handle the new layout." Do you have any thoughts about the other points I mentioned?
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284937#comment-14284937 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:

Layout version defines layout format but not the software (don't confuse it with the software version). The question here is whether there is a layout format change here. Are we changing from a layout, where some storage IDs could be the same, to a new layout, where all storage IDs have to be distinct? I think the answer is no since the same storage ID does not work even using the old software.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285135#comment-14285135 ]

Colin Patrick McCabe commented on HDFS-7575:

Just to be clear, I'd like to see some discussion of the points above before we commit this.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284973#comment-14284973 ]

Tsz Wo Nicholas Sze commented on HDFS-7575:

We should log the old (invalid) storage id. +1 on HDFS-7575.05.patch other than that.
On each Heartbeat the Map is searched and updated ({{DatanodeStorageInfo storage = storageMap.get(s.getStorageID());}}): {code:title=DatanodeStorageInfo} void updateState(StorageReport r) { capacity = r.getCapacity(); dfsUsed = r.getDfsUsed(); remaining = r.getRemaining(); blockPoolUsed = r.getBlockPoolUsed(); } {code} On clusters that were upgraded from a pre HDFS-2832 version though the storage Id has not been rewritten (at least not on the four clusters I checked) so each directory will have the exact same storageId. That means there'll be only a single entry in the {{storageMap}} and it'll be overwritten by a random {{StorageReport}} from the DataNode. This can be seen in the {{updateState}} method above. This just assigns the capacity from the received report, instead it should probably sum it up per received heartbeat. The Balancer seems to be one of the only things that actually uses this information so it now considers the utilization of a random drive per DataNode for balancing purposes. Things get even worse when a drive has been added or replaced as this will now get a new storage Id so there'll be two entries in the storageMap. As new drives are usually empty it skewes the balancers decision in a way that this node will never be considered over-utilized. Another problem is that old StorageReports are never removed from the storageMap. So if I replace a drive and it gets a new storage Id the old one will still be in place and used for all calculations by the Balancer until a restart of the NameNode. I can try providing a patch that does the following: * Instead of using a Map I could just store the array we receive or instead of storing an array sum up the values for reports with the same Id * On each heartbeat clear the map (so we know we have up to date information) Does that sound sensible? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14280522#comment-14280522 ] Arpit Agarwal commented on HDFS-7575: - It would be good to get this bug fixed instead of letting it fall off the radar on the layout change technicality. IMO either approach is better than leaving the bug unfixed. Please vote either +1 or -1 on either approach so we have more clarity. Thanks, Arpit.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279309#comment-14279309 ] Hadoop QA commented on HDFS-7575: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692337/HDFS-7575.05.patch against trunk revision ce29074. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.balancer.TestBalancerWithMultipleNameNodes org.apache.hadoop.hdfs.TestDatanodeStartupFixesLegacyStorageIDs The following test timeouts occurred in hadoop-hdfs-project/hadoop-hdfs: org.apache.hadoop.hdfs.server.namenode.ha.TestRetryCacheWithHA Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9223//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9223//console This message is automatically generated. 
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277895#comment-14277895 ] Daryn Sharp commented on HDFS-7575: --- bq. I think it's frustrating for storage IDs to change without warning just because HDFS was restarted. It will make diagnosing problems by reading log files harder because storageIDs might morph at any time. It also sets a bad precedent of not allowing downgrade and modifying VERSION files on the fly during startup. I'm confused. StorageIDs aren't going to repeatedly morph - unless there's a UUID collision that you argue can't happen. The important part is you always want unique storage ids. It's an internal default of hdfs that is not up to the user to assign. Succinctly stated, what I'd like is for storage ids to be generated if missing, re-generated if incorrectly formatted, or if there are dups. I think the latest patch actually does the first two, just not the dup check. bq. I'm surprised to hear you say that rollback should not be an option. It seems like the conservative thing to do here is to allow the user to restore to the VERSION file. Obviously we believe there will be no problems. But we always believe that, or else we wouldn't have made the change. Sometimes there are problems. I didn't say that. Rollback is for reverting an incompatible change. Changing the storage id is not incompatible. Unique ids are the default for newly formatted nodes. If you think unique storage ids may have subtle bugs (different than shared storage ids), then new clusters or newly formatted nodes are buggy. 
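The startup policy described in this comment (generate a storage ID if it is missing, re-generate it if it is incorrectly formatted or duplicated) could be sketched roughly as follows. This is only an illustration, not the code in HDFS-7575.05.patch: the class and method names are hypothetical, and the "DS-" prefix followed by a canonical UUID is assumed here as the well-formed shape. The old ID is printed before it is replaced, in line with the request to log the old (invalid) storage ID.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

// Hypothetical helper; not part of the Hadoop code base.
final class StorageIdSanityCheck {
    // Assumption: a well-formed storage ID is this prefix plus a canonical UUID.
    static final String PREFIX = "DS-";

    static boolean isWellFormed(String id) {
        if (id == null || !id.startsWith(PREFIX)) {
            return false;
        }
        String body = id.substring(PREFIX.length());
        try {
            // Round-trip comparison, because UUID.fromString accepts some
            // non-canonical strings.
            return UUID.fromString(body).toString().equals(body);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Returns one ID per input entry; missing, malformed or duplicate
    // entries are replaced by freshly generated IDs.
    static List<String> sanitize(List<String> ids) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>();
        for (String id : ids) {
            String candidate = id;
            // Set.add returns false when the ID was already seen (a duplicate).
            while (!isWellFormed(candidate) || !seen.add(candidate)) {
                System.out.println("Replacing invalid or duplicate storage ID: " + candidate);
                candidate = PREFIX + UUID.randomUUID();
            }
            result.add(candidate);
        }
        return result;
    }
}
```

Well-formed, unique IDs pass through unchanged, so repeated restarts would not alter them under this scheme.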
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277794#comment-14277794 ] Arpit Agarwal commented on HDFS-7575: - I prefer a layout version bump per my original patch, if for no other reason than the fact that the DataNode upgrade path is complicated enough without having to think about OOB metadata changes. In this case the metadata change is limited so I'd be okay with making the exception.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277795#comment-14277795 ] Tsz Wo Nicholas Sze commented on HDFS-7575: --- {quote} BTW, UUID.randomUUID isn't guaranteed to return a unique id. It's highly improbable, but possible, although more likely due to older storages, user copying a storage, etc. Although the storage ids are unique after the upgrade, if a disk is moved from one node to another, then a collision is possible. Hence another reason why I feel explicitly checking for collisions at startup should always be done. UUIDs are designed to be globally unique with a high probability when generated by trusted processes. Even when the volume of generated UUIDs is very high, which is certainly not the case for storage IDs. The probability of a storageID collision in normal operation is vanishingly small. https://en.wikipedia.org/wiki/Universally_unique_identifier#Random_UUID_probability_of_duplicates {quote} We usually compare the probability of collision with hardware failure probability, or using the famous cosmic ray argument (http://stackoverflow.com/questions/2580933/cosmic-rays-what-is-the-probability-they-will-affect-a-program), since we can never do better than that. {quote} ... Up until HDFS-4645, HDFS used randomly generated block IDs drawn from a far smaller space-- 2^64 – and we never had a problem. ... {quote} We did have collision check for random block IDs. 
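To put rough numbers on the collision argument quoted above, the birthday approximation p ≈ n(n-1) / (2 * 2^b) can be evaluated for n IDs drawn from b random bits. The sketch below is only an estimate under stated assumptions: the storage count of one million is hypothetical, and 122 is the number of random bits in a version 4 UUID (the kind returned by {{UUID.randomUUID}}), slightly fewer than the 128 bits of the full value.

```java
// Hypothetical helper for a back-of-the-envelope estimate.
final class UuidCollisionEstimate {
    // Birthday approximation; only meaningful while the result is much smaller than 1.
    static double collisionProbability(double idCount, int randomBits) {
        return idCount * (idCount - 1) / (2.0 * Math.pow(2.0, randomBits));
    }

    public static void main(String[] args) {
        double storages = 1e6; // assumed number of storage directories in a cluster
        System.out.printf("random UUID (122 bits): %.3e%n",
                collisionProbability(storages, 122));
        System.out.printf("random 64-bit IDs:      %.3e%n",
                collisionProbability(storages, 64));
    }
}
```

The estimate covers only randomly generated IDs. It says nothing about duplicates introduced by copying a storage directory or moving a disk between nodes, which is the other case raised in this thread.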
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277749#comment-14277749 ] Colin Patrick McCabe commented on HDFS-7575: bq. Suresh wrote: I agree with Daryn Sharp that there is no need to change the layout here. Layout change is only necessary if the two layouts are not compatible and the downgrade does not work from newer release to older. Is that the case here? The new layout used in HDFS-6482 is backwards compatible, in the sense that older versions of hadoop can run with it. HDFS-6482 just added the invariant that block ID uniquely determines which subdir a block is in, but subdirs already existed. Does that mean we shouldn't have changed the layout version for HDFS-6482? I think the answer is clear. bq. Daryn wrote: Since we know duplicate storage ids are bad, I think the correct logic is to always sanity check the storage ids at startup. If there are collisions, then the storage should be updated. Rollback should not restore a bug by reverting the storage id to a dup. I'm surprised to hear you say that rollback should not be an option. It seems like the conservative thing to do here is to allow the user to restore to the VERSION file. Obviously we believe there will be no problems. But we always believe that, or else we wouldn't have made the change. Sometimes there are problems. bq. BTW, UUID.randomUUID isn't guaranteed to return a unique id. It's highly improbable, but possible, although more likely due to older storages, user copying a storage, etc. This is really not a good argument. Collisions in 128-bit space are extremely unlikely. You will never see one in your lifetime. Up until HDFS-4645, HDFS used randomly generated block IDs drawn from a far smaller space-- 2^64 -- and we never had a problem. Phrases like billions and billions and total number of grains of sand in the world don't begin to approach the size of 2^128.
I think it's frustrating for storage IDs to change without warning just because HDFS was restarted. It will make diagnosing problems by reading log files harder because storageIDs might morph at any time. It also sets a bad precedent of not allowing downgrade and modifying VERSION files on the fly during startup.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278041#comment-14278041 ] Hadoop QA commented on HDFS-7575: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12692337/HDFS-7575.05.patch against trunk revision 7fe0f25. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.hdfs.server.namenode.TestFileTruncate org.apache.hadoop.hdfs.tools.offlineEditsViewer.TestOfflineEditsViewer The following test timeouts occurred in hadoop-common-project/hadoop-common hadoop-hdfs-project/hadoop-hdfs hadoop-hdfs-project/hadoop-hdfs-nfs: org.apache.hadoop.ha.TestZKFailoverControllerStress org.apache.hadoop.hdfs.server.mover.TestStorageMover Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9213//testReport/ Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9213//console This message is automatically generated. 
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277222#comment-14277222 ]

Daryn Sharp commented on HDFS-7575:

I'm not an expert in this area, but I still question bumping the layout version. The layout isn't changing, just an existing value in the VERSION file. Since we know duplicate storage ids are bad, I think the correct logic is to always sanity check the storage ids at startup. If there are collisions, then the storage should be updated. Rollback should not restore a bug by reverting the storage id to a dup.

BTW, {{UUID.randomUUID}} isn't guaranteed to return a unique id. A collision from the generator itself is _highly_ improbable, but possible; duplicates are more likely to come from older storages, a user copying a storage, etc. Although the storage ids are unique after the upgrade, if a disk is moved from one node to another, then a collision is possible. That is another reason why I feel explicitly checking for collisions at startup should always be done.
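The startup sanity check suggested in the comment above could look roughly like the following. This is only a sketch under stated assumptions: {{StorageIdSanityCheck}} and {{dedupe}} are hypothetical names, not part of the HDFS-7575 patch or of the Hadoop API, and the {{DS-}} prefix is assumed to match the format HDFS uses for generated storage ids. The first occurrence of an id is kept; every repeat (or missing id) gets a fresh one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.UUID;

/** Hypothetical sketch; not the HDFS-7575 patch and not a Hadoop class. */
public class StorageIdSanityCheck {

    /**
     * Returns a copy of ids in which the first occurrence of each ID is kept
     * and every repeated, null, or empty ID is replaced by a fresh one.
     */
    static List<String> dedupe(List<String> ids) {
        Set<String> seen = new HashSet<>();
        List<String> result = new ArrayList<>(ids.size());
        for (String id : ids) {
            String candidate = id;
            // Set.add returns false when the ID is already taken, so the loop
            // also retries in the very unlikely case that a fresh UUID collides.
            while (candidate == null || candidate.isEmpty() || !seen.add(candidate)) {
                candidate = "DS-" + UUID.randomUUID(); // assumed ID format
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(String[] args) {
        // An upgraded DataNode whose three storage directories share one legacy ID.
        List<String> upgraded = Arrays.asList("DS-legacy", "DS-legacy", "DS-legacy");
        System.out.println(dedupe(upgraded));
    }
}
```

A check like this only covers the directories of a single DataNode. A disk moved in from another node would pass it, so collisions across nodes would still have to be detected elsewhere.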
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277294#comment-14277294 ]

Suresh Srinivas commented on HDFS-7575:

I agree with [~daryn] that there is no need to change the layout here. A layout change is only necessary if the two layouts are not compatible and a downgrade from the newer release to the older one does not work. Is that the case here?
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14276114#comment-14276114 ]

Arpit Agarwal commented on HDFS-7575:

Any comments on the v04 patch? It would be good to get this change in. Thanks.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274124#comment-14274124 ]

Colin Patrick McCabe commented on HDFS-7575:

[~daryn]: I think a new layout version makes sense here. Basically we are going from a layout where the storageID might not have been unique, to one where it is. This is a change in the VERSION file. It's nice to have the same guarantees that we usually do (that if the upgrade fails, you can roll back via the {{previous}} directory, and so forth.) We could probably be more clever here and optimize this so we didn't have to hardlink the block files, but the upgrade path is already a little too clever and I think this is fine.

Rather than calling the new layout version UPGRADE_GENERATES_STORAGE_IDS, how about calling it something like UNIQUE_STORAGE_IDS or GUARANTEED_UNIQUE_STORAGE_IDS? That describes what the new layout is, rather than what the process of upgrading is, consistent with our other layout version descriptions.

{code}
... = new ClusterVerifier() {
  @Override
  public void verifyClusterPostUpgrade(MiniDFSCluster cluster) throws IOException {
    // Verify that a GUID-based storage ID was generated.
    final String bpid = cluster.getNamesystem().getBlockPoolId();
    StorageReport[] reports =
        cluster.getDataNodes().get(0).getFSDataset().getStorageReports(bpid);
    assertThat(reports.length, is(1));
    final String storageID = reports[0].getStorage().getStorageID();
    assertTrue(DatanodeStorage.isValidStorageId(storageID));
  }
{code}

It seems like this exact code appears in 3 different tests. Could we make this verifier a static object that is created once and shared by the tests?

+1 once these are addressed. [~daryn], please take a look if you can... we'd really like to fix this one.

Thanks, guys
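The refactoring requested in the review comment above (one verifier instance shared by several tests) can be sketched as follows. This is a hedged illustration only: to stay self-contained it replaces {{MiniDFSCluster}} and {{StorageReport}} with a plain array of storage ids, and {{SharedVerifierSketch}}, {{STORAGE_ID_VERIFIER}} and {{looksLikeUuidStorageId}} are hypothetical names rather than code from the patch.

```java
import java.util.UUID;

/** Hypothetical sketch of sharing one verifier across tests; not the patch's code. */
public class SharedVerifierSketch {

    /** Simplified stand-in for the patch's ClusterVerifier, which takes a MiniDFSCluster. */
    interface ClusterVerifier {
        void verifyClusterPostUpgrade(String[] storageIds);
    }

    /** Assumed validity rule: "DS-" followed by text that UUID.fromString accepts. */
    static boolean looksLikeUuidStorageId(String id) {
        if (id == null || !id.startsWith("DS-")) {
            return false;
        }
        try {
            UUID.fromString(id.substring(3));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    /** Created once and reused, instead of repeating the anonymous class in every test. */
    static final ClusterVerifier STORAGE_ID_VERIFIER = new ClusterVerifier() {
        @Override
        public void verifyClusterPostUpgrade(String[] storageIds) {
            if (storageIds.length != 1) {
                throw new IllegalStateException("expected exactly one storage report");
            }
            if (!looksLikeUuidStorageId(storageIds[0])) {
                throw new IllegalStateException("not a UUID-based storage ID: " + storageIds[0]);
            }
        }
    };

    // Each upgrade test would pass the shared instance to its upgrade helper.
    static void testUpgradeFrom22(String[] ids) {
        STORAGE_ID_VERIFIER.verifyClusterPostUpgrade(ids);
    }

    static void testUpgrade22via24(String[] ids) {
        STORAGE_ID_VERIFIER.verifyClusterPostUpgrade(ids);
    }

    public static void main(String[] args) {
        String[] ids = {"DS-" + UUID.randomUUID()};
        testUpgradeFrom22(ids);
        testUpgrade22via24(ids);
        System.out.println("both checks passed");
    }
}
```

A shared constant works here because the verifier holds no per-test state. If a test needed a different expectation, it could still define its own instance.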
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274166#comment-14274166 ]

Arpit Agarwal commented on HDFS-7575:

Thanks for reviewing. The v04 patch addresses the latest feedback from Colin.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14274518#comment-14274518 ]

Hadoop QA commented on HDFS-7575:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12691741/HDFS-7575.04.patch
against trunk revision b78b4a1.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

org.apache.hadoop.hdfs.TestDatanodeLayoutUpgradeGeneratesStorageID
org.apache.hadoop.hdfs.server.balancer.TestBalancer

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9189//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9189//console

This message is automatically generated.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14273948#comment-14273948 ]

Arpit Agarwal commented on HDFS-7575:

[~cmccabe], [~daryn], Are you okay with proceeding with the patch or are there any open questions you'd like to see addressed? Thanks.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271723#comment-14271723 ]

Arpit Agarwal commented on HDFS-7575:

This patch does not modify the existing test case, which reduces the binary diff.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271746#comment-14271746 ]

Daryn Sharp commented on HDFS-7575:

This is a general question; I don't have specific instances in mind. Is there any lingering data that might also need to be cleaned up or removed after the upgrade to storage ids? Also, will the NN correctly adapt to the new storage ids? I think it will when the DN reregisters and sends full block reports. We need to be certain this is rolling-upgrade safe.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271756#comment-14271756 ]

Daryn Sharp commented on HDFS-7575:
---

Also, is the layout version for UPGRADE_GENERATES_STORAGE_IDS necessary? The prior layout versions already work for single/multi storage ids and the new layout id doesn't seem to be referenced anywhere.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271782#comment-14271782 ]

Arpit Agarwal commented on HDFS-7575:
-

bq. This is a general question, I don't have specific instances: Is there any lingering data that might also need to be cleaned up or removed after the upgrade to storage ids?

The stale storages need to be cleaned up on the NN. This will be fixed by HDFS-7596.

bq. Also, will the NN correctly adapt to the new storage ids? I think it will when the DN reregisters and sends full block reports. Need to be certain this is rolling upgrade safe.

From my unit testing, the NN does handle the new storage ids and migrates blocks from the old storage to the new storage id as the block reports come in.

bq. Also, is the layout version for UPGRADE_GENERATES_STORAGE_IDS necessary? The prior layout versions already work for single/multi storage ids and the new layout id doesn't seem to be referenced anywhere.

Good question. For clusters previously upgraded from 2.2, we are technically changing the content of the VERSION files so a layout version change seemed warranted. Do you see any downside to doing so?

Thanks for taking a look at the patch.
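To make the stale-storage discussion concrete, here is a minimal Java sketch of a per-DataNode map keyed by storage ID that is refreshed on every heartbeat and pruned of IDs the heartbeat no longer mentions. The class and method names ({{SketchReport}}, {{SketchStorageMap}}, {{applyHeartbeat}}) are hypothetical and are not HDFS classes; this is not claimed to be what HDFS-7575 or HDFS-7596 actually implement, only an illustration of the idea under those assumptions.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one per-storage entry of a heartbeat.
// This is NOT the HDFS StorageReport class.
final class SketchReport {
    final String storageId;
    final long capacity;
    final long dfsUsed;

    SketchReport(String storageId, long capacity, long dfsUsed) {
        this.storageId = storageId;
        this.capacity = capacity;
        this.dfsUsed = dfsUsed;
    }
}

// Hypothetical NameNode-side view of one DataNode, keyed by storage ID.
final class SketchStorageMap {
    private final Map<String, SketchReport> storages = new HashMap<>();

    // Apply one heartbeat: record every reported storage, then drop any
    // entry whose storage ID did not appear in this heartbeat.
    void applyHeartbeat(List<SketchReport> reports) {
        Set<String> seen = new HashSet<>();
        for (SketchReport r : reports) {
            // Reports that share a storage ID overwrite each other here,
            // which is the symptom described for upgraded clusters.
            storages.put(r.storageId, r);
            seen.add(r.storageId);
        }
        storages.keySet().retainAll(seen);
    }

    int size() {
        return storages.size();
    }

    boolean contains(String storageId) {
        return storages.containsKey(storageId);
    }

    long totalCapacity() {
        long sum = 0;
        for (SketchReport r : storages.values()) {
            sum += r.capacity;
        }
        return sum;
    }

    public static void main(String[] args) {
        SketchStorageMap map = new SketchStorageMap();
        // Two drives that still share one legacy ID collapse into one entry.
        map.applyHeartbeat(Arrays.asList(
            new SketchReport("DS-legacy", 100L, 90L),
            new SketchReport("DS-legacy", 200L, 10L)));
        System.out.println("entries with duplicate IDs: " + map.size());
    }
}
```

With unique IDs per storage directory, each drive keeps its own entry and the capacity total covers all drives; with duplicate IDs, only the last report for that ID survives in the map.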
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271997#comment-14271997 ]

Hadoop QA commented on HDFS-7575:
-

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12691413/testUpgradeFrom24PreservesStorageId.tgz
  against trunk revision ae91b13.

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9172//console

This message is automatically generated.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14271963#comment-14271963 ]

Hadoop QA commented on HDFS-7575:
-

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12691363/HDFS-7575.03.patch
  against trunk revision ae91b13.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:

                  org.apache.hadoop.hdfs.TestDatanodeLayoutUpgradeGeneratesStorageID

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9166//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9166//console

This message is automatically generated.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270221#comment-14270221 ]

Arpit Agarwal commented on HDFS-7575:
-

The patch size looks large due to a directory structure change to a binary image for an existing test case.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270423#comment-14270423 ]

Colin Patrick McCabe commented on HDFS-7575:

Hi Arpit,

Thanks for taking this one on.

Is there any chance that you could do the patch without adding or moving binary files? It seems like the main thing we're testing here is just that when we start up, we're going to modify the VERSION files as expected. We shouldn't even need any block files to test that, right? Just a few mkdirs in a unit test. If we check in the existing code, this 1.7 MB commit becomes part of the repo's history forever, which slows down downloads and git pulls. I also think that untarring things during a test is kind of sluggish as well.

It seems like we never needed these tar files to begin with. We could just have the test open up the txt files and generate a temporary directory based on them. If you want the blocks to have contents, we could just generate them with a fixed random seed using java.util.Random, and always get the same contents.

{code}
     if (this.layoutVersion > HdfsConstants.DATANODE_LAYOUT_VERSION) {
+
+      // Clusters previously upgraded from layout versions earlier than
+      // ADD_DATANODE_AND_STORAGE_UUIDS failed to correctly generate a
+      // new storage ID. We fix that now.
+
+      boolean haveValidStorageId =
+          DataNodeLayoutVersion.supports(
+              LayoutVersion.Feature.ADD_DATANODE_AND_STORAGE_UUIDS, layoutVersion) &&
+              DatanodeStorage.isValidStorageId(sd.getStorageUuid());
+
       doUpgrade(datanode, sd, nsInfo); // upgrade
-      createStorageID(sd);
+      if (createStorageID(sd, !haveValidStorageId)) {
+        LOG.info("Generated new storageID " + sd.getStorageUuid() +
+            " for directory " + sd.getRoot());
+      }
{code}

It would be good to add some logging for the various cases here. If we are generating a new storage ID because the previous one was invalid, we should log that the previous one was invalid somewhere.

{code}
     if (this.layoutVersion > HdfsConstants.DATANODE_LAYOUT_VERSION) {
{code}

Is this if statement really valid? It seems like right now, there are clusters out there that are on the latest layout version, but which don't have valid storage IDs. We should either bump the NN layout version, or unconditionally check that the storage ID is valid, right?
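The suggestion to build test fixtures in code rather than from checked-in tar files could look roughly like the following Java sketch. It creates a temporary directory and fills a block file from a seeded {{java.util.Random}}, so every run produces the same bytes. The class name {{DeterministicBlockData}}, the directory name {{current}}, the file name {{blk_1}}, and the seed are illustrative assumptions, not names taken from the actual HDFS tests.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Random;

// Hypothetical test helper; not part of the HDFS test suite.
final class DeterministicBlockData {

    // Produce `length` bytes from a Random seeded with `seed`.
    // The same seed and length always yield the same byte sequence.
    static byte[] generate(long seed, int length) {
        byte[] data = new byte[length];
        new Random(seed).nextBytes(data);
        return data;
    }

    // Create the directory if needed and write a block file with
    // deterministic contents into it.
    static Path writeBlock(Path storageDir, String fileName, long seed, int length)
            throws IOException {
        Files.createDirectories(storageDir);
        Path blockFile = storageDir.resolve(fileName);
        Files.write(blockFile, generate(seed, length));
        return blockFile;
    }

    public static void main(String[] args) throws IOException {
        // Directory and file names below are made up for this example.
        Path root = Files.createTempDirectory("dn-layout-test");
        Path block = writeBlock(root.resolve("current"), "blk_1", 42L, 1024);
        byte[] onDisk = Files.readAllBytes(block);
        // The file can be verified by regenerating the bytes from the seed.
        System.out.println(Arrays.equals(onDisk, generate(42L, 1024))); // prints "true"
    }
}
```

Because the contents can be regenerated from the seed, a test can verify block data without storing any binary file in the repository.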
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270527#comment-14270527 ]

Arpit Agarwal commented on HDFS-7575:
-

Thanks for the review Colin.

The three tar files for the newly added tests are each less than 15KB, and one is less than 10KB. The diff is large because an existing tar file for the HDFS-6482 unit test is being modified. That tar file was about 600KB. I think I can rewrite the test case to not require the modification, so the diff size would be reasonable.

{code}
     if (this.layoutVersion > HdfsConstants.DATANODE_LAYOUT_VERSION) {
{code}

During upgrade {{this.layoutVersion}} is the pre-upgrade LV and {{HdfsConstants.DATANODE_LAYOUT_VERSION}} is the post-upgrade LV. Hence this check will always trigger when upgrading from 2.6 or earlier to 2.7+, which is what we want.

I'll add the logging in the next patch revision.
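The upgrade check and the storage ID validity test discussed above can be sketched together in Java as follows. The sketch relies on two assumptions: HDFS layout versions are negative integers that decrease as features are added, so an older on-disk version compares as numerically greater; and a post-HDFS-2832 storage ID has the form {{DS-}} followed by a UUID. All class and method names, and the concrete version numbers used in {{main}}, are hypothetical and chosen only for illustration. The feature test is reduced to a plain integer comparison, which is a simplification of what {{DataNodeLayoutVersion.supports}} does.

```java
import java.util.UUID;

// Hypothetical illustration; not the DataStorage or DatanodeStorage code.
final class StorageIdUpgradeSketch {
    static final String STORAGE_ID_PREFIX = "DS-";

    // Layout versions are negative and decrease over time, so an on-disk
    // version that is numerically greater is older than the software's.
    static boolean isUpgrade(int onDiskLayoutVersion, int softwareLayoutVersion) {
        return onDiskLayoutVersion > softwareLayoutVersion;
    }

    // True if the ID is the prefix followed by something UUID.fromString accepts.
    static boolean looksLikeUuidStorageId(String storageId) {
        if (storageId == null || !storageId.startsWith(STORAGE_ID_PREFIX)) {
            return false;
        }
        try {
            UUID.fromString(storageId.substring(STORAGE_ID_PREFIX.length()));
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    // Decide whether an upgrade should generate a fresh storage ID.
    // uuidFeatureLayoutVersion is the (assumed) version that introduced
    // per-directory UUID storage IDs.
    static boolean shouldRegenerate(int onDiskLayoutVersion, int softwareLayoutVersion,
                                    int uuidFeatureLayoutVersion, String storageId) {
        if (!isUpgrade(onDiskLayoutVersion, softwareLayoutVersion)) {
            return false; // nothing happens unless an upgrade is in progress
        }
        boolean supportsUuids = onDiskLayoutVersion <= uuidFeatureLayoutVersion;
        boolean haveValidStorageId = supportsUuids && looksLikeUuidStorageId(storageId);
        return !haveValidStorageId;
    }

    public static void main(String[] args) {
        // Made-up version numbers and IDs, for illustration only.
        String legacyId = "DS-1234567890-10.0.0.1-50010-1388888888888";
        String uuidId = "DS-123e4567-e89b-12d3-a456-426614174000";
        System.out.println(shouldRegenerate(-55, -57, -49, legacyId)); // prints "true"
        System.out.println(shouldRegenerate(-55, -57, -49, uuidId));   // prints "false"
    }
}
```

In this sketch a directory that is already at the software's layout version is never touched, even if its ID is in the legacy format, which is why the check only helps if the fix ships together with a layout version bump.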
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270229#comment-14270229 ]

Hadoop QA commented on HDFS-7575:
-

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12690971/HDFS-7575.01.patch
  against trunk revision 7e2d9a3.

    {color:red}-1 patch{color}. The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9157//console

This message is automatically generated.
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270532#comment-14270532 ]

Hadoop QA commented on HDFS-7575:
---------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12690991/HDFS-7575.02.patch
against trunk revision ae91b13.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-hdfs-project/hadoop-hdfs:
org.apache.hadoop.hdfs.TestDatanodeLayoutUpgrade

Test results: https://builds.apache.org/job/PreCommit-HDFS-Build/9158//testReport/
Console output: https://builds.apache.org/job/PreCommit-HDFS-Build/9158//console

This message is automatically generated.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
Attachments: HDFS-7575.01.patch, HDFS-7575.02.patch
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265903#comment-14265903 ]

Lars Francke commented on HDFS-7575:
------------------------------------

I don't object at all, quite the opposite. Thanks for taking care of this.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265558#comment-14265558 ]

Arpit Agarwal commented on HDFS-7575:
-------------------------------------

I'm testing a fix and expect to post a patch by tomorrow. It will also fix the storageMap issue.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.4.0, 2.5.0, 2.6.0
Reporter: Lars Francke
Assignee: Arpit Agarwal
Priority: Critical
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264840#comment-14264840 ]

Colin Patrick McCabe commented on HDFS-7575:
--------------------------------------------

I'm concerned that if storage ids are not unique, a lot of other bad things could happen. I don't think we should hack around this. I know the upgrade code path isn't fun but the alternatives are worse. Anywhere where someone is using a storage id, it could fail in mysterious ways on those older, improperly upgraded clusters. Our unit tests would not catch this since for newly installed clusters, the problem does not occur. And people are going to keep assuming that storage IDs are unique, because they're supposed to be.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Lars Francke
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264861#comment-14264861 ]

Daryn Sharp commented on HDFS-7575:
-----------------------------------

I completely agree with Colin regarding an upgrade path. Kihwal and I have had concerns about the shared storage id for quite a while now and have discussed how to auto-upgrade old storage dirs, but have not had the cycles to do it.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Lars Francke
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264896#comment-14264896 ]

Lars Francke commented on HDFS-7575:
------------------------------------

Okay, agreed. Thanks for the input. I'm afraid that I won't have time to learn this code and provide a fix. If anyone else could step up that'd be much appreciated.

The second part of the described problem will still happen though and needs to be fixed in the NN Heartbeat code: Old storageIds will never be pruned at the moment. I suggest not updating the {{storageMap}} in {{DatanodeDescriptor}} but overwriting it with what the latest Heartbeat gave us. Does that sound sensible or am I missing something?

I'll open a separate issue for the upgrade changes.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Lars Francke
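The pruning idea in Lars's comment (dropping storage IDs that no longer appear in a heartbeat) can be sketched on a plain map. This is an illustrative assumption, not the NameNode code: {{pruneStale}} is a hypothetical helper, the map values are bare capacities, and the real {{storageMap}} entries carry more state than a number, which may be one reason a simple overwrite is questioned elsewhere in the thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Illustrative only; operates on a plain map, not on DatanodeDescriptor. */
final class StalePruneSketch {

    /**
     * Removes every entry whose storage ID is absent from the latest
     * heartbeat and returns the set of IDs that were removed.
     */
    static Set<String> pruneStale(Map<String, Long> storageMap, Set<String> reportedIds) {
        Set<String> stale = new HashSet<>(storageMap.keySet());
        stale.removeAll(reportedIds);          // IDs known to the map but not reported
        storageMap.keySet().removeAll(stale);  // removal through the key view updates the map
        return stale;
    }

    public static void main(String[] args) {
        Map<String, Long> storageMap = new HashMap<>();
        storageMap.put("DS-1", 100L);
        storageMap.put("DS-replaced", 100L); // drive that was swapped out

        Set<String> latest = new HashSet<>(Arrays.asList("DS-1", "DS-new"));
        System.out.println("pruned: " + pruneStale(storageMap, latest));
        System.out.println("remaining: " + storageMap.keySet());
    }
}
```

The helper only removes entries; adding the newly reported {{DS-new}} storage is left to whatever code processes the report itself.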
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264982#comment-14264982 ]

Arpit Agarwal commented on HDFS-7575:
-------------------------------------

[~lars_francke], thanks for reporting this bug and the thorough investigation.

The correct fix is to generate storage IDs as part of the upgrade as Colin said. I thought I had handled this case in HDFS-2832. Assigned it to myself since I broke it. Let me know if you object.

bq. Another problem is that old StorageReports are never removed from the storageMap. So if I replace a drive and it gets a new storage Id the old one will still be in place and used for all calculations by the Balancer until a restart of the NameNode.

We can fix it in a separate Jira. I don't think just overwriting storageMap is correct though.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Lars Francke
Assignee: Arpit Agarwal
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263528#comment-14263528 ]

Lars Francke commented on HDFS-7575:
------------------------------------

Agreed, could also be seen as an upgrade problem. I could probably prepare a patch that fixes the NameNode handling in the way I described. It would make the Balancer work again. I don't think I feel comfortable enough with the upgrade code though. What do you think?

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Lars Francke
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261055#comment-14261055 ]

Lars Francke commented on HDFS-7575:
------------------------------------

I worked around this by doing the following for each DataNode:
* Stop the DataNode
* Change the storageId in each storage directory (it's in the VERSION file, e.g. {{/mnt/disk1/dfs/dn/current/VERSION}}) to a unique value
* Start the DataNode

Then afterwards I restarted the Standby NN (NN2), failed over manually, restarted the new Standby NN (NN1). The Balancer seems to run fine since then.

NameNode not handling heartbeats properly after HDFS-2832
---------------------------------------------------------

Key: HDFS-7575
URL: https://issues.apache.org/jira/browse/HDFS-7575
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Lars Francke
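The middle step of the workaround (giving each VERSION file a unique storage ID) could be scripted roughly as below. This is a hedged sketch, not a supported tool: it assumes the VERSION file is a Java properties file, that the key is named {{storageID}}, and that a {{DS-}} prefix followed by a random UUID is an acceptable value. All three are assumptions to verify against your own files. The class and method names are invented for this example, and the DataNode must be stopped first, as the comment says.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;
import java.util.UUID;

/** Hypothetical helper for the manual workaround; run only with the DataNode stopped. */
final class VersionFileSketch {

    /** Assumed key name inside the VERSION file. */
    static final String STORAGE_ID_KEY = "storageID";

    /** Replaces the storage ID in the given properties and returns the new value. */
    static String withFreshStorageId(Properties props) {
        String fresh = "DS-" + UUID.randomUUID(); // assumed ID format
        props.setProperty(STORAGE_ID_KEY, fresh);
        return fresh;
    }

    /** Loads a VERSION file, swaps the storage ID, and writes the file back. */
    static String rewriteVersionFile(Path versionFile) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(versionFile)) {
            props.load(in);
        }
        String fresh = withFreshStorageId(props);
        // Note: Properties.store adds a timestamp comment and may reorder keys.
        try (OutputStream out = Files.newOutputStream(versionFile)) {
            props.store(out, null);
        }
        return fresh;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: VersionFileSketch <path-to-VERSION>");
            return;
        }
        System.out.println("new storage ID: " + rewriteVersionFile(Paths.get(args[0])));
    }
}
```

Backing up each VERSION file before running anything like this would be prudent, since all other keys in the file must survive unchanged.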
[jira] [Commented] (HDFS-7575) NameNode not handling heartbeats properly after HDFS-2832
[ https://issues.apache.org/jira/browse/HDFS-7575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14261504#comment-14261504 ]

Colin Patrick McCabe commented on HDFS-7575:

This seems like an upgrade problem. Each directory should have its own storage id. It seems like we should fix the upgrade code to make sure that this is the case. If necessary, that means we should generate new codes for some directories.
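As a rough illustration of the upgrade-side fix suggested here (every directory ends up with its own storage id, with new ones generated where necessary), a minimal sketch might look like the following. It is not the actual upgrade code: {{StorageIdUpgradeSketch}}, {{assignUniqueIds}} and {{newStorageId}} are hypothetical names, the {{DS-}} prefix on generated ids is an assumption, and letting the first directory keep the shared id is just one possible policy.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

final class StorageIdUpgradeSketch {
    /** Generates a fresh id; the "DS-" prefix is an assumption for illustration. */
    static String newStorageId() {
        return "DS-" + UUID.randomUUID();
    }

    /**
     * Takes a mapping of storage directory to its current storage id and returns
     * a mapping in which every directory has a distinct id. The first directory
     * seen with a given id keeps it; later directories with the same id (or with
     * no id at all) receive a newly generated one.
     */
    static Map<String, String> assignUniqueIds(Map<String, String> idsByDirectory) {
        Set<String> seen = new HashSet<>();
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : idsByDirectory.entrySet()) {
            String id = entry.getValue();
            // Regenerate until the id is present and not used by an earlier directory.
            while (id == null || !seen.add(id)) {
                id = newStorageId();
            }
            result.put(entry.getKey(), id);
        }
        return result;
    }

    public static void main(String[] args) {
        // Directory paths other than disk1 are made up for the example.
        Map<String, String> before = new LinkedHashMap<>();
        before.put("/mnt/disk1/dfs/dn", "DS-legacy");
        before.put("/mnt/disk2/dfs/dn", "DS-legacy");
        before.put("/mnt/disk3/dfs/dn", "DS-legacy");
        for (Map.Entry<String, String> e : assignUniqueIds(before).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Regenerating the id for every directory that shares one, as in the manual workaround, would be an equally valid policy; keeping the first id only limits how many VERSION files need to change.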