[ https://issues.apache.org/jira/browse/HDDS-2446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972318#comment-16972318 ]
Stephen O'Donnell commented on HDDS-2446:
-----------------------------------------

I looked into the code a bit more to double-check this area. The only place outside of tests where a DatanodeInfo object gets created is via SCMNodeManager.register() -> nodeStateManager.addNode(), where the new DatanodeInfo is created.

As far as I can tell, nothing removes a registered node (DatanodeDetails or DatanodeInfo) from SCM except a restart - SCM will remember all nodes which have previously registered with it. If a node re-registers, the chain of calls above throws a NodeAlreadyExistsException on registration, which is caught, and success is still returned to the DN.

If a node goes dead, all of its containers will be purged, but if it re-registers without having gone dead, the containers will still be present, referencing the old DatanodeInfo object, which will not have changed. One thing we could do is purge the container list on re-registration, as the register command should carry a container report which must be processed anyway.

As an aside, I wonder if there is a bug in the re-registration process. The way SCM checks whether a node has already registered is to look it up by UUID. If a DN is stopped and changes its IP or hostname, but retains the UUID, then it will 'register' successfully, but the stored DatanodeDetails information will not be updated if any of it has changed.
{code}
public RegisteredCommand register(
    DatanodeDetails datanodeDetails, NodeReportProto nodeReport,
    PipelineReportsProto pipelineReportsProto) {
  InetAddress dnAddress = Server.getRemoteIp();
  if (dnAddress != null) {
    // Mostly called inside an RPC, update ip and peer hostname
    datanodeDetails.setHostName(dnAddress.getHostName());
    datanodeDetails.setIpAddress(dnAddress.getHostAddress());
  }
  try {
    String dnsName;
    String networkLocation;
    datanodeDetails.setNetworkName(datanodeDetails.getUuidString());
    if (useHostname) {
      dnsName = datanodeDetails.getHostName();
    } else {
      dnsName = datanodeDetails.getIpAddress();
    }
    networkLocation = nodeResolve(dnsName);
    if (networkLocation != null) {
      datanodeDetails.setNetworkLocation(networkLocation);
    }
    nodeStateManager.addNode(datanodeDetails);
    clusterMap.add(datanodeDetails);
    addEntryTodnsToUuidMap(dnsName, datanodeDetails.getUuidString());
    // Updating Node Report, as registration is successful
    processNodeReport(datanodeDetails, nodeReport);
    LOG.info("Registered Data node : {}", datanodeDetails);
  } catch (NodeAlreadyExistsException e) {
    if (LOG.isTraceEnabled()) {
      LOG.trace("Datanode is already registered. Datanode: {}",
          datanodeDetails.toString());
    }
  }
  return RegisteredCommand.newBuilder().setErrorCode(ErrorCode.success)
      .setDatanode(datanodeDetails)
      .setClusterID(this.scmStorageConfig.getClusterID())
      .build();
}
{code}
We should probably open another Jira if this bug is indeed there, but we may need to look at re-registration for maintenance mode anyway, as that will involve a node going dead, NOT clearing its replicas out, and then registering again.
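To make the suspected re-registration issue concrete, here is a minimal, self-contained sketch. The classes below (DnDetails, ReRegistrationSketch) are hypothetical stand-ins for DatanodeDetails and the SCM node map, not the real Ozone code; they only model the behaviour described above, where the node map is keyed by UUID and an already-known UUID is treated as a successful no-op.

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for DatanodeDetails: UUID identity plus a mutable hostname.
class DnDetails {
  final UUID uuid;
  String hostName;

  DnDetails(UUID uuid, String hostName) {
    this.uuid = uuid;
    this.hostName = hostName;
  }
}

public class ReRegistrationSketch {
  // Node map keyed by UUID, mirroring how nodeStateManager tracks nodes.
  private final Map<UUID, DnDetails> nodes = new HashMap<>();

  /** Register as described in the comment: an already-known UUID is a no-op. */
  public boolean register(DnDetails dn) {
    if (nodes.containsKey(dn.uuid)) {
      // NodeAlreadyExistsException path: caught, success returned to the DN,
      // but the stored details are NOT refreshed.
      return true;
    }
    nodes.put(dn.uuid, dn);
    return true;
  }

  public String storedHostName(UUID uuid) {
    return nodes.get(uuid).hostName;
  }

  public static void main(String[] args) {
    ReRegistrationSketch scm = new ReRegistrationSketch();
    UUID id = UUID.randomUUID();
    scm.register(new DnDetails(id, "host-a"));
    // DN restarts with a new hostname but the same UUID.
    scm.register(new DnDetails(id, "host-b"));
    // The stale hostname survives re-registration -- the suspected bug.
    System.out.println(scm.storedHostName(id)); // prints "host-a"
  }
}
{code}
If this model matches the real code path, one possible fix would be to refresh the stored details (hostname, IP, network location) on the NodeAlreadyExistsException path instead of ignoring them.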
> ContainerReplica should contain DatanodeInfo rather than DatanodeDetails
> ------------------------------------------------------------------------
>
>                 Key: HDDS-2446
>                 URL: https://issues.apache.org/jira/browse/HDDS-2446
>             Project: Hadoop Distributed Data Store
>          Issue Type: Sub-task
>          Components: SCM
>    Affects Versions: 0.5.0
>            Reporter: Stephen O'Donnell
>            Assignee: Stephen O'Donnell
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The ContainerReplica object is used by the SCM to track containers reported
> by the datanodes. The current fields stored in ContainerReplica are:
> {code}
> final private ContainerID containerID;
> final private ContainerReplicaProto.State state;
> final private DatanodeDetails datanodeDetails;
> final private UUID placeOfBirth;
> {code}
> Now we have introduced decommission and maintenance mode, the replication
> manager (and potentially other parts of the code) need to know the status of
> the replica in terms of IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED etc to
> make replication decisions.
> The DatanodeDetails object does not carry this information, however the
> DatanodeInfo object extends DatanodeDetails and does carry the required
> information.
> As DatanodeInfo extends DatanodeDetails, any place which needs a
> DatanodeDetails can accept a DatanodeInfo instead.
> In this Jira I propose we change the DatanodeDetails stored in
> ContainerReplica to DatanodeInfo.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)