[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected
[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428200#comment-13428200 ]

Raju commented on HDFS-3744:

Oh, yesterday was a bad day at work for me :( Let me correct my second opinion first:

{quote}2. I would like to go with your first opinion, with a STANDBY check in replication (or move replication to the Active service).{quote}

By "first opinion" I meant Uma Maheswara Rao's opinion. I accept that the DFSAdmin command needs to be sent to both NNs; that can solve the problem here, which I believe is also what you meant, Aaron. In addition, I would like to add a Standby check in the replication monitor to avoid extra load on the cluster.

My first opinion: to support my proposal of persisting the node's decommissioned state, consider the scenarios where

{quote}The decommission command is given by the admin but the Standby NN is down ... etc. Scenarios where the Standby NN is not available when the command is issued.{quote}

DFSAdmin just creates a DFS client and calls refreshNodes() (the client retries the NN 10 times by default, not forever, if I am not wrong). In this case the Standby NN never learns about the decommission request, and when the Standby comes up and switches to active, I guess it will treat the decommissioned node as a normal node.

{quote}I'm hesitant to go with this suggestion. How would differences be rectified between what's persisted in the edit log and what's present in the excluded hosts file?{quote}

Initially I thought the same thing, since we would be persisting the information in two different forms, but consider:

{quote}The admin could have just configured the exclude nodes in the file, but not issued the refreshNodes command.{quote}

By persisting into the edit logs we can be sure which DNs are decommissioned, not only on the Standby NN but also when a standalone NN restarts. Or is there any other way the NN can identify decommissioned nodes without persisting? Here I mean to persist all the stages of recommission and decommission too. Please correct me if I am not correct.
I will be very glad to hear more opinions about my proposal of persisting.

Decommissioned nodes are included in cluster after switch which is not expected
---
Key: HDFS-3744
URL: https://issues.apache.org/jira/browse/HDFS-3744
Project: Hadoop HDFS
Issue Type: Bug
Components: ha
Affects Versions: 2.0.0-alpha, 2.1.0-alpha, 2.0.1-alpha
Reporter: Brahma Reddy Battula

Scenario:
1. Start ANN and SNN with three DNs.
2. Exclude DN1 from the cluster using the decommission feature (./hdfs dfsadmin -fs hdfs://ANNIP:8020 -refreshNodes).
3. After the decommission is successful, switch so that the SNN becomes Active.

Here the excluded node (DN1) is included in the cluster: it is possible to write files to the excluded node, since it is no longer excluded. The SNN (Active before the switch) UI shows decommissioned=1 while the ANN UI shows decommissioned=0.

One more observation: all dfsadmin commands create a proxy only on nn1, irrespective of which node is Active or Standby. I think this also needs a re-look. I do not understand why HA is not provided for dfsadmin commands. Please correct me if I am wrong.

--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected
[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13428228#comment-13428228 ]

Raju commented on HDFS-3744:

I would like to add some scenarios which can be handled by persisting:
1. The network is unreachable between the DFSClient issuing refreshNodes and the Standby NN.
2. The network is unreachable between the DFSClient issuing refreshNodes and the Active NN, while the refreshNodes command reaches the Standby NN, which marks the decommission as in progress, and the NN network is reachable.
Other network-down scenarios apply as well.
[jira] [Commented] (HDFS-3735) [ NNUI -- NNJspHelper.java ] Last three fields not considered for display data in sorting
[ https://issues.apache.org/jira/browse/HDFS-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427400#comment-13427400 ]

Raju commented on HDFS-3735:

Hey Brahma,
{code}
case FIELD_FAILED_VOL:
  ret = d1.getVolumeFailures() - d2.getVolumeFailures();
  break;
{code}
Since this code is used in a comparator, it is better to return a negative value, 0, or a positive value explicitly (conventionally -1, 0 or 1). The current fix will work even without that, but in general those are the values a comparator is expected to return.

[ NNUI -- NNJspHelper.java ] Last three fields not considered for display data in sorting
--
Key: HDFS-3735
URL: https://issues.apache.org/jira/browse/HDFS-3735
Project: Hadoop HDFS
Issue Type: Bug
Components: name-node
Affects Versions: 2.0.0-alpha, 2.0.1-alpha
Reporter: Brahma Reddy Battula
Priority: Minor

The live datanode list is not correctly sorted for the columns Block Pool Used (GB), Block Pool Used (%) and Failed Volumes. Read the comments for more details.
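As an aside (this is an illustration, not part of the JIRA thread): the deeper reason to avoid subtraction-based comparators is that int subtraction can overflow and return the wrong sign, whereas Integer.compare (Java 7+) honours the comparator contract for all inputs. Volume-failure counts are small in practice, so the risk here is theoretical, but a minimal sketch shows the failure mode:

```java
public class SubtractionComparatorDemo {
    public static void main(String[] args) {
        int a = Integer.MIN_VALUE;
        int b = 1;
        // Subtraction overflows: a - b wraps around to a positive value,
        // so a subtraction-based comparator would wrongly report a > b.
        System.out.println((a - b) > 0);               // prints true (wrong sign)
        // Integer.compare avoids the overflow and returns a correctly
        // signed result (negative here, since a < b).
        System.out.println(Integer.compare(a, b) < 0); // prints true (correct sign)
    }
}
```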
[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected
[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427404#comment-13427404 ]

Raju commented on HDFS-3744:

I have two suggestions to handle this issue:
1. Persist NODE_DECOMMISSIONED from the Active, so the SNN will also see the node as DECOMMISSIONED.
2. I would like to go with your first opinion, with a SAFEMODE check in replication (or move replication to the Active service).
[jira] [Commented] (HDFS-3744) Decommissioned nodes are included in cluster after switch which is not expected
[ https://issues.apache.org/jira/browse/HDFS-3744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426775#comment-13426775 ]

Raju commented on HDFS-3744:

{quote}We have to execute this particular command on both the nodes. Otherwise, even though we make failover work here, the SNN will still not know about the excluded nodes, as it did not get any refreshNodes command.{quote}

Until the SNN gets the refreshNodes command, it will not be aware of the DN's decommission. But if we send the command to the SNN as well, replication will be triggered on the SNN too. Even though the DN will not accept any requests from the SNN, the SNN will still be trying to replicate the blocks, which increases the load on the SNN, on the network, and on the DNs. This is my initial opinion; please let me know if it is wrong.
[jira] [Commented] (HDFS-3747) In some TestCace used in maven. The balancer-Test failed because of TimeOut
[ https://issues.apache.org/jira/browse/HDFS-3747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426777#comment-13426777 ]

Raju commented on HDFS-3747:

I guess for this problem we need to look into why the source node does not have the blocks. Could you post the logs (they would be more helpful) and mention which test you executed?

In some TestCace used in maven. The balancer-Test failed because of TimeOut
---
Key: HDFS-3747
URL: https://issues.apache.org/jira/browse/HDFS-3747
Project: Hadoop HDFS
Issue Type: Test
Components: balancer
Affects Versions: 2.1.0-alpha, 3.0.0
Reporter: meng gong
Labels: test
Fix For: 2.1.0-alpha, 3.0.0

When running a given Balancer test case, the balancer thread tries to move some blocks across racks but cannot find any available blocks in the source rack. The thread then does not interrupt until the isTimeUp flag reaches 20 minutes, but Maven judges the test failed because the thread has already run for 15 minutes.
[jira] [Commented] (HDFS-3734) TestFSEditLogLoader.testReplicationAdjusted() will hang if number of blocks are more than one
[ https://issues.apache.org/jira/browse/HDFS-3734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13426783#comment-13426783 ]

Raju commented on HDFS-3734:

This is because of an incorrect calculation of blocksThreshold, if I am not wrong.

TestFSEditLogLoader.testReplicationAdjusted() will hang if number of blocks are more than one
-
Key: HDFS-3734
URL: https://issues.apache.org/jira/browse/HDFS-3734
Project: Hadoop HDFS
Issue Type: Bug
Affects Versions: 2.1.0-alpha, 3.0.0
Reporter: Vinay

TestFSEditLogLoader.testReplicationAdjusted(), which was added in HDFS-2003, will fail if the number of blocks before a cluster restart is more than one.

Test scenario:
1. Write a file with min replication 1 and replication factor 1.
2. Change the min replication to 2 and restart the cluster.

Expected: min replication should be automatically satisfied on cluster restart by replicating more blocks.

Currently, if the number of blocks before restart is only one, then on restart the NN will not enter safemode, hence replication will happen and satisfy the min replication factor. If the initial block count is more than one, with replication factor 1, then on restart the NN will enter safemode and never come out.
[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0
[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425849#comment-13425849 ]

Raju commented on HDFS-3731:

{quote}We should probably hook the finalize code to also rm -rf the blocksbeingwritten directory, or else the storage will be leaked forever, right?{quote}
Yes Todd, I forgot to mention finalize; we need to delete the BBW directory in finalize.

{quote}If you agree, maybe we can file a separate JIRA for that, as this JIRA is mainly talking about the 2.0 upgrade from 1.0.{quote}
Uma, I accept your opinion, but we would need to modify code which is already released; I am not very clear on how to fix that.

{quote}I thought that hardlinks to directories are not typically supported ...{quote}
Robert, I am not referring to a direct hard link of the directory, like
{code}
ln sourceDir destDir
{code}
I am talking about using
{code}
void org.apache.hadoop.hdfs.server.datanode.DataStorage.linkBlocks(File from, File to, int oldLV, HardLink hl)
{code}
which does the individual file linking for all the blocks.

2.0 release upgrade must handle blocks being written from 1.0
-
Key: HDFS-3731
URL: https://issues.apache.org/jira/browse/HDFS-3731
Project: Hadoop HDFS
Issue Type: Bug
Components: data-node
Affects Versions: 2.0.0-alpha
Reporter: Suresh Srinivas
Assignee: Todd Lipcon
Priority: Blocker

Release 2.0 upgrades must handle blocks-being-written (bbw) files from the 1.0 release. Problem reported by Brahma Reddy.
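To make the distinction concrete, here is a minimal, self-contained sketch of per-file hard linking (this is an illustration of the idea, not the actual DataStorage.linkBlocks implementation): each block file under `from` is hard-linked into `to` individually, rather than hard-linking the directory itself, which most filesystems do not allow.

```java
import java.io.IOException;
import java.nio.file.*;

public class LinkBlocksSketch {
    // Link every regular file under "from" into "to" individually.
    // A hard link shares the inode with the original, so no data is copied
    // and both directory trees see the same block contents.
    static void linkBlocks(Path from, Path to) throws IOException {
        Files.createDirectories(to);
        try (DirectoryStream<Path> blocks = Files.newDirectoryStream(from)) {
            for (Path block : blocks) {
                if (Files.isRegularFile(block)) {
                    Files.createLink(to.resolve(block.getFileName()), block);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo: link a fake block file from a "blocksbeingwritten"-style
        // directory into an "rbw"-style directory on the same filesystem.
        Path from = Files.createTempDirectory("bbw");
        Path to = from.resolveSibling(from.getFileName() + "-rbw");
        Files.write(from.resolve("blk_1"), "data".getBytes());
        linkBlocks(from, to);
        System.out.println(Files.exists(to.resolve("blk_1"))); // prints true
    }
}
```

Note that hard links only work within a single filesystem, which holds for an in-place upgrade of the same storage directory.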
[jira] [Commented] (HDFS-2956) calling fetchdt without a --renewer argument throws NPE
[ https://issues.apache.org/jira/browse/HDFS-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425861#comment-13425861 ]

Raju commented on HDFS-2956:

Here we are defining the protocol message as
{code}
message GetDelegationTokenRequestProto {
  required string renewer = 1;
}
{code}
Based on some of the above comments, I feel the renewer should be optional (since null can be passed, i.e. we may not provide a renewer). Even with optional, the generated class will keep a null check for renewer, so I guess we can guard the setter with a null check, like
{code}
GetDelegationTokenRequestProto.Builder builder =
    GetDelegationTokenRequestProto.newBuilder();
if (renewer != null) {
  builder.setRenewer(renewer.toString());
}
GetDelegationTokenRequestProto req = builder.build();
{code}
This should be possible once we declare renewer optional; similarly we can parse the message back at the server-side translator. Please correct me if I am wrong.

calling fetchdt without a --renewer argument throws NPE
---
Key: HDFS-2956
URL: https://issues.apache.org/jira/browse/HDFS-2956
Project: Hadoop HDFS
Issue Type: Bug
Components: security
Affects Versions: 0.24.0
Reporter: Todd Lipcon
Assignee: Daryn Sharp

If I call bin/hdfs fetchdt /tmp/mytoken without a --renewer foo argument, it throws a NullPointerException:

Exception in thread "main" java.lang.NullPointerException
  at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:830)

This is because getDelegationToken is being called with a null renewer.
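For illustration, here is a self-contained sketch of the "set only if present, check presence on the other side" pattern being proposed. The Request class below is a hypothetical stand-in for the generated GetDelegationTokenRequestProto (it is not the real generated class); with a protobuf optional field, the generated code provides an analogous hasRenewer() presence check.

```java
public class OptionalFieldSketch {
    // Hypothetical stand-in for the generated protobuf message class.
    static final class Request {
        private final String renewer;                 // null means "not set"
        private Request(String renewer) { this.renewer = renewer; }

        static final class Builder {
            private String renewer;
            Builder setRenewer(String r) { this.renewer = r; return this; }
            Request build() { return new Request(renewer); }
        }
        // Presence check, analogous to protobuf's generated hasRenewer().
        boolean hasRenewer() { return renewer != null; }
        String getRenewer() { return renewer; }
    }

    // Client side: only set the field when a renewer was actually supplied,
    // avoiding the NPE from calling setRenewer(null).
    static Request makeRequest(String renewer) {
        Request.Builder b = new Request.Builder();
        if (renewer != null) {
            b.setRenewer(renewer);
        }
        return b.build();
    }

    public static void main(String[] args) {
        // Server side would branch on the presence check before reading the field.
        System.out.println(makeRequest(null).hasRenewer());   // prints false
        System.out.println(makeRequest("hdfs").getRenewer()); // prints hdfs
    }
}
```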
[jira] [Commented] (HDFS-3731) 2.0 release upgrade must handle blocks being written from 1.0
[ https://issues.apache.org/jira/browse/HDFS-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13425015#comment-13425015 ]

Raju commented on HDFS-3731:

In the upgrade process we usually rename the old folder to the previous folder, but here we have two folders to handle (one containing finalized blocks and the other the blocks being written). I would like to propose:

sd.root/previous -> sd.root/current/BPID/current/finalized (NOTE: hardlink)
sd.root/blocksbeingwritten -> sd.root/current/BPID/current/rbw (NOTE: hardlink)

NOTE: the blocksbeingwritten folder is not renamed, since renaming it can cause rollback problems with old versions.

With the above fix there are some advantages:
1. A simple change in the existing (2.x+) code.
2. No change required in the old version (1.x) code.
3. Rollback will work without any new effort.

Please correct me if I am wrong.