[jira] [Commented] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803199#comment-16803199 ] yunjiong zhao commented on HDFS-10477: -- [~jojochuang], I don't mind, please go ahead. Thank you. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Major > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.patch > > > In our cluster, when we stopped decommissioning a rack which has 46 DataNodes, > it locked the namesystem for about 7 minutes, as the log below shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 
over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.14:1004 > 2016-05-26 20:13:25,369 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 280219 over-replicated blocks on 10.142.27.14:1004 during recommissioning > 2016-05-26 20:13:25,370 INFO > o
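The seven-minute figure is consistent with the log above: each DataNode took roughly 8 to 10 seconds to have its ~270k-315k over-replicated blocks invalidated, one node at a time, while the namesystem lock was held. A rough back-of-the-envelope check (a sketch only; the class and method names are made up for illustration, and the per-node time is an approximation read off the log timestamps):

```java
// Rough check of the reported lock-hold time: ~46 DataNodes in the rack,
// each taking about 9 seconds to invalidate its over-replicated blocks
// while the namesystem lock is held.
public class LockHoldEstimate {
    public static double estimateSeconds(int datanodes, double secondsPerNode) {
        return datanodes * secondsPerNode;
    }

    public static void main(String[] args) {
        double total = estimateSeconds(46, 9.0);
        // 46 * 9 = 414 seconds, i.e. just under 7 minutes of lock hold time.
        System.out.printf("~%.0f seconds (~%.1f minutes)%n", total, total / 60.0);
    }
}
```

That the total scales linearly with the number of nodes in the rack is exactly why recommissioning a whole rack, rather than a single node, pushed the lock hold past the failover threshold.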
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16443349#comment-16443349 ] yunjiong zhao commented on HDFS-13441: -- [~daryn], you are right, that's not the most reliable way to fix this issue. After rethinking it, I believe a one-line change should fix it. When the NameNode runs startActiveServices, it calls {code:java} blockManager.getDatanodeManager().markAllDatanodesStale(); {code} Inside markAllDatanodesStale, add one line to make sure each DataNode gets the current keys from the active NameNode: {code:java} dn.setNeedKeyUpdate(true); {code} > DataNode missed BlockKey update from NameNode due to HeartbeatResponse was > dropped > -- > > Key: HDFS-13441 > URL: https://issues.apache.org/jira/browse/HDFS-13441 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.7.1 >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Major > Attachments: HDFS-13441.002.patch, HDFS-13441.003.patch, > HDFS-13441.patch > > > After NameNode failover, lots of applications failed because some DataNodes > couldn't re-compute the password from the block token. > {code:java} > 2018-04-11 20:10:52,448 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error > processing unknown operation src: /10.142.74.116:57404 dst: > /10.142.77.45:50010 > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=1523538652448, > keyId=1762737944, userId=hadoop, > blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, > access modes=[WRITE]), since the required block key (keyID=1762737944) > doesn't exist.] 
> at > com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598) > at > com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=1523538652448, > keyId=1762737944, userId=hadoop, > blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, > access modes=[WRITE]), since the required block key (keyID=1762737944) > doesn't exist. 
> at > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382) > at > org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241) > at > com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) > ... 7 more > {code} > > In the DataNode log, we didn't see the DataNode update its block keys around > 2018-04-11 09:55:00 or around 2018-04-11 19:55:00. > {code:java} > 2018-04-10 14:51:36,424 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-10 23:55:38,420 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-11 00:51:34,792 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-11 10:51:39,403 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-11 20:51:44,422 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-12 02:54:47,855 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-12 05:55:44,456 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > {code} > The reason is there was a SocketTimeoutException when sending the heartbeat
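The one-line fix proposed in the comment above can be sketched with minimal stand-ins for the real classes (the names below only mirror the Hadoop code; this is a self-contained illustration under that assumption, not the actual DatanodeManager implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for DatanodeDescriptor: tracks whether a DataNode needs a fresh
// block-key push in its next HeartbeatResponse.
class FakeDatanode {
    boolean contentStale = false;   // existing behavior: needs a block report
    boolean needKeyUpdate = false;  // needs current block keys from the NameNode

    void markStale() {
        contentStale = true;
        // The proposed one-line addition: without it, a DataNode whose earlier
        // key-carrying HeartbeatResponse was dropped keeps its stale keys until
        // the next scheduled key roll.
        needKeyUpdate = true;
    }
}

// Stand-in for DatanodeManager. In the real code, startActiveServices()
// invokes markAllDatanodesStale() when a NameNode becomes active.
class FakeDatanodeManager {
    final List<FakeDatanode> datanodes = new ArrayList<>();

    void markAllDatanodesStale() {
        for (FakeDatanode dn : datanodes) {
            dn.markStale();
        }
    }
}
```

The effect is that on the first heartbeat after failover, the newly active NameNode sees the flag set and re-sends the current block keys, so a previously dropped key update is recovered instead of being silently lost.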
[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-13441: - Attachment: HDFS-13441.003.patch
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16441110#comment-16441110 ] yunjiong zhao commented on HDFS-13441: -- [~hexiaoqiao], DataNode can't use NamenodeProtocol.
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16440104#comment-16440104 ] yunjiong zhao commented on HDFS-13441: -- The unit test failure is not related to this patch.
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439905#comment-16439905 ] yunjiong zhao commented on HDFS-13441: -- Uploaded HDFS-13441.002.patch, which contains a unit test.
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439897#comment-16439897 ] yunjiong zhao commented on HDFS-13441: -- [~hexiaoqiao], letting the DataNode pull block keys from the NameNode is one option, but since it needs a protocol change, if I understand correctly it would have to wait for Hadoop 4.
[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-13441: - Attachment: HDFS-13441.002.patch
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438531#comment-16438531 ] yunjiong zhao commented on HDFS-13441: -- [~hexiaoqiao], this issue is different: it is not about DataNode registration with the NameNode. It is about heartbeat responses lost from the Standby NameNode, which can leave some DataNodes without the current block key after the Standby NameNode becomes active.
[jira] [Comment Edited] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437740#comment-16437740 ] yunjiong zhao edited comment on HDFS-13441 at 4/13/18 9:50 PM: --- {quote} BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and BlockKey's life time is much longer than that (can't recall right away) so it's surprising to me a single missing heartbeat would cause the error you mentioned. There's probably something deeper you need to dig into. {quote} A DataNode gets block keys from the NameNode in two places: one is when the DataNode registers with the NameNode; the other is via heartbeat, which by default delivers new keys every 600 minutes. By default, {quote} dfs.block.access.key.update.interval 600 Interval in minutes at which namenode updates its access keys. {quote} {quote} dfs.block.access.token.lifetime 600 The lifetime of access tokens in minutes. {quote} [~jojochuang], I double-checked the code and logs: the DataNode must have missed *two* heartbeats carrying the new block key from the Standby NameNode just before the Standby NameNode became active. We have more than 2000 DataNodes in this cluster; if more than three DataNodes don't have the current block key, in the worst case users will not be able to read some blocks for at most 10 hours. was (Author: zhaoyunjiong): {quote} BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and BlockKey's life time is much longer than that (can't recall right away) so it's surprising to me a single missing heartbeat would cause the error you mentioned. There's probably something deeper you need to dig into. {quote} A DataNode gets block keys from the NameNode in two places: one is when the DataNode registers with the NameNode; the other is via heartbeat, which by default delivers new keys every 600 minutes. By default, {quote} dfs.block.access.key.update.interval 600 Interval in minutes at which namenode updates its access keys. {quote} {quote} dfs.block.access.token.lifetime 600 The lifetime of access tokens in minutes. {quote} Not just any single heartbeat, of course. The missed heartbeat must be one from the Standby NameNode that carried a new block key, and the Standby NameNode must have become Active after that missed heartbeat and before the next heartbeat carrying new block keys. We have more than 2000 DataNodes in this cluster; if more than three DataNodes don't have the current block key, in the worst case users will not be able to read some blocks for at most 10 hours.
[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-13441: - Description: After NameNode failover, lots of applications failed because some DataNodes couldn't re-compute the password from the block token. {code:java} 2018-04-11 20:10:52,448 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error processing unknown operation src: /10.142.74.116:57404 dst: /10.142.77.45:50010 javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=1523538652448, keyId=1762737944, userId=hadoop, blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't exist.] at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=1523538652448, keyId=1762737944, userId=hadoop, blockPoolId=BP-36315570-10.103.108.13-1423055488042, 
blockId=12142862700, access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't exist. at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382) at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) ... 7 more {code} In the DataNode log, we didn't see DataNode update block keys around 2018-04-11 09:55:00 and around 2018-04-11 19:55:00. 
{code:java} 2018-04-10 14:51:36,424 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-10 23:55:38,420 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-11 00:51:34,792 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-11 10:51:39,403 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-11 20:51:44,422 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-12 02:54:47,855 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-12 05:55:44,456 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys {code} The reason is there is SocketTimeOutException when sending heartbeat to StandbyNameNode {code:java} 2018-04-11 09:55:34,699 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.net.SocketTimeoutException: Call From hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to ares-nn.vip.ebay.com:8030 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.142.77.45:48803 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorI
[jira] [Commented] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437740#comment-16437740 ] yunjiong zhao commented on HDFS-13441: -- {quote} BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and BlockKey's life time is much longer than that (can't recall right away) so it's surprising to me a single missing heartbeat would cause the error you mentioned. There's probably something deeper you need to dig into. {quote} A DataNode gets block keys from the NameNode in two places: one is when the DataNode registers with the NameNode; the other is via heartbeat, which by default delivers new keys every 600 minutes. By default, {quote} dfs.block.access.key.update.interval 600 Interval in minutes at which namenode updates its access keys. {quote} {quote} dfs.block.access.token.lifetime 600 The lifetime of access tokens in minutes. {quote} Not just any single heartbeat, of course. The missed heartbeat must be one from the Standby NameNode that carried a new block key, and the Standby NameNode must have become Active after that missed heartbeat and before the next heartbeat carrying new block keys. We have more than 2000 DataNodes in this cluster; if more than three DataNodes don't have the current block key, in the worst case users will not be able to read some blocks for at most 10 hours.
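The timing argument above (600-minute key update interval, 600-minute token lifetime, worst case about 10 hours) can be sketched as a small model. This is a hypothetical illustration, not Hadoop code; the class, method name, and formula are assumptions made for the sketch.

```java
// Hypothetical model, not Hadoop code: worst-case window during which a
// DataNode that missed the key-bearing heartbeats cannot verify block tokens.
public class BlockKeyWindow {

    // keyUpdateIntervalMin models dfs.block.access.key.update.interval (default 600).
    // tokenLifetimeMin models dfs.block.access.token.lifetime (default 600).
    public static long worstCaseUnreadableMinutes(long keyUpdateIntervalMin,
                                                  long tokenLifetimeMin) {
        // The DataNode re-syncs keys at the next key-bearing heartbeat, at most
        // one update interval after the missed one; tokens signed with the
        // missed key stay in circulation for at most their lifetime. The
        // unreadable window is bounded by the smaller of the two.
        return Math.min(keyUpdateIntervalMin, tokenLifetimeMin);
    }

    public static void main(String[] args) {
        // With the defaults discussed in the comment: 600 minutes = 10 hours.
        System.out.println(worstCaseUnreadableMinutes(600, 600) / 60 + " hours");
    }
}
```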
[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-13441: - Status: Patch Available (was: Open) There are two ways to fix this bug: one is to make sure the NameNode sends new BlockKeys to the DataNodes successfully; the other is, when a DataNode can't find a BlockKey, to re-register the DataNode with the NameNodes so that it gets the newest BlockKeys. The first way is much more complex and requires more code changes than the second. The attached patch re-registers the DataNode with the NameNodes. Not tested, just for ideas.
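The second approach described above (re-register when a BlockKey is missing) can be sketched roughly as follows. This is a simplified, hypothetical model, not the actual HDFS-13441 patch; the stub interface and class names here are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual patch: when a block key lookup misses,
// pull the newest keys from the NameNode (modeling the effect of a DataNode
// re-registration) instead of failing the SASL handshake outright.
public class ReRegisterOnMissingKey {

    // Stand-in for the NameNode side of registration: hands out current keys.
    interface NameNodeStub {
        Map<Integer, byte[]> currentBlockKeys();
    }

    static class DataNodeKeyCache {
        private final Map<Integer, byte[]> keys = new HashMap<>();
        private final NameNodeStub nn;

        DataNodeKeyCache(NameNodeStub nn) { this.nn = nn; }

        byte[] retrieveKey(int keyId) {
            byte[] key = keys.get(keyId);
            if (key == null) {
                // Missing key: simulate re-registration to fetch newest keys.
                keys.putAll(nn.currentBlockKeys());
                key = keys.get(keyId);
            }
            return key; // still null if the NameNode no longer has the key
        }
    }

    public static void main(String[] args) {
        NameNodeStub nn = () -> Map.of(42, new byte[]{1, 2, 3});
        DataNodeKeyCache dn = new DataNodeKeyCache(nn);
        // First lookup misses locally and triggers the simulated re-registration.
        System.out.println(dn.retrieveKey(42) != null); // prints true
    }
}
```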
[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-13441: - Attachment: HDFS-13441.patch > DataNode missed BlockKey update from NameNode due to HeartbeatResponse was > dropped > -- > > Key: HDFS-13441 > URL: https://issues.apache.org/jira/browse/HDFS-13441 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 2.7.1 >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Major > Attachments: HDFS-13441.patch > > > After NameNode failover, lots of application failed due to some DataNodes > can't re-compute password from block token. > {code:java} > 2018-04-11 20:10:52,448 ERROR > org.apache.hadoop.hdfs.server.datanode.DataNode: > hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error > processing unknown operation src: /10.142.74.116:57404 dst: > /10.142.77.45:50010 > javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password > [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=1523538652448, > keyId=1762737944, userId=hadoop, > blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, > access modes=[WRITE]), since the required block key (keyID=1762737944) > doesn't exist.] 
> at > com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598) > at > com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't > re-compute password for block_token_identifier (expiryDate=1523538652448, > keyId=1762737944, userId=hadoop, > blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, > access modes=[WRITE]), since the required block key (keyID=1762737944) > doesn't exist. 
> at > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382) > at > org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241) > at > com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) > ... 7 more > {code} > > In the DataNode log, we didn't see DataNode update block keys around > 2018-04-11 09:55. > {code:java} > 2018-04-10 14:51:36,424 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-10 23:55:38,420 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-11 00:51:34,792 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-11 10:51:39,403 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-11 20:51:44,422 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-12 02:54:47,855 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > 2018-04-12 05:55:44,456 INFO > org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting > block keys > {code} > The reason is there is SocketTimeOutException when send heartbeat to > StandbyNameNode > {code:java} > 2018-04-11 
09:55:34,699 WARN org.apache.hadoop.hdf
[jira] [Updated] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
[ https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-13441: - Description: After NameNode failover, lots of application failed due to some DataNodes can't re-compute password from block token. {code:java} 2018-04-11 20:10:52,448 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error processing unknown operation src: /10.142.74.116:57404 dst: /10.142.77.45:50010 javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=1523538652448, keyId=1762737944, userId=hadoop, blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't exist.] at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=1523538652448, keyId=1762737944, userId=hadoop, blockPoolId=BP-36315570-10.103.108.13-1423055488042, 
blockId=12142862700, access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't exist. at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382) at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) ... 7 more {code} In the DataNode log, we didn't see DataNode update block keys around 2018-04-11 09:55. 
{code:java} 2018-04-10 14:51:36,424 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-10 23:55:38,420 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-11 00:51:34,792 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-11 10:51:39,403 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-11 20:51:44,422 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-12 02:54:47,855 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys 2018-04-12 05:55:44,456 INFO org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting block keys {code} The reason is there is SocketTimeOutException when send heartbeat to StandbyNameNode {code:java} 2018-04-11 09:55:34,699 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService java.net.SocketTimeoutException: Call From hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to ares-nn.vip.ebay.com:8030 failed on socket timeout exception: java.net.SocketTimeoutException: 6 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.142.77.45:48803 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.ref
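The failure mode described in this report — block keys reach a DataNode only on occasional key-update heartbeat responses, so a single dropped response leaves the DataNode unable to verify tokens signed with the newest key until the next update hours later — can be illustrated with a toy model. This is a hedged sketch with invented class and method names, not the actual HDFS `BlockTokenSecretManager` or heartbeat code:

```java
import java.util.HashSet;
import java.util.Set;

// Toy model (hypothetical names, NOT HDFS code) of how a dropped
// HeartbeatResponse leaves a DataNode with a stale block-key set.
public class BlockKeySyncDemo {
    private final Set<Integer> nameNodeKeys = new HashSet<>();
    private final Set<Integer> dataNodeKeys = new HashSet<>();

    // The NameNode periodically rolls a new block key.
    void nameNodeRollsKey(int keyId) {
        nameNodeKeys.add(keyId);
    }

    // A key update rides on a heartbeat response. If that response is lost
    // (e.g. the SocketTimeoutException in the log above), the DataNode
    // silently keeps its stale keys until the NEXT update, hours later.
    void keyUpdateHeartbeat(boolean responseDelivered) {
        if (responseDelivered) {
            dataNodeKeys.clear();
            dataNodeKeys.addAll(nameNodeKeys);
        }
    }

    // Stand-in for BlockTokenSecretManager.retrievePassword(): verification
    // fails when the token's keyId is not in the DataNode's cached key set.
    boolean canRecomputePassword(int keyId) {
        return dataNodeKeys.contains(keyId);
    }

    public static void main(String[] args) {
        BlockKeySyncDemo cluster = new BlockKeySyncDemo();
        cluster.nameNodeRollsKey(1);
        cluster.keyUpdateHeartbeat(true);   // DataNode in sync
        cluster.nameNodeRollsKey(2);
        cluster.keyUpdateHeartbeat(false);  // response dropped
        // A client now presents a token signed with the new keyId:
        System.out.println(cluster.canRecomputePassword(2)); // false -> InvalidToken
        cluster.keyUpdateHeartbeat(true);   // next successful update repairs the DN
        System.out.println(cluster.canRecomputePassword(2)); // true
    }
}
```

In the toy model, as in the report, the DataNode recovers on its own at the next key update — which matches the "Setting block keys" log entries resuming at 10:51 while requests in between failed with InvalidToken.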
[jira] [Created] (HDFS-13441) DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped
yunjiong zhao created HDFS-13441: Summary: DataNode missed BlockKey update from NameNode due to HeartbeatResponse was dropped Key: HDFS-13441 URL: https://issues.apache.org/jira/browse/HDFS-13441 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 2.7.1 Reporter: yunjiong zhao Assignee: yunjiong zhao After NameNode failover, lots of application failed due to some DataNodes can't re-compute password from block token.2018-04-11 20:10:52,448 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error processing unknown operation src: /10.142.74.116:57404 dst: /10.142.77.45:50010 javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=1523538652448, keyId=1762737944, userId=hadoop, blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't exist.] 
at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598) at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=1523538652448, keyId=1762737944, userId=hadoop, blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, access modes=[WRITE]), since the required block key (keyID=1762737944) doesn't exist. 
at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382) at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297) at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241) at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589) ... 7 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995826#comment-15995826 ] yunjiong zhao commented on HDFS-11384: -- [~shv] Thanks for the fix. > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: Konstantin Shvachko > Fix For: 2.9.0, 2.7.4, 3.0.0-alpha3, 2.8.2 > > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch, HDFS-11384.003.patch, > HDFS-11384.004.patch, HDFS-11384.005.patch, HDFS-11384.006.patch, > HDFS-11384-007.patch, HDFS-11384.008.patch, HDFS-11384.009.patch, > HDFS-11384.010.patch, HDFS-11384.011.patch, HDFS-11384-branch-2.7.011.patch, > HDFS-11384-branch-2.8.011.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933354#comment-15933354 ] yunjiong zhao commented on HDFS-11384: -- Thanks [~shv] for the review. Only when dfs.balancer.getBlocks.interval.millis is set to a non-zero value will the Balancer restrict {code}getBlocks(){code} to one thread at a time; otherwise this patch doesn't change anything, so there is really only one change. If we used wait, it would release the lock, so we couldn't guarantee that only one thread calls {code}getBlocks(){code}. By default this patch doesn't change anything, so if you need to run the Balancer aggressively, don't set dfs.balancer.getBlocks.interval.millis. {quote} Can we add some heuristics so that the Balancer could adjust by itself instead of adding the configuration parameter {quote} I thought about this before. The best approach I could think of is to add a new function in IPC that lets clients read the CallQueueLength; if the CallQueueLength is too high, block getBlocks() until the CallQueueLength becomes normal again.
> Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
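The one-thread-at-a-time behavior discussed in this comment can be sketched as follows. This is a minimal illustration of the locking idea, not the actual HDFS-11384 patch; the class and field names are invented. The key point is that Thread.sleep() keeps the monitor held, whereas Object.wait() would release it and let other threads in:

```java
// Hypothetical sketch (not the committed patch): serialize getBlocks() by
// sleeping while still holding the monitor. Thread.sleep() does NOT release
// the lock, so at most one thread is ever inside the guarded block.
public class GetBlocksThrottle {
    private final Object getBlocksLock = new Object();
    private final long intervalMillis; // stands in for dfs.balancer.getBlocks.interval.millis
    private int inside = 0;            // threads currently in the guarded section
    private int maxInside = 0;         // peak concurrency ever observed

    GetBlocksThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    void getBlocks() {
        synchronized (getBlocksLock) {
            inside++;
            maxInside = Math.max(maxInside, inside);
            try {
                // Issue the RPC here, then pause before releasing the lock
                // so the next caller is delayed by the configured interval.
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            inside--;
        }
    }

    /** Runs `threads` concurrent callers and reports the peak concurrency seen. */
    static int runDemo(int threads, long intervalMillis) {
        GetBlocksThrottle throttle = new GetBlocksThrottle(intervalMillis);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(throttle::getBlocks);
            workers[i].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return throttle.maxInside;
    }

    public static void main(String[] args) {
        // Prints 1: the monitor never admits two callers at once.
        System.out.println("peak concurrent getBlocks: " + runDemo(5, 10));
    }
}
```

Had the sleep used getBlocksLock.wait(intervalMillis) instead, the monitor would be released for the duration of the wait and several dispatcher threads could issue getBlocks RPCs simultaneously — exactly the spike the option is meant to prevent.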
[jira] [Comment Edited] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933354#comment-15933354 ] yunjiong zhao edited comment on HDFS-11384 at 3/20/17 7:26 PM: --- Thanks [~shv] for review. Only when you set dfs.balancer.getBlocks.interval.millis to non-zero, Balancer will only allow one thread to issue getBlocks()at any given time. Otherwise this patch doesn't change anything. So only one change actually. If use wait, it will release the lock, so can't make sure there are only one thread will call getBlocks(). By default, this patch doesn't change anything. So if you need run Balancer aggressively, don't set dfs.balancer.getBlocks.interval.millis. {quote} Can we add some heuristics so that the Balancer could adjust by itself instead of adding the configuration parameter {quote} I though this before. The best way I can thought is add new function in IPC that let clients get the CallQueueLength, if CallQueueLength is too high, block getBlocks() until the CallQueueLength become normal again. was (Author: zhaoyunjiong): Thanks [~shv] for review. Only when you set dfs.balancer.getBlocks.interval.millis to non-zero, Balancer will only allow one thread to issue {code}getBlocks(){code} at any given time. Otherwise this patch doesn't change anything. So only one change actually. If use wait, it will release the lock, so can't make sure there are only one thread will call {code}getBlocks(){code}. By default, this patch doesn't change anything. So if you need run Balancer aggressively, don't set dfs.balancer.getBlocks.interval.millis. {quote} Can we add some heuristics so that the Balancer could adjust by itself instead of adding the configuration parameter {quote} I though this before. The best way I can thought is add new function in IPC that let clients get the CallQueueLength, if CallQueueLength is too high, block getBlocks() until the CallQueueLength become normal again. 
> Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Attachment: HDFS-11384.002.patch > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Attachment: (was: HDFS-11384.002.patch) > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Attachment: HDFS-11384.002.patch Use a Semaphore instead of a lock to avoid the findbugs warning.
> Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.002.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
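A minimal sketch of the Semaphore variant mentioned above (hypothetical names, not the committed patch). FindBugs flags Thread.sleep() inside a synchronized block (likely the SWL_SLEEP_WITH_LOCK_HELD pattern), so sleeping while holding a single fair Semaphore permit gives the same one-at-a-time guarantee without the warning:

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch of the Semaphore version. One fair permit serializes
// getBlocks() callers exactly like a monitor would, but the sleep happens
// with no lock held, so FindBugs has nothing to complain about.
public class SemaphoreGetBlocksThrottle {
    private final Semaphore permit = new Semaphore(1, true); // fair => FIFO waiters
    private final long intervalMillis; // stands in for dfs.balancer.getBlocks.interval.millis
    private int inside = 0;
    private int maxInside = 0;

    SemaphoreGetBlocksThrottle(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    void getBlocks() {
        permit.acquireUninterruptibly(); // only one caller holds the permit
        try {
            synchronized (this) { inside++; maxInside = Math.max(maxInside, inside); }
            // Issue the RPC here, then pause before letting the next caller go.
            Thread.sleep(intervalMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            synchronized (this) { inside--; }
            permit.release();
        }
    }

    /** Runs `threads` concurrent callers and reports the peak concurrency seen. */
    static int runDemo(int threads, long intervalMillis) {
        SemaphoreGetBlocksThrottle throttle = new SemaphoreGetBlocksThrottle(intervalMillis);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(throttle::getBlocks);
            workers[i].start();
        }
        for (Thread w : workers) {
            try { w.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return throttle.maxInside;
    }

    public static void main(String[] args) {
        // Prints 1: the permit never admits two callers at once.
        System.out.println("peak concurrent getBlocks: " + runDemo(5, 10));
    }
}
```

The fairness flag matters here: with 200 dispatcher threads contending, a fair semaphore hands out permits in FIFO order instead of letting one thread starve the others.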
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Attachment: (was: HDFS-11384.001.patch) > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Attachment: HDFS-11384.001.patch > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, > HDFS-11384.001.patch, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891198#comment-15891198 ] yunjiong zhao commented on HDFS-11384: -- Thank you [~benoyantony] for taking the time to review this patch. {quote}Sleeping inside the Synchronized block should be avoided as it will prevent other threads from obtaining the lock while the thread is sleeping. {quote} Sleeping inside the synchronized block is deliberate. In the Balancer there are multiple threads (200 by default) that may call getBlocks at the same time; if a user sets dfs.balancer.getBlocks.interval.millis to slow the Balancer down, it won't work well without a lock, because in the worst case 200 getBlocks calls would still reach the NameNode at the same time. {quote}It will be better to keep track of the interval between successive getBlocks and sleep only for the required time. {quote} Since this patch changes nothing by default and only adds an option that lets users slow down how fast the Balancer sends getBlocks to the NameNode, I'd like to keep it as simple as possible.
> Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
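The reviewer's alternative — remembering when the previous getBlocks happened and sleeping only for the remaining part of the interval — could look roughly like this. A hypothetical sketch, not what was committed:

```java
// Hypothetical sketch of the "sleep only for the required time" idea. The
// caller sleeps for millisToWait(now) outside any lock; the tracker itself
// only does cheap bookkeeping, so holding its monitor briefly is harmless.
public class GetBlocksInterval {
    private final long intervalMillis;
    private long nextAllowedTime = Long.MIN_VALUE / 2; // first call never waits

    GetBlocksInterval(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** How long the caller arriving at time `now` must sleep to honor the interval. */
    synchronized long millisToWait(long now) {
        long wait = Math.max(0, nextAllowedTime - now);
        // Reserve the slot this caller will actually use, interval after it.
        nextAllowedTime = now + wait + intervalMillis;
        return wait;
    }

    public static void main(String[] args) {
        GetBlocksInterval t = new GetBlocksInterval(10);
        System.out.println(t.millisToWait(0));  // 0: first call goes immediately
        System.out.println(t.millisToWait(3));  // 7: only the remaining 7 ms
        System.out.println(t.millisToWait(20)); // 0: the interval already elapsed
    }
}
```

This keeps the NameNode-facing rate identical but avoids penalizing a caller that arrives after the interval has already passed; the trade-off is slightly more state, which is presumably why the simpler hold-the-lock version was preferred.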
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Attachment: balancer.day.png balancer.week.png HDFS-11384.001.patch This patch provides an option that blocks the Balancer for dfs.balancer.getBlocks.interval.millis milliseconds after every getBlocks RPC call. The attached pictures show the improvement after applying this patch to our production cluster around Thursday 15:00.
> Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
[ https://issues.apache.org/jira/browse/HDFS-11384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11384: - Status: Patch Available (was: Open) > Add option for balancer to disperse getBlocks calls to avoid NameNode's > rpc.CallQueueLength spike > - > > Key: HDFS-11384 > URL: https://issues.apache.org/jira/browse/HDFS-11384 > Project: Hadoop HDFS > Issue Type: Improvement > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: balancer.day.png, balancer.week.png, HDFS-11384.001.patch > > > When running balancer on hadoop cluster which have more than 3000 Datanodes > will cause NameNode's rpc.CallQueueLength spike. We observed this situation > could cause Hbase cluster failure due to RegionServer's WAL timeout. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-11384) Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike
yunjiong zhao created HDFS-11384: Summary: Add option for balancer to disperse getBlocks calls to avoid NameNode's rpc.CallQueueLength spike Key: HDFS-11384 URL: https://issues.apache.org/jira/browse/HDFS-11384 Project: Hadoop HDFS Issue Type: Improvement Components: balancer & mover Affects Versions: 2.7.3 Reporter: yunjiong zhao Assignee: yunjiong zhao Running the balancer on a Hadoop cluster with more than 3000 DataNodes causes the NameNode's rpc.CallQueueLength to spike. We observed that this situation could cause HBase cluster failures due to RegionServer WAL timeouts.
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-11377) Balancer hung due to no available mover threads
[ https://issues.apache.org/jira/browse/HDFS-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15848792#comment-15848792 ] yunjiong zhao edited comment on HDFS-11377 at 2/1/17 7:03 PM: -- Removed unused variable MAX_NO_PENDING_MOVE_ITERATIONS. Thanks [~linyiqun] for your time. was (Author: zhaoyunjiong): Remove unused variable MAX_NO_PENDING_MOVE_ITERATIONS. Thanks [~linyiqun] for your time. > Balancer hung due to no available mover threads > --- > > Key: HDFS-11377 > URL: https://issues.apache.org/jira/browse/HDFS-11377 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-11377.001.patch, HDFS-11377.002.patch > > > When running balancer on large cluster which have more than 3000 Datanodes, > it might be hung due to "No mover threads available". > The stack trace shows it waiting forever like below. > {code} > "main" #1 prio=5 os_prio=0 tid=0x7ff6cc014800 nid=0x6b2c waiting on > condition [0x7ff6d1bad000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905) > {code} > In the log, there are lots of WARN about "No mover threads available". 
> {quote} > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13700554102_1112815018180 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_4009558842_1103118359883 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13881956058_1112996460026 with size=133509566 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010 > {quote} > What happened here is, when there are no mover threads available, > DDatanode.isPendingQEmpty() will return false, so Balancer hung. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
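The failure mode described above can be sketched minimally as follows. All names here (MoverSketch, pendingQ, dispatch) are illustrative, not the actual Dispatcher code: a move added to the pending queue is never removed when the mover pool rejects it, so a wait loop keyed on the queue never exits; the patch's direction is to undo the enqueue on rejection.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of the bug and fix direction, not Hadoop's code.
public class MoverSketch {
    static final Queue<String> pendingQ = new ArrayDeque<>();

    // Attempt to dispatch a block move; threadAvailable stands in for
    // whether the bounded mover thread pool could accept the task.
    static boolean dispatch(String block, boolean threadAvailable) {
        pendingQ.add(block);
        if (!threadAvailable) {
            // The patch's idea: remove the PendingMove again when no
            // mover thread is available, so the pending queue can drain.
            pendingQ.remove(block);
            return false; // "No mover threads available"
        }
        return true;
    }

    // Stand-in for DDatanode.isPendingQEmpty(); the Balancer's wait
    // loop only finishes once this becomes true.
    static boolean isPendingQEmpty() {
        return pendingQ.isEmpty();
    }

    public static void main(String[] args) {
        dispatch("blk_13700554102", false); // rejected move
        // Without the removal above this would print "false" forever,
        // which is the hang seen in waitForMoveCompletion().
        System.out.println(isPendingQEmpty()); // prints "true"
    }
}
```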
[jira] [Updated] (HDFS-11377) Balancer hung due to no available mover threads
[ https://issues.apache.org/jira/browse/HDFS-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11377: - Attachment: HDFS-11377.002.patch Remove unused variable MAX_NO_PENDING_MOVE_ITERATIONS. Thanks [~linyiqun] for your time. > Balancer hung due to no available mover threads > --- > > Key: HDFS-11377 > URL: https://issues.apache.org/jira/browse/HDFS-11377 > Project: Hadoop HDFS > Issue Type: Bug > Components: balancer & mover >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-11377.001.patch, HDFS-11377.002.patch > > > When running balancer on large cluster which have more than 3000 Datanodes, > it might be hung due to "No mover threads available". > The stack trace shows it waiting forever like below. > {code} > "main" #1 prio=5 os_prio=0 tid=0x7ff6cc014800 nid=0x6b2c waiting on > condition [0x7ff6d1bad000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905) > {code} > In the log, there are lots of WARN about "No mover threads available". 
> {quote} > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13700554102_1112815018180 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_4009558842_1103118359883 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13881956058_1112996460026 with size=133509566 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010 > {quote} > What happened here is, when there are no mover threads available, > DDatanode.isPendingQEmpty() will return false, so Balancer hung. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11377) Balancer hung due to "No mover threads available"
[ https://issues.apache.org/jira/browse/HDFS-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11377: - Status: Patch Available (was: Open) > Balancer hung due to "No mover threads available" > - > > Key: HDFS-11377 > URL: https://issues.apache.org/jira/browse/HDFS-11377 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-11377.001.patch > > > When running balancer on large cluster which have more than 3000 Datanodes, > it might be hung due to "No mover threads available". > The stack trace shows it waiting forever like below. > {code} > "main" #1 prio=5 os_prio=0 tid=0x7ff6cc014800 nid=0x6b2c waiting on > condition [0x7ff6d1bad000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905) > {code} > In the log, there are lots of WARN about "No mover threads available". 
> {quote} > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13700554102_1112815018180 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_4009558842_1103118359883 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13881956058_1112996460026 with size=133509566 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010 > {quote} > What happened here is, when there are no mover threads available, > DDatanode.isPendingQEmpty() will return false, so Balancer hung. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-11377) Balancer hung due to "No mover threads available"
[ https://issues.apache.org/jira/browse/HDFS-11377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-11377: - Attachment: HDFS-11377.001.patch This patch removes the PendingMove after "No mover threads available" occurs. Setting dfs.balancer.moverThreads to a value bigger than dfs.datanode.balance.max.concurrent.moves * also works. > Balancer hung due to "No mover threads available" > - > > Key: HDFS-11377 > URL: https://issues.apache.org/jira/browse/HDFS-11377 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.7.3 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-11377.001.patch > > > When running balancer on large cluster which have more than 3000 Datanodes, > it might be hung due to "No mover threads available". > The stack trace shows it waiting forever like below. > {code} > "main" #1 prio=5 os_prio=0 tid=0x7ff6cc014800 nid=0x6b2c waiting on > condition [0x7ff6d1bad000] >java.lang.Thread.State: TIMED_WAITING (sleeping) > at java.lang.Thread.sleep(Native Method) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017) > at > org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663) > at > org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > at > org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905) > {code} > In the log, there are lots of WARN about "No mover threads available". 
> {quote} > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13700554102_1112815018180 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_4009558842_1103118359883 with size=268435456 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through > 10.115.67.137:50010 > 2017-01-26 15:36:40,085 WARN > org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads > available: skip moving blk_13881956058_1112996460026 with size=133509566 from > 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010 > {quote} > What happened here is, when there are no mover threads available, > DDatanode.isPendingQEmpty() will return false, so Balancer hung. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-11377) Balancer hung due to "No mover threads available"
yunjiong zhao created HDFS-11377: Summary: Balancer hung due to "No mover threads available" Key: HDFS-11377 URL: https://issues.apache.org/jira/browse/HDFS-11377 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.3 Reporter: yunjiong zhao Assignee: yunjiong zhao When running the balancer on a large cluster with more than 3,000 DataNodes, it might hang due to "No mover threads available". The stack trace below shows it waiting forever. {code} "main" #1 prio=5 os_prio=0 tid=0x7ff6cc014800 nid=0x6b2c waiting on condition [0x7ff6d1bad000] java.lang.Thread.State: TIMED_WAITING (sleeping) at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.hdfs.server.balancer.Dispatcher.waitForMoveCompletion(Dispatcher.java:1043) at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchBlockMoves(Dispatcher.java:1017) at org.apache.hadoop.hdfs.server.balancer.Dispatcher.dispatchAndCheckContinue(Dispatcher.java:981) at org.apache.hadoop.hdfs.server.balancer.Balancer.runOneIteration(Balancer.java:611) at org.apache.hadoop.hdfs.server.balancer.Balancer.run(Balancer.java:663) at org.apache.hadoop.hdfs.server.balancer.Balancer$Cli.run(Balancer.java:776) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at org.apache.hadoop.hdfs.server.balancer.Balancer.main(Balancer.java:905) {code} In the log, there are lots of WARN messages about "No mover threads available". 
{quote} 2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_13700554102_1112815018180 with size=268435456 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.137:50010 2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_4009558842_1103118359883 with size=268435456 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.137:50010 2017-01-26 15:36:40,085 WARN org.apache.hadoop.hdfs.server.balancer.Dispatcher: No mover threads available: skip moving blk_13881956058_1112996460026 with size=133509566 from 10.115.67.137:50010:DISK to 10.140.21.55:50010:DISK through 10.115.67.36:50010 {quote} What happened here is that when there are no mover threads available, DDatanode.isPendingQEmpty() will return false, so the Balancer hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10831) Add log when URLConnectionFactory.openConnection failed
[ https://issues.apache.org/jira/browse/HDFS-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10831: - Status: Patch Available (was: Open) > Add log when URLConnectionFactory.openConnection failed > --- > > Key: HDFS-10831 > URL: https://issues.apache.org/jira/browse/HDFS-10831 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10831.001.patch > > > When I try to use swebhdfs, due to client missing a certificate in > truststore, it failed, but from client I don't see any warning or error log, > I can only find it on datanode side, which is not convenient. > Add log can help user know what happened without check server side log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10831) Add log when URLConnectionFactory.openConnection failed
[ https://issues.apache.org/jira/browse/HDFS-10831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10831: - Attachment: HDFS-10831.001.patch Since this patch only adds one line of code to log error information, there is no need to add any test. > Add log when URLConnectionFactory.openConnection failed > --- > > Key: HDFS-10831 > URL: https://issues.apache.org/jira/browse/HDFS-10831 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10831.001.patch > > > When I try to use swebhdfs, due to client missing a certificate in > truststore, it failed, but from client I don't see any warning or error log, > I can only find it on datanode side, which is not convenient. > Add log can help user know what happened without check server side log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
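A hedged sketch of the one-line fix's intent (OpenConnectionSketch and describeFailure are illustrative names, not Hadoop's actual URLConnectionFactory code): log the target URL and the cause on the client side before rethrowing, so the failure is visible without checking the DataNode log.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical sketch; only the pattern matches the patch's intent.
public class OpenConnectionSketch {

    // Builds the message the added log line would emit.
    static String describeFailure(String url, Exception e) {
        return "Failed to open connection to " + url + ": " + e.getMessage();
    }

    static URLConnection openConnection(URL url) throws IOException {
        try {
            return url.openConnection();
        } catch (IOException e) {
            // The added line: surface the failure in the client log
            // instead of leaving it visible only on the server side.
            System.err.println(describeFailure(url.toString(), e));
            throw e;
        }
    }

    public static void main(String[] args) {
        System.out.println(describeFailure("swebhdfs://nn:50470/",
            new IOException("unable to find valid certification path")));
    }
}
```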
[jira] [Created] (HDFS-10831) Add log when URLConnectionFactory.openConnection failed
yunjiong zhao created HDFS-10831: Summary: Add log when URLConnectionFactory.openConnection failed Key: HDFS-10831 URL: https://issues.apache.org/jira/browse/HDFS-10831 Project: Hadoop HDFS Issue Type: Improvement Reporter: yunjiong zhao Assignee: yunjiong zhao When I tried to use swebhdfs, it failed because the client was missing a certificate in its truststore, but on the client side I didn't see any warning or error log; I could only find the error on the DataNode side, which is not convenient. Adding a log message helps users know what happened without checking the server-side log. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Attachment: HDFS-10477.005.patch Updated the patch to fix the unit test. When called by tests like TestDefaultBlockPlacementPolicy.testPlacementWithLocalRackNodesDecommissioned, the caller might not hold the write lock. Thanks [~rakeshr] for the suggestion on the InterruptedException. The handler thread is a daemon thread, but you are right, it's better to call Thread.currentThread().interrupt() to preserve the interrupt status. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.004.patch, HDFS-10477.005.patch, HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during 
recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop
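The InterruptedException handling discussed in the comment above can be sketched as follows; this is a minimal illustration of the pattern, not the patch itself. Swallowing the exception clears the thread's interrupt flag, so the fix is to restore it before returning.

```java
// Minimal sketch of restoring interrupt status after a caught
// InterruptedException; not the actual DatanodeManager code.
public class InterruptSketch {
    static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            // Re-set the flag so callers up the stack can still
            // observe that the thread was interrupted.
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        Thread.currentThread().interrupt(); // simulate an interrupt
        boolean slept = sleepQuietly(10);
        // The interrupt was restored, so a caller can still see it.
        System.out.println(slept + " " + Thread.interrupted()); // prints "false true"
    }
}
```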
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Attachment: HDFS-10477.004.patch Updated the patch with the changes below: 1. Release the lock after finishing processing one storage. 2. Sleep 1 millisecond before trying to acquire the lock again. Thanks [~arpiagariu] and [~kihwal]. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.004.patch, HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 
10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.14:1004 > 2016-05-26 20:13:25,369 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 280219 over-replicated blocks on 10.142.27.14:1004 during rec
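The locking change described in the comment above (release after each storage, then a 1 ms pause) can be sketched like this. The names writeLock, processStorages, and the storage IDs are illustrative, not FSNamesystem's actual API; the point is that yielding the lock between storages lets RPC handlers and the HA health monitor get in, avoiding the long lock hold that caused the failover.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hedged sketch of per-storage lock chunking; not Hadoop's code.
public class LockChunkingSketch {
    static final ReentrantLock writeLock = new ReentrantLock();
    static int storagesProcessed = 0;

    static void processStorages(List<String> storages) {
        for (String storage : storages) {
            writeLock.lock();            // take the lock per storage...
            try {
                storagesProcessed++;     // ...invalidate its blocks...
            } finally {
                writeLock.unlock();      // ...and release right after
            }
            try {
                Thread.sleep(1);         // 1 ms pause before re-acquiring
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }

    public static void main(String[] args) {
        processStorages(Arrays.asList("DS-1", "DS-2", "DS-3"));
        System.out.println(storagesProcessed); // prints "3"
    }
}
```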
[jira] [Commented] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366966#comment-15366966 ] yunjiong zhao commented on HDFS-10477: -- Those failed unit tests are not related to this patch. And there is no need to add a new unit test for this patch, since it only adds steps to release the nameSystem writeLock and then acquire the lock again. > Stop decommission a rack of DataNodes caused NameNode fail over to standby > -- > > Key: HDFS-10477 > URL: https://issues.apache.org/jira/browse/HDFS-10477 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 2.7.2 >Reporter: yunjiong zhao >Assignee: yunjiong zhao > Attachments: HDFS-10477.002.patch, HDFS-10477.003.patch, > HDFS-10477.patch > > > In our cluster, when we stop decommissioning a rack which have 46 DataNodes, > it locked Namesystem for about 7 minutes as below log shows: > {code} > 2016-05-26 20:11:41,697 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.27:1004 > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning > 2016-05-26 20:11:51,171 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.118:1004 > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning > 2016-05-26 20:11:59,972 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.113:1004 > 2016-05-26 20:12:09,007 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning > 2016-05-26 20:12:09,008 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: 
Stop > Decommissioning 10.142.27.117:1004 > 2016-05-26 20:12:18,055 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning > 2016-05-26 20:12:18,056 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.130:1004 > 2016-05-26 20:12:25,938 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning > 2016-05-26 20:12:25,939 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.121:1004 > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning > 2016-05-26 20:12:34,134 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.33:1004 > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning > 2016-05-26 20:12:43,020 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.137:1004 > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning > 2016-05-26 20:12:52,220 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.51:1004 > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning > 2016-05-26 20:13:00,362 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.12:1004 > 2016-05-26 20:13:08,756 INFO > 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning > 2016-05-26 20:13:08,757 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.15:1004 > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning > 2016-05-26 20:13:17,185 INFO > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop > Decommissioning 10.142.27.14:1004 > 2016-05-26 20:13:25,369 INFO > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated > 280219 over-replicated blocks on 1
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Status: Open (was: Patch Available)
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Status: Patch Available (was: Open)
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Attachment: HDFS-10477.003.patch Updated the patch according to the comments. Thanks [~benoyantony].
[jira] [Commented] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15314451#comment-15314451 ] yunjiong zhao commented on HDFS-10477: -- [~kihwal], what's your opinion on the second patch? Any suggestions? Should I combine the two patches? I mean, with the second patch alone, if no DataNode has more blocks than DFS_BLOCK_MISREPLICATION_PROCESSING_LIMIT but there are a lot of DataNodes, we might still have trouble.
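A minimal sketch of the concern above, not Hadoop's actual BlockManager code: if the lock is only released once a single DataNode's scanned-block count crosses the processing limit, then 46 nodes that each stay under the limit never trigger a release, whereas a counter carried across nodes does. All numbers and names here are illustrative.

```java
// Sketch only: a per-node limit alone never yields the lock when every node
// is under the limit; a counter shared across nodes does.
public class CombinedYield {
    // Returns how many times the (hypothetical) write lock would be released
    // if a shared counter is reset each time it crosses the processing limit.
    static int releasesNeeded(int nodes, int blocksPerNode, int limit) {
        int scannedSinceRelease = 0;
        int releases = 0;
        for (int n = 0; n < nodes; n++) {
            scannedSinceRelease += blocksPerNode;  // blocks scanned for this node
            if (scannedSinceRelease >= limit) {    // cross-node counter triggers a yield
                releases++;
                scannedSinceRelease = 0;
            }
        }
        return releases;
    }

    public static void main(String[] args) {
        // 46 DataNodes, each under an illustrative limit of 10000 blocks:
        // a per-node check alone would release the lock 0 times.
        System.out.println(releasesNeeded(46, 8000, 10000)); // prints 23
    }
}
```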
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Attachment: HDFS-10477.002.patch [~kihwal] Good idea, thanks. We can release the lock in processExtraRedundancyBlocksOnReCommission after it has scanned numBlocksPerIteration (default is 1) blocks.
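The chunking idea above can be sketched as follows. This is an illustration under stated assumptions, not the real processExtraRedundancyBlocksOnReCommission: the lock object, the block list, and processBlocksInChunks are stand-ins.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative chunked processing: hold the write lock for at most
// numBlocksPerIteration blocks at a time, releasing it between chunks so
// queued handlers can run. Not the real BlockManager API.
public class ChunkedRecommission {
    static final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();

    static int processBlocksInChunks(List<String> blocks, int numBlocksPerIteration) {
        int invalidated = 0;
        int i = 0;
        while (i < blocks.size()) {
            namesystemLock.writeLock().lock();
            try {
                int end = Math.min(i + numBlocksPerIteration, blocks.size());
                for (; i < end; i++) {
                    invalidated++;  // stand-in for invalidating one extra replica
                }
            } finally {
                // other threads can acquire the lock between chunks
                namesystemLock.writeLock().unlock();
            }
        }
        return invalidated;
    }

    public static void main(String[] args) {
        List<String> blocks = List.of("b1", "b2", "b3", "b4", "b5");
        System.out.println(processBlocksInChunks(blocks, 2)); // prints 5
    }
}
```

With a small numBlocksPerIteration the lock is held only briefly at a time, at the cost of more acquire/release cycles.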
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Attachment: HDFS-10477.patch This patch releases the write lock after stopping decommission of each DataNode, so other handlers get a chance to acquire the write lock and the Namesystem is not locked for too long.
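A minimal sketch of the per-node lock release described above, assuming a ReentrantReadWriteLock as a stand-in for the namesystem lock; stopDecommissionRack and the node list are hypothetical, not the actual DatanodeManager code.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative per-node lock release: acquire and drop the write lock once
// per DataNode instead of holding it across the whole rack.
public class PerNodeLockRelease {
    static final ReentrantReadWriteLock namesystemLock = new ReentrantReadWriteLock();

    static int stopDecommissionRack(String[] rack) {
        int processed = 0;
        for (String node : rack) {
            namesystemLock.writeLock().lock();
            try {
                processed++;  // stand-in for stopping decommission of one node
            } finally {
                namesystemLock.writeLock().unlock();
            }
            // waiting handlers (and health checks) can acquire the lock here
        }
        return processed;
    }

    public static void main(String[] args) {
        String[] rack = {"10.142.27.27:1004", "10.142.27.118:1004", "10.142.27.113:1004"};
        System.out.println(stopDecommissionRack(rack)); // prints 3
    }
}
```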
[jira] [Updated] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
[ https://issues.apache.org/jira/browse/HDFS-10477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-10477: - Status: Patch Available (was: Open)
[jira] [Created] (HDFS-10477) Stop decommission a rack of DataNodes caused NameNode fail over to standby
yunjiong zhao created HDFS-10477: Summary: Stop decommission a rack of DataNodes caused NameNode fail over to standby Key: HDFS-10477 URL: https://issues.apache.org/jira/browse/HDFS-10477 Project: Hadoop HDFS Issue Type: Bug Components: namenode Affects Versions: 2.7.2 Reporter: yunjiong zhao Assignee: yunjiong zhao
In our cluster, when we stopped decommissioning a rack with 46 DataNodes, it locked the Namesystem for about 7 minutes, as the log below shows:
{code}
2016-05-26 20:11:41,697 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.27:1004
2016-05-26 20:11:51,171 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 285258 over-replicated blocks on 10.142.27.27:1004 during recommissioning
2016-05-26 20:11:51,171 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.118:1004
2016-05-26 20:11:59,972 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 279923 over-replicated blocks on 10.142.27.118:1004 during recommissioning
2016-05-26 20:11:59,972 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.113:1004
2016-05-26 20:12:09,007 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 294307 over-replicated blocks on 10.142.27.113:1004 during recommissioning
2016-05-26 20:12:09,008 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.117:1004
2016-05-26 20:12:18,055 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 314381 over-replicated blocks on 10.142.27.117:1004 during recommissioning
2016-05-26 20:12:18,056 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.130:1004
2016-05-26 20:12:25,938 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 272779 over-replicated blocks on 10.142.27.130:1004 during recommissioning
2016-05-26 20:12:25,939 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.121:1004
2016-05-26 20:12:34,134 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 287248 over-replicated blocks on 10.142.27.121:1004 during recommissioning
2016-05-26 20:12:34,134 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.33:1004
2016-05-26 20:12:43,020 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 299868 over-replicated blocks on 10.142.27.33:1004 during recommissioning
2016-05-26 20:12:43,020 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.137:1004
2016-05-26 20:12:52,220 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 303914 over-replicated blocks on 10.142.27.137:1004 during recommissioning
2016-05-26 20:12:52,220 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.51:1004
2016-05-26 20:13:00,362 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 281175 over-replicated blocks on 10.142.27.51:1004 during recommissioning
2016-05-26 20:13:00,362 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.12:1004
2016-05-26 20:13:08,756 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 274880 over-replicated blocks on 10.142.27.12:1004 during recommissioning
2016-05-26 20:13:08,757 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.15:1004
2016-05-26 20:13:17,185 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 286334 over-replicated blocks on 10.142.27.15:1004 during recommissioning
2016-05-26 20:13:17,185 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.14:1004
2016-05-26 20:13:25,369 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 280219 over-replicated blocks on 10.142.27.14:1004 during recommissioning
2016-05-26 20:13:25,370 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.28:1004
2016-05-26 20:13:33,768 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 280623 over-replicated blocks on 10.142.27.28:1004 during recommissioning
2016-05-26 20:13:33,769 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager: Stop Decommissioning 10.142.27.119:1004
2016-05-26 20:13:42,816 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Invalidated 294675 over-replicated blocks on 10.142.27.119:1004 during recommissioning
2016-05-26 20:13:42,816 INFO or
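The lock hold above comes from recommissioning all 46 DataNodes inside a single Namesystem write-lock acquisition. One way to bound the hold time, sketched below with hypothetical names (ChunkedRecommission and stopDecommission stand in for the real DatanodeManager code; this is not the attached patch), is to take and release the write lock per DataNode so other operations can interleave:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ChunkedRecommission {
    // Stand-in for the FSNamesystem lock.
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    public final List<String> processed = new ArrayList<>();

    // Hypothetical per-node work, e.g. invalidating over-replicated blocks.
    public void stopDecommission(String node) {
        processed.add(node);
    }

    // Instead of holding the write lock across the whole rack, acquire and
    // release it once per DataNode so RPC handlers can get in between nodes.
    public void stopDecommissionRack(List<String> nodes) {
        for (String node : nodes) {
            lock.writeLock().lock();
            try {
                stopDecommission(node);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }
}
```

The trade-off is that the namespace can change between nodes, so each per-node step has to be safe to run against state updated by interleaved operations.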
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Attachment: HDFS-9959.5.patch Thanks, Arpit Agarwal. Updated the patch according to Arpit Agarwal's comments. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.2.patch, HDFS-9959.3.patch, > HDFS-9959.3.withtest.patch, HDFS-9959.4.patch, HDFS-9959.5.patch, > HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Attachment: HDFS-9959.4.patch Thanks, Tsz Wo Nicholas Sze, for reviewing the patch. Updated the patch to use Object. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.2.patch, HDFS-9959.3.patch, > HDFS-9959.3.withtest.patch, HDFS-9959.4.patch, HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Attachment: HDFS-9959.3.withtest.patch HDFS-9959.3.patch Updated the patch according to the comments. I used the code below to run a test, but I'm not sure whether I should add this test case, since I verified it manually against the output; let me know whether we can mock the logger. So I attached two patches, one with the test case and one without.
{code}
@Test
public void testMissingBlockLog() {
  BlocksMap.MissingBlockLog.init();
  for (long l = 1L; l < 32L; l++) {
    BlocksMap.MissingBlockLog.add(new Block(l));
  }
  BlocksMap.MissingBlockLog.log(
      new DatanodeID("127.0.0.1", "localhost", "uuid-9959", 9959, 9960, 9961, 9962));
}
{code}
The output this test case generated looks like this:
{quote}
$ cat org.apache.hadoop.hdfs.server.blockmanagement.TestBlocksMap-output.txt
2016-03-22 11:34:25,758 [main] WARN BlockStateChange (BlocksMap.java:log(73)) - After removed 127.0.0.1:9959, no live nodes contain the following 10 blocks: blk_1_0 blk_2_0 blk_3_0 blk_4_0 blk_5_0 blk_6_0 blk_7_0 blk_8_0 blk_9_0 blk_10_0
2016-03-22 11:34:25,761 [main] WARN BlockStateChange (BlocksMap.java:log(73)) - After removed 127.0.0.1:9959, no live nodes contain the following 10 blocks: blk_11_0 blk_12_0 blk_13_0 blk_14_0 blk_15_0 blk_16_0 blk_17_0 blk_18_0 blk_19_0 blk_20_0
2016-03-22 11:34:25,761 [main] WARN BlockStateChange (BlocksMap.java:log(73)) - After removed 127.0.0.1:9959, no live nodes contain the following 10 blocks: blk_21_0 blk_22_0 blk_23_0 blk_24_0 blk_25_0 blk_26_0 blk_27_0 blk_28_0 blk_29_0 blk_30_0
2016-03-22 11:34:25,761 [main] WARN BlockStateChange (BlocksMap.java:log(81)) - After removed 127.0.0.1:9959, no live nodes contain the following 1 blocks: blk_31_0
{quote}
Thanks.
> add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.2.patch, HDFS-9959.3.patch, > HDFS-9959.3.withtest.patch, HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
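The quoted output flushes the missing-block report in batches of ten per WARN line. A minimal sketch of such a batching logger, with hypothetical names (BatchedMissingBlockLog mirrors the shape of the patch's BlocksMap.MissingBlockLog but is reconstructed from the output, not copied from the patch):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedMissingBlockLog {
    private static final int BATCH_SIZE = 10;
    private final List<String> pending = new ArrayList<>();
    // Stand-in for NameNode.blockStateChangeLog; each entry is one WARN line.
    public final List<String> lines = new ArrayList<>();

    public void add(String blockName) {
        pending.add(blockName);
    }

    // Flush the collected blocks in batches of BATCH_SIZE, one WARN line per
    // batch, so a datanode with many blocks cannot produce one giant log line.
    public void log(String lastNode) {
        for (int i = 0; i < pending.size(); i += BATCH_SIZE) {
            List<String> batch =
                pending.subList(i, Math.min(i + BATCH_SIZE, pending.size()));
            lines.add("After removed " + lastNode
                + ", no live nodes contain the following " + batch.size()
                + " blocks: " + String.join(" ", batch));
        }
        pending.clear();
    }
}
```

With 31 blocks added, this produces three lines of ten blocks and one line of one block, matching the shape of the quoted output.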
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Attachment: HDFS-9959.2.patch How about this one? In an extreme case, for example when all DataNodes are dead, there may be more than a hundred thousand missing blocks, and logging all of them might take a very long time.
{code}
NameNode.blockStateChangeLog.warn("After removed " + dn
    + ", no live nodes contain the following " + missing.size()
    + " blocks: " + missing);
{code}
Which solution is better: 1. add only the first thousand blocks to the list and ignore the others, or 2. if more than a thousand blocks are missing, log only the first thousand?
> add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.2.patch, HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
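Both options amount to capping how many blocks reach the log line. A sketch of the second option, with hypothetical names and the thousand-block limit taken from the comment (not from the actual patch):

```java
import java.util.List;

public class MissingBlockSummary {
    static final int MAX_LOGGED_BLOCKS = 1000;

    // Log at most the first MAX_LOGGED_BLOCKS block names, but always report
    // the true total and how many entries were omitted.
    public static String format(String dn, List<String> missing) {
        int shown = Math.min(missing.size(), MAX_LOGGED_BLOCKS);
        StringBuilder sb = new StringBuilder("After removed " + dn
            + ", no live nodes contain the following " + missing.size()
            + " blocks: ");
        sb.append(String.join(" ", missing.subList(0, shown)));
        if (shown < missing.size()) {
            sb.append(" ... (" + (missing.size() - shown) + " more not shown)");
        }
        return sb.toString();
    }
}
```

Keeping the true count in the message preserves the operational signal (how bad the loss is) even when the block list itself is truncated.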
[jira] [Commented] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204604#comment-15204604 ] yunjiong zhao commented on HDFS-9959: - +1 for this. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Commented] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202137#comment-15202137 ] yunjiong zhao commented on HDFS-9959: - I understand, thanks. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Attachment: HDFS-9959.1.patch Updated the patch: 1. log after releasing the write lock; 2. changed error to info. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.1.patch, HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
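Logging after the write lock is released is the standard collect-then-emit pattern: gather what needs reporting while holding the lock, then do the slow string building and I/O outside the critical section so logging cannot extend the lock hold. A minimal sketch with illustrative names (not the patch's code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class DeferredLogging {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for the log appender; each entry is one emitted line.
    public final List<String> emitted = new ArrayList<>();

    public void removeDatanode(String dn, List<String> blocks) {
        List<String> toLog = new ArrayList<>();
        lock.writeLock().lock();
        try {
            // Mutate namespace state and remember what to report...
            toLog.addAll(blocks);
        } finally {
            lock.writeLock().unlock();
        }
        // ...then emit the (potentially slow) log lines outside the lock.
        for (String b : toLog) {
            emitted.add("Removed " + b + " from " + dn);
        }
    }
}
```

The copy into a local list is what makes this safe: the logging loop reads only thread-confined data, so no lock is needed while it runs.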
[jira] [Commented] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194482#comment-15194482 ] yunjiong zhao commented on HDFS-9959: - It shouldn't; it will only print blocks that were removed from the last live datanode and still belong to a file. In a real production cluster this should not happen often; otherwise there would be lots of corrupt files. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Commented] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194351#comment-15194351 ] yunjiong zhao commented on HDFS-9959: - If removeNode(Block b, DatanodeDescriptor node) was invoked by DatanodeManager.removeDatanode because DatanodeManager found a datanode with a lost heartbeat, the block may still be stored safely on that datanode's disk. So after resolving the temporary issue (power, network...) and starting the datanode process again, we will get the missing block back. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Commented] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15194245#comment-15194245 ] yunjiong zhao commented on HDFS-9959: - Other logs are not as convenient. For example, if the block was created years ago, we may not find anything in the recent BlockStateChange logs. As for the performance penalty, I think it should be fine because it won't generate many new messages. > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Status: Patch Available (was: Open) > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Updated] (HDFS-9959) add log when block removed from last live datanode
[ https://issues.apache.org/jira/browse/HDFS-9959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yunjiong zhao updated HDFS-9959: Attachment: HDFS-9959.patch > add log when block removed from last live datanode > -- > > Key: HDFS-9959 > URL: https://issues.apache.org/jira/browse/HDFS-9959 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: yunjiong zhao >Assignee: yunjiong zhao >Priority: Minor > Attachments: HDFS-9959.patch > > > Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last > datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help > to identify which datanode should be fixed first to recover missing blocks.
[jira] [Created] (HDFS-9959) add log when block removed from last live datanode
yunjiong zhao created HDFS-9959: --- Summary: add log when block removed from last live datanode Key: HDFS-9959 URL: https://issues.apache.org/jira/browse/HDFS-9959 Project: Hadoop HDFS Issue Type: Improvement Reporter: yunjiong zhao Assignee: yunjiong zhao Priority: Minor Add logs like "BLOCK* No live nodes contain block blk_1073741825_1001, last datanode contain it is node: 127.0.0.1:65341" in BlockStateChange should help to identify which datanode should be fixed first to recover missing blocks.