[ 
https://issues.apache.org/jira/browse/HDFS-13441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437740#comment-16437740
 ] 

yunjiong zhao edited comment on HDFS-13441 at 4/13/18 9:50 PM:
---------------------------------------------------------------

{quote} 

BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and 
BlockKey's life time is much longer than that (can't recall right away) so it's 
surprising to me a single missing heartbeat would cause the error you 
mentioned. There's probably something deeper you need to dig into.
{quote}
DataNode gets block keys from NameNode happens at two places.

One is when DataNode registers to NameNode.

Another one is via heartbeat, by default happens every 600 minutes.

By default,
{quote}<property>
 <name>dfs.block.access.key.update.interval</name>
 <value>600</value>
 <description>
 Interval in minutes at which namenode updates its access keys.
 </description>
 </property>
{quote}
{quote}<property>
 <name>dfs.block.access.token.lifetime</name>
 <value>600</value>
 <description>The lifetime of access tokens in minutes.</description>
 </property>
{quote}
[~jojochuang] , double checked the code and log, the DataNode must be missing 
*two* heartbeats from Standby NameNode which contains new block key just before 
Standby NameNode become active.

We have more than 2000 Datanodes in this cluster, if more than three Datanodes 
doesn't have the current Block key, worst case is users will not able to read 
some blocks for 10 at most hours.


was (Author: zhaoyunjiong):
{quote} 

BlockKey is usually synchronized aggressively – IIRC every 2.5 hours, and 
BlockKey's life time is much longer than that (can't recall right away) so it's 
surprising to me a single missing heartbeat would cause the error you 
mentioned. There's probably something deeper you need to dig into.
{quote}
DataNode gets block keys from NameNode happens at two places.

One is when DataNode registers to NameNode.

Another one is via heartbeat, by default happens every 600 minutes.

By default,
{quote}<property>
 <name>dfs.block.access.key.update.interval</name>
 <value>600</value>
 <description>
 Interval in minutes at which namenode updates its access keys.
 </description>
</property>{quote}
{quote}<property>
 <name>dfs.block.access.token.lifetime</name>
 <value>600</value>
 <description>The lifetime of access tokens in minutes.</description>
</property>{quote}
 

Not any single heartbeat of course.

The missing heartbeat must from Standby NameNode which contains new block key, 
and Standby NameNode transferred to Active NameNode after that missing 
heartbeat happened and before the new heartbeat which contains new block keys.

We have more than 2000 Datanodes in this cluster, if more than three Datanodes 
doesn't have the current Block key, worst case is users will not able to read 
some blocks for 10 hours.

> DataNode missed BlockKey update from NameNode due to HeartbeatResponse was 
> dropped
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-13441
>                 URL: https://issues.apache.org/jira/browse/HDFS-13441
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 2.7.1
>            Reporter: yunjiong zhao
>            Assignee: yunjiong zhao
>            Priority: Major
>         Attachments: HDFS-13441.patch
>
>
> After NameNode failover, lots of application failed due to some DataNodes 
> can't re-compute password from block token.
> {code:java}
> 2018-04-11 20:10:52,448 ERROR 
> org.apache.hadoop.hdfs.server.datanode.DataNode: 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com:50010:DataXceiver error 
> processing unknown operation  src: /10.142.74.116:57404 dst: 
> /10.142.77.45:50010
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:194)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't 
> re-compute password for block_token_identifier (expiryDate=1523538652448, 
> keyId=1762737944, userId=hadoop, 
> blockPoolId=BP-36315570-10.103.108.13-1423055488042, blockId=12142862700, 
> access modes=[WRITE]), since the required block key (keyID=1762737944) 
> doesn't exist.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:382)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:79)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
>  
> In the DataNode log, we didn't see DataNode update block keys around 
> 2018-04-11 09:55:00 and around 2018-04-11 19:55:00.
> {code:java}
> 2018-04-10 14:51:36,424 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-10 23:55:38,420 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 00:51:34,792 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 10:51:39,403 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-11 20:51:44,422 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 02:54:47,855 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> 2018-04-12 05:55:44,456 INFO 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager: Setting 
> block keys
> {code}
> The reason is there is SocketTimeOutException when sending heartbeat to 
> StandbyNameNode
> {code:java}
> 2018-04-11 09:55:34,699 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> java.net.SocketTimeoutException: Call From 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to 
> ares-nn.vip.ebay.com:8030 failed on socket timeout exception: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.142.77.45:48803 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; 
> For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1430)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy16.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:155)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:427)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:543)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:725)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-04-11 19:56:13,308 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> java.net.SocketTimeoutException: Call From 
> hdc3-lvs01-400-1701-048.stratus.lvs.ebay.com/10.142.77.45 to 
> ares-nn.vip.ebay.com:8030 failed on socket timeout exception: 
> java.net.SocketTimeoutException: 60000 millis timeout while waiting for 
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected 
> local=/10.142.77.45:50381 remote=ares-nn.vip.ebay.com/10.103.108.200:8030]; 
> For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.GeneratedConstructorAccessor32.newInstance(Unknown 
> Source)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:801)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1430)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1363)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy16.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:155)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:427)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:543)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:725)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.SocketTimeoutException: 60000 millis timeout while 
> waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/10.142.77.45:50381 
> remote=ares-nn.vip.ebay.com/10.103.108.200:8030]
>         at 
> org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
>         at 
> org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at java.io.FilterInputStream.read(FilterInputStream.java:133)
>         at 
> org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:523)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at 
> org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1087)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:982)
> {code}
> In the Standby NameNode, we can see it dropped the HeartbeatResponses which 
> contain new BlockKey.
> {code:java}
> 2018-04-11 09:53:34,293 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 25 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:54371 Call#13364496 Retry#0
> 2018-04-11 09:56:42,436 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 8 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:48803 Call#13364569 Retry#0
> 2018-04-11 19:54:13,092 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 43 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:52870 Call#13385487 Retry#0
> 2018-04-11 19:57:21,774 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 50 on 8030: skipped 
> org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat from 
> 10.142.77.45:35918 Call#13385562 Retry#0
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to