[ https://issues.apache.org/jira/browse/HDFS-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka reassigned HDFS-16332:
------------------------------------

    Assignee: Shinya Yoshida

> Expired block token causes slow read due to missing handling in sasl handshake
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-16332
>                 URL: https://issues.apache.org/jira/browse/HDFS-16332
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, dfs, dfsclient
>    Affects Versions: 2.8.5, 3.3.1
>            Reporter: Shinya Yoshida
>            Assignee: Shinya Yoshida
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> We are operating an HBase 1.4.x cluster on Hadoop 2.8.5.
> We recently evaluated a Kerberos-secured HBase and Hadoop cluster under 
> production load, and we observed HBase response slowdowns of several seconds 
> or more, and of several minutes in the worst case (roughly one to three times 
> a month).
> The following image is a scatter plot of HBase's slow-response logs; each 
> circle is one slow-response log entry.
> The X-axis is the time the log entry occurred and the Y-axis is the response 
> delay.
>  !Screenshot from 2021-11-18 12-14-29.png! 
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime", 
> which allowed us to identify the cause.
> (We used dfs.block.access.token.lifetime=60, i.e. 1 hour.)
> With hedged read enabled:
>  !Screenshot from 2021-11-18 12-11-34.png! 
> With hedged read disabled:
>  !Screenshot from 2021-11-18 13-31-35.png! 
> As you can see, the impact is worst when hedged read is enabled, but it 
> happens whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> This happens when the block token has expired, and the root cause is the 
> incorrect handling of the InvalidToken exception during the SASL handshake in 
> SaslDataTransferServer.
> I propose adding a new response code to DataTransferEncryptorStatus that asks 
> the client to refresh the block token, as DataTransferProtos already does.
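> To make the intent concrete, here is a small, self-contained illustration of 
> that idea; the status name ERROR_ACCESS_TOKEN and the helper methods below are 
> hypothetical placeholders, not the actual Hadoop API or the code in the pull 
> request:
> {code:java}
> // Hypothetical sketch: a dedicated handshake status for an expired block token,
> // so the client refreshes the token instead of treating the DataNode as dead.
> enum HandshakeStatus { SUCCESS, ERROR, ERROR_ACCESS_TOKEN }
> 
> final class HandshakeSketch {
>     // DataNode side: map an expired/invalid block token to the dedicated status
>     // instead of the generic ERROR used today.
>     static HandshakeStatus serverEvaluate(boolean blockTokenExpired) {
>         return blockTokenExpired ? HandshakeStatus.ERROR_ACCESS_TOKEN
>                                  : HandshakeStatus.SUCCESS;
>     }
> 
>     // Client side: only ERROR_ACCESS_TOKEN triggers a block token refetch and a
>     // retry against the same DataNode; a generic ERROR keeps today's behavior.
>     static String clientReact(HandshakeStatus status) {
>         switch (status) {
>             case ERROR_ACCESS_TOKEN:
>                 return "refetch block token from the NameNode, then retry the read";
>             case ERROR:
>                 return "treat as an I/O failure (current behavior)";
>             default:
>                 return "proceed with the read";
>         }
>     }
> 
>     public static void main(String[] args) {
>         System.out.println(clientReact(serverEvaluate(true)));   // expired-token path
>         System.out.println(clientReact(serverEvaluate(false)));  // healthy path
>     }
> }
> {code}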
> The test code and patch are available at 
> https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code; I verified it on 
> the 2.8.5 branch and on trunk.
> {code:java}
> // HDFS is configured as a secure cluster; PATH points to an existing file
> byte[] bytes = new byte[1024];
> try (FileSystem fs = newFileSystem();
>      FSDataInputStream in = fs.open(PATH)) {
>     waitBlockTokenExpired(in);
>     in.read(0, bytes, 0, bytes.length);  // positional read after the tokens expired
> }
> 
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
>     DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
>     // Wait until the token of every located block has expired
>     for (LocatedBlock block : innerStream.getAllBlocks()) {
>         while (!SecurityTestUtil.isBlockTokenExpired(block.getBlockToken())) {
>             Thread.sleep(100);
>         }
>     }
> }
> {code}
> Here is the log we got; we added custom logging before and after the block 
> token refresh:
> https://github.com/bitterfox/hadoop/commit/173a9f876f2264b76af01d658f624197936fd79c
> {code}
> 2021-11-16 09:40:20,330 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
>         at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-11-16 09:40:20,331 WARN  [hedgedRead-247] hdfs.DFSClient: Connection 
> failure: Failed to connect to /10.10.10.1:12345 for file 
> /hbase/data/default/test_table/<encoded-region-name>/o/<store-file> for block 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924:java.io.IOException:
>  DIGEST-MD5: IO error acquiring password
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
>         at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-11-16 09:40:20,332 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
>         at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-11-16 09:40:20,332 WARN  [hedgedRead-247] hdfs.DFSClient: Connection 
> failure: Failed to connect to /10.10.10.2:12345 for file 
> /hbase/data/default/test_table/<encoded-region-name>/o/<store-file> for block 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924:java.io.IOException:
>  DIGEST-MD5: IO error acquiring password
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
>         at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-11-16 09:40:20,333 WARN  [hedgedRead-247] impl.BlockReaderFactory: I/O 
> error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
>         at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-11-16 09:40:20,334 WARN  [hedgedRead-247] hdfs.DFSClient: Connection 
> failure: Failed to connect to /10.10.10.3:12345 for file 
> /hbase/data/default/test_table/<encoded-region-name>/o/<store-file> for block 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924:java.io.IOException:
>  DIGEST-MD5: IO error acquiring password
> java.io.IOException: DIGEST-MD5: IO error acquiring password
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
>         at 
> org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568)
>         at 
> org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740)
>         at 
> org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189)
>         at 
> org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> 2021-11-16 09:40:20,334 WARN  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: No live nodes contain block 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 after checking 
> nodes = [DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]], ignoredNodes = 
> [DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]]
> 2021-11-16 09:40:20,334 INFO  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: Could not obtain 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 from any node:  
> No live nodes contain current block Block locations: 
> DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Dead nodes:  
> DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Ignored nodes:  
> DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]. Will get new block 
> locations from namenode and retry...
> 2021-11-16 09:40:20,334 WARN  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 
> 523.0985342660828 msec.
> 2021-11-16 09:40:20,860 WARN  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: No live nodes contain block 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 after checking 
> nodes = [DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]], ignoredNodes = 
> [DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]]
> 2021-11-16 09:40:20,860 INFO  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: Could not obtain 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 from any node:  
> No live nodes contain current block Block locations: 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Dead nodes:  Ignored 
> nodes:  DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]. Will get new block 
> locations from namenode and retry...
> 2021-11-16 09:40:20,860 WARN  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 
> 8025.758935908773 msec.
> 2021-11-16 09:40:28,887 INFO  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: Could not obtain 
> BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 from any node:  
> No live nodes contain current block Block locations: 
> DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,
> DISK] DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Dead nodes:  Ignored 
> nodes:  DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] 
> DatanodeInfoWithStorage[10.10.10.2:1146
> 2,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]. 
> Will get new block locations from namenode and retry...
> 2021-11-16 09:40:28,887 WARN  
> [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] 
> hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 
> 7995.183785064122 msec.
> 2021-11-16 09:40:59,922 WARN  
> [RpcServer.default.RWQ.Fifo.read.handler=371,queue=36,port=10001] 
> ipc.RpcServer: (responseTooSlow): 
> {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","multi.gets":3,"starttimems":"1637023220329","responsesize":"64393","method":"Multi","param":"region\u003d
>  test_table,***,1631095286710.<encoded-region-name>., for 3 action(s) and 1st 
> row 
> keyTRUNCATED","processingtimems":39592,"client":"10.30.30.1:56789","queuetimems":0,"multi.servicecalls":0,"class":"HRegionServer","multi.mutations":0}
> {code}
> As you can see, the IOException occurs and then all datanodes are considered 
> dead nodes.
> Also, there is no sign of the block token refresh occurring.
> So the block token refresh logic is not performed in some cases; instead, all 
> datanodes are marked as dead, and then chooseDataNode and refetchLocations are 
> triggered with a sleep.
> refetchLocations sleeps for up to `dfsClient.getConf().getTimeWindow()`, which 
> defaults to 3 seconds, on the first failure.
> https://github.com/apache/hadoop/blob/91af256a5b44925e5dfdf333293251a19685ba2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L982-L1046
> That is why we see slowdowns of 0~3 seconds when hedged read is disabled.
> refetchLocations clears the dead nodes, but the ignored nodes managed in 
> hedgedFetchBlockByteRange are not cleared, so hedgedFetchBlockByteRange retries 
> refetchLocations repeatedly, up to 
> `dfsClient.getConf().getMaxBlockAcquireFailures()` times.
> (The sleep in refetchLocations is computed as `timeWindow * failure + 
> timeWindow * (failure + 1) * nextDouble()`, which is why we see response 
> slowdowns of several minutes when hedged read is enabled; a rough calculation 
> of these sleep bounds is sketched below.)
> After these retries, a BlockMissingException is thrown.
> https://github.com/apache/hadoop/blob/91af256a5b44925e5dfdf333293251a19685ba2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1343-L1386
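> For reference, here is a quick back-of-the-envelope sketch of what that sleep 
> formula implies, assuming the 3-second default time window noted above and 
> assuming the default of 3 for dfs.client.max.block.acquire.failures:
> {code:java}
> // Rough bounds implied by: timeWindow * failure + timeWindow * (failure + 1) * nextDouble()
> // timeWindow = 3000 ms (default, per above); 3 acquire failures assumed as the
> // default of dfs.client.max.block.acquire.failures. Illustrative only.
> public class RefetchBackoffSketch {
>     public static void main(String[] args) {
>         long timeWindowMs = 3000;
>         int maxBlockAcquireFailures = 3;
>         double worstCaseTotal = 0;
>         for (int failure = 0; failure < maxBlockAcquireFailures; failure++) {
>             double min = timeWindowMs * failure;                   // nextDouble() == 0
>             double max = timeWindowMs * failure
>                     + timeWindowMs * (failure + 1);                // nextDouble() -> 1
>             worstCaseTotal += max;
>             System.out.printf("failure %d: sleep in [%.0f, %.0f) ms%n", failure, min, max);
>         }
>         // Prints [0, 3000), [3000, 9000), [6000, 15000): roughly up to 27 s for
>         // one read path. With hedged reads the ignored nodes are not cleared, so
>         // this retry cycle repeats and the stall can stretch to minutes, as observed.
>         System.out.printf("worst case over %d failures: ~%.0f ms%n",
>                 maxBlockAcquireFailures, worstCaseTotal);
>     }
> }
> {code}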
> We dug into the IOException stack trace and found that the SASL handshake 
> returns an error.
> We added logging on the SaslDataTransferServer side:
> https://github.com/bitterfox/hadoop/tree/saslhandshake-error-log
> and then we got the following DataNode error:
> {code}
> 2021-11-16 16:11:06,480 ERROR 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer: 
> Generic sasl error for client /10.10.10.4:45678
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password 
> [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block 
> token with block_token_identifier (expiryDate=1637046306844, keyId=<keyid>, 
> userId=hbase, blockPoolId=BP-123456789-10.20.20.1-1629777195910, 
> blockId=<blockid>, access modes=[READ]) is expired.]
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:231)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block 
> token with block_token_identifier (expiryDate=1637046306844, keyId=<keyid>, 
> userId=hbase, blockPoolId=BP-123456789-10.20.20.1-1629777195910, 
> blockId=<blockid>, access modes=[READ]) is expired.
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:377)
>         at 
> org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:80)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at 
> org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at 
> com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
> As you can see, the expired token is used, and retrievePassword, which is 
> used for SASL, throws an InvalidToken exception.
> retrievePassword: 
> https://github.com/apache/hadoop/blob/91af256a5b44925e5dfdf333293251a19685ba2a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/security/token/block/BlockTokenSecretManager.java#L501-L506
> So this issue occurs whenever a connection is newly established after the 
> block token has expired.


