[ https://issues.apache.org/jira/browse/HDFS-16332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Akira Ajisaka reassigned HDFS-16332:
------------------------------------

    Assignee: Shinya Yoshida

> Expired block token causes slow read due to missing handling in sasl handshake
> ------------------------------------------------------------------------------
>
>                 Key: HDFS-16332
>                 URL: https://issues.apache.org/jira/browse/HDFS-16332
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, dfs, dfsclient
>    Affects Versions: 2.8.5, 3.3.1
>            Reporter: Shinya Yoshida
>            Assignee: Shinya Yoshida
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screenshot from 2021-11-18 12-11-34.png, Screenshot from 2021-11-18 12-14-29.png, Screenshot from 2021-11-18 13-31-35.png
>
>          Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> We are operating an HBase 1.4.x cluster on Hadoop 2.8.5.
> We have recently been evaluating a Kerberos-secured HBase and Hadoop cluster under production load, and we observed HBase responses slowing by several seconds or more, and by several minutes in the worst case (roughly one to three times a month).
> The following image is a scatter plot of HBase's slow responses; each circle is one slow-response log entry.
> The X-axis is the time the log entry occurred, and the Y-axis is the response time.
> !Screenshot from 2021-11-18 12-14-29.png!
> We could reproduce this issue by reducing "dfs.block.access.token.lifetime", which let us identify the cause.
> (We used dfs.block.access.token.lifetime=60, i.e. 1 hour.)
> When hedged read is enabled:
> !Screenshot from 2021-11-18 12-11-34.png!
> When hedged read is disabled:
> !Screenshot from 2021-11-18 13-31-35.png!
> As you can see, it is worst when hedged read is enabled, but it happens whether hedged read is enabled or not.
> This impacts our 99th-percentile response time.
> It happens when the block token has expired, and the root cause is the incorrect handling of the InvalidToken exception during the SASL handshake in SaslDataTransferServer.
> I propose adding a new response code to DataTransferEncryptorStatus that requests the client to update the block token, as DataTransferProtos does.
> The test code and patch are available at https://github.com/apache/hadoop/pull/3677
> We could reproduce this issue with the following test code on the 2.8.5 branch and, as far as I tested, on trunk:
> {code:java}
> // HDFS is configured as a secure cluster.
> try (FileSystem fs = newFileSystem();
>      FSDataInputStream in = fs.open(PATH)) {
>   // Wait for every block token held by the open stream to expire,
>   // then issue a positional read that needs a new datanode connection.
>   waitBlockTokenExpired(in);
>   in.read(0, bytes, 0, bytes.length);
> }
>
> private void waitBlockTokenExpired(FSDataInputStream in1) throws Exception {
>   DFSInputStream innerStream = (DFSInputStream) in1.getWrappedStream();
>   for (LocatedBlock block : innerStream.getAllBlocks()) {
>     while (!SecurityTestUtil.isBlockTokenExpired(block.getBlockToken())) {
>       Thread.sleep(100);
>     }
>   }
> }
> {code}
> Here is the log we got; we added custom logging before and after the block token refresh:
> https://github.com/bitterfox/hadoop/commit/173a9f876f2264b76af01d658f624197936fd79c
> {code}
> 2021-11-16 09:40:20,330 WARN [hedgedRead-247] impl.BlockReaderFactory: I/O error constructing remote block reader.
> java.io.IOException: DIGEST-MD5: IO error acquiring password > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205) > at > org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-11-16 09:40:20,331 WARN [hedgedRead-247] hdfs.DFSClient: Connection > failure: Failed to connect to /10.10.10.1:12345 for file > /hbase/data/default/test_table/<encoded-region-name>/o/<store-file> for block > BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924:java.io.IOException: > DIGEST-MD5: IO error acquiring password > java.io.IOException: DIGEST-MD5: IO error acquiring password > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568) > at > 
org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205) > at > org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-11-16 09:40:20,332 WARN [hedgedRead-247] impl.BlockReaderFactory: I/O > error constructing remote block reader. > java.io.IOException: DIGEST-MD5: IO error acquiring password > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205) > at > org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-11-16 09:40:20,332 WARN [hedgedRead-247] hdfs.DFSClient: Connection > failure: Failed to connect to /10.10.10.2:12345 for file > /hbase/data/default/test_table/<encoded-region-name>/o/<store-file> for block > BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924:java.io.IOException: > DIGEST-MD5: IO error acquiring password > java.io.IOException: DIGEST-MD5: IO error acquiring password > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205) > at > org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-11-16 09:40:20,333 WARN [hedgedRead-247] impl.BlockReaderFactory: I/O > error constructing remote block reader. 
> java.io.IOException: DIGEST-MD5: IO error acquiring password > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568) > at > org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205) > at > org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-11-16 09:40:20,334 WARN [hedgedRead-247] hdfs.DFSClient: Connection > failure: Failed to connect to /10.10.10.3:12345 for file > /hbase/data/default/test_table/<encoded-region-name>/o/<store-file> for block > BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924:java.io.IOException: > DIGEST-MD5: IO error acquiring password > java.io.IOException: DIGEST-MD5: IO error acquiring password > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.readSaslMessageAndNegotiatedCipherOption(DataTransferSaslUtil.java:420) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:475) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:389) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211) > at > org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160) > at > org.apache.hadoop.hdfs.DFSUtilClient.peerFromSocketAndKey(DFSUtilClient.java:568) > at > 
org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:2880) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:815) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:740) > at > org.apache.hadoop.hdfs.client.impl.BlockReaderFactory.build(BlockReaderFactory.java:385) > at > org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:697) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1237) > at > org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1205) > at > org.apache.hadoop.hdfs.DFSInputStream.access$000(DFSInputStream.java:98) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1189) > at > org.apache.hadoop.hdfs.DFSInputStream$2.call(DFSInputStream.java:1181) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2021-11-16 09:40:20,334 WARN > [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] > hdfs.DFSClient: No live nodes contain block > BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 after checking > nodes = [DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], > DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], > DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]], ignoredNodes = > [DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], > DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], > DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]] > 2021-11-16 09:40:20,334 INFO > [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] > hdfs.DFSClient: Could not obtain > BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 from any node: > No live nodes contain current block Block locations: > DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] > DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] > DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Dead nodes: > DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] > DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] > DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Ignored nodes: > DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] > DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] > DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]. Will get new block > locations from namenode and retry... > 2021-11-16 09:40:20,334 WARN > [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] > hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for > 523.0985342660828 msec. 
> 2021-11-16 09:40:20,860 WARN [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] hdfs.DFSClient: No live nodes contain block BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 after checking nodes = [DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]], ignoredNodes = [DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK], DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK], DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]]
> 2021-11-16 09:40:20,860 INFO [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] hdfs.DFSClient: Could not obtain BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 from any node: No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Dead nodes: Ignored nodes: DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]. Will get new block locations from namenode and retry...
> 2021-11-16 09:40:20,860 WARN [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] hdfs.DFSClient: DFS chooseDataNode: got # 2 IOException, will wait for 8025.758935908773 msec.
> 2021-11-16 09:40:28,887 INFO [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] hdfs.DFSClient: Could not obtain BP-123456789-10.20.20.1-1629777195910:blk_9876543212_1357924 from any node: No live nodes contain current block Block locations: DatanodeInfoWithStorage[10.10.10.2:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK] Dead nodes: Ignored nodes: DatanodeInfoWithStorage[10.10.10.1:12345,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.2:11462,DS-*****,DISK] DatanodeInfoWithStorage[10.10.10.3:12345,DS-*****,DISK]. Will get new block locations from namenode and retry...
> 2021-11-16 09:40:28,887 WARN [RpcServer.default.RWQ.Fifo.read.handler=363,queue=28,port=10001] hdfs.DFSClient: DFS chooseDataNode: got # 3 IOException, will wait for 7995.183785064122 msec.
> 2021-11-16 09:40:59,922 WARN [RpcServer.default.RWQ.Fifo.read.handler=371,queue=36,port=10001] ipc.RpcServer: (responseTooSlow): {"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","multi.gets":3,"starttimems":"1637023220329","responsesize":"64393","method":"Multi","param":"region\u003d test_table,***,1631095286710.<encoded-region-name>., for 3 action(s) and 1st row keyTRUNCATED","processingtimems":39592,"client":"10.30.30.1:56789","queuetimems":0,"multi.servicecalls":0,"class":"HRegionServer","multi.mutations":0}
> {code}
> As you can see, the IOException occurs and then all datanodes are considered dead nodes.
> Also, you cannot see the block token refresh happening anywhere.
> So the block-token refresh logic is not performed in some cases; all datanodes are marked as dead, and then chooseDataNode and refetchLocations are triggered, together with the sleep.
> refetchLocations sleeps for up to `dfsClient.getConf().getTimeWindow()` (default 3 seconds) on the first failure:
> https://github.com/apache/hadoop/blob/91af256a5b44925e5dfdf333293251a19685ba2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L982-L1046
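> To make the retry timing concrete, here is a minimal, illustrative sketch (not the actual DFSInputStream code) of the backoff arithmetic described here and quoted below, assuming the default dfs.client.retry.window.base of 3000 ms and the default dfs.client.max.block.acquire.failures of 3:
> {code:java}
> import java.util.Random;
>
> public class RefetchBackoffSketch {
>   public static void main(String[] args) {
>     final double timeWindowMs = 3000.0;    // assumed default of dfs.client.retry.window.base
>     final int maxBlockAcquireFailures = 3; // assumed default of dfs.client.max.block.acquire.failures
>     final Random random = new Random();
>
>     double totalMs = 0;
>     for (int failures = 0; failures < maxBlockAcquireFailures; failures++) {
>       // Backoff formula quoted from refetchLocations in this report:
>       // timeWindow * failures + timeWindow * (failures + 1) * nextDouble()
>       double waitMs = timeWindowMs * failures
>           + timeWindowMs * (failures + 1) * random.nextDouble();
>       totalMs += waitMs;
>       System.out.printf("failure #%d: sleep %.0f ms (worst case %.0f ms)%n",
>           failures + 1, waitMs, timeWindowMs * failures + timeWindowMs * (failures + 1));
>     }
>     System.out.printf("total sleep before giving up: %.0f ms%n", totalMs);
>   }
> }
> {code}
> With those defaults the per-failure sleep bound grows from 3 s to 15 s, which is consistent with the 0-3 s slowdowns in the non-hedged case; once the retries pile up in the hedged-read path described below, the accumulated sleep reaches the multi-minute slowdowns reported above.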
> That sleep is why we see slowdowns of 0 to 3 seconds when hedged read is disabled.
> refetchLocations clears the dead nodes, but the ignored nodes managed in hedgedFetchBlockByteRange are not cleared, so hedgedFetchBlockByteRange runs refetchLocations repeatedly, up to `dfsClient.getConf().getMaxBlockAcquireFailures()` times.
> (The sleep in refetchLocations is computed as `timeWindow * failures + timeWindow * (failures + 1) * nextDouble()`, which is why we see response slowdowns of several minutes when hedged read is enabled.)
> After these retries, BlockMissingException is thrown.
> https://github.com/apache/hadoop/blob/91af256a5b44925e5dfdf333293251a19685ba2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java#L1343-L1386
> We dug into the IOException stack trace and found that the SASL handshake returns an error.
> We added logging on the SaslDataTransferServer side:
> https://github.com/bitterfox/hadoop/tree/saslhandshake-error-log
> and then we got the following DN error:
> {code}
> 2021-11-16 16:11:06,480 ERROR org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer: Generic sasl error for client /10.10.10.4:45678
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1637046306844, keyId=<keyid>, userId=hbase, blockPoolId=BP-123456789-10.20.20.1-1629777195910, blockId=<blockid>, access modes=[READ]) is expired.]
>         at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:598)
>         at com.sun.security.sasl.digest.DigestMD5Server.evaluateResponse(DigestMD5Server.java:244)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslParticipant.evaluateChallengeOrResponse(SaslParticipant.java:115)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.doSaslHandshake(SaslDataTransferServer.java:376)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.getSaslStreams(SaslDataTransferServer.java:300)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.receive(SaslDataTransferServer.java:127)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:231)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.security.token.SecretManager$InvalidToken: Block token with block_token_identifier (expiryDate=1637046306844, keyId=<keyid>, userId=hbase, blockPoolId=BP-123456789-10.20.20.1-1629777195910, blockId=<blockid>, access modes=[READ]) is expired.
>         at org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager.retrievePassword(BlockTokenSecretManager.java:377)
>         at org.apache.hadoop.hdfs.security.token.block.BlockPoolTokenSecretManager.retrievePassword(BlockPoolTokenSecretManager.java:80)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.buildServerPassword(SaslDataTransferServer.java:318)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer.access$100(SaslDataTransferServer.java:73)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$2.apply(SaslDataTransferServer.java:297)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferServer$SaslServerCallbackHandler.handle(SaslDataTransferServer.java:241)
>         at com.sun.security.sasl.digest.DigestMD5Server.validateClientResponse(DigestMD5Server.java:589)
>         ... 7 more
> {code}
> As you can see, the expired token is used, and retrievePassword, which is called during the SASL handshake, throws an InvalidToken exception.
> retrievePassword:
> https://github.com/apache/hadoop/blob/91af256a5b44925e5dfdf333293251a19685ba2a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/security/token/block/BlockTokenSecretManager.java#L501-L506
> So this issue happens whenever a new connection is established after the block token has expired.
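> For illustration only, here is a minimal sketch of the direction the proposal above describes. It is not the patch in https://github.com/apache/hadoop/pull/3677; the enum below merely stands in for DataTransferEncryptorMessageProto.DataTransferEncryptorStatus, and ERROR_ACCESS_TOKEN is an assumed name for the new response code. The idea is that the server reports the expired block token explicitly, and the client maps it to an exception type it can recover from by refetching block locations (and fresh tokens) from the namenode, instead of the generic "DIGEST-MD5: IO error acquiring password" failure that marks every datanode dead.
> {code:java}
> import java.io.IOException;
>
> public class SaslStatusHandlingSketch {
>
>   // Stand-in for the protobuf enum DataTransferEncryptorStatus.
>   // SUCCESS, ERROR_UNKNOWN_KEY and ERROR exist today; ERROR_ACCESS_TOKEN is
>   // the hypothetical new response code proposed in this issue.
>   enum DataTransferEncryptorStatus { SUCCESS, ERROR_UNKNOWN_KEY, ERROR, ERROR_ACCESS_TOKEN }
>
>   // Stand-in for o.a.h.hdfs.security.token.block.InvalidBlockTokenException,
>   // an exception type the client read path can recover from by refetching
>   // the block (and its token) from the namenode.
>   static class InvalidBlockTokenException extends IOException {
>     InvalidBlockTokenException(String msg) { super(msg); }
>   }
>
>   // Sketch of how the client could map the SASL negotiation status to exceptions.
>   static void checkSaslResponse(DataTransferEncryptorStatus status, String message)
>       throws IOException {
>     switch (status) {
>       case SUCCESS:
>         return;
>       case ERROR_UNKNOWN_KEY:
>         // Roughly the existing behavior: the client refreshes its data encryption key.
>         throw new IOException("Encryption key out of date: " + message);
>       case ERROR_ACCESS_TOKEN:
>         // Proposed behavior: surface an exception the caller recovers from by
>         // fetching a fresh block token, instead of marking the datanode as dead.
>         throw new InvalidBlockTokenException("Block token expired: " + message);
>       default:
>         throw new IOException("SASL handshake failed: " + message);
>     }
>   }
>
>   public static void main(String[] args) throws IOException {
>     checkSaslResponse(DataTransferEncryptorStatus.SUCCESS, "ok"); // completes without error
>   }
> }
> {code}
> With an explicit status like this, the client could take the existing token-refresh path instead of the dead-node/refetchLocations loop shown in the logs above.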