[ https://issues.apache.org/jira/browse/HDFS-11260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated HDFS-11260:
------------------------------
    Target Version/s: 3.0.0-beta1  (was: 2.8.0, 3.0.0-beta1)

> Slow writer threads are not stopped
> -----------------------------------
>
>                 Key: HDFS-11260
>                 URL: https://issues.apache.org/jira/browse/HDFS-11260
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 2.7.0
>         Environment: CDH5.8.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Wei-Chiu Chuang
>
> If a DataNode receives a transferred block, it tries to stop the writer to the
> same block. However, this may not work, and we saw the following error
> message and stack trace.
> Fundamentally, the assumption of {{ReplicaInPipeline#stopWriter}} is wrong.
> It assumes the writer thread must be a DataXceiver thread, which can be
> interrupted and then terminates. However, an IPC thread may also be the
> writer thread (by calling initReplicaRecovery), and IPC threads ignore
> interrupts and do not terminate.
> {noformat}
> 2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
> java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
> org.apache.hadoop.ipc.CallQueueManager.take(CallQueueManager.java:135)
> org.apache.hadoop.ipc.Server$Handler.run(Server.java:2052)
> 2016-12-16 19:58:56,167 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver constructor.
> Cause is
> 2016-12-16 19:58:56,168 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: sj1dra082.corp.adobe.com:50010:DataXceiver error processing WRITE_BLOCK operation src: /10.10.0.80:44105 dst: /10.10.0.82:50010
> java.io.IOException: Join on writer thread Thread[IPC Server handler 6 on 50020,5,main] timed out
> 	at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:212)
> 	at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1579)
> 	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:195)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:669)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:169)
> 	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:106)
> 	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:246)
> 	at java.lang.Thread.run(Thread.java:745)
> {noformat}
> There is also a logic error in FsDatasetImpl#createTemporary: if the
> code in the synchronized block executes for more than 60 seconds (in theory),
> it can throw an exception without ever trying to stop the existing slow writer.
> We saw an FsDatasetImpl#createTemporary call fail after nearly 10 minutes, and
> it's unclear why yet. It's my understanding that the code intends to stop
> slow writers after 1 minute by default. Some code rewrite is probably needed
> to get the logic right.
> {noformat}
> 2016-12-16 23:12:24,636 WARN org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Unable to stop existing writer for block BP-1527842723-10.0.0.180-1367984731269:blk_4313782210_1103780331023 after 568320 miniseconds.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
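[Editor's note] The join timeout described in the issue can be reproduced outside Hadoop. The following is a minimal, hypothetical sketch (the class and method names are illustrative, not the actual Hadoop sources): a worker thread that swallows InterruptedException while polling a queue, paired with the interrupt-then-join pattern that {{ReplicaInPipeline#stopWriter}} uses, times out exactly as in the log above, because the interrupt never makes the worker exit.

```java
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class; not part of Hadoop.
public class StopWriterDemo {

    // Mimics an IPC handler thread: blocks on a queue and keeps serving
    // after an interrupt (as the Server$Handler loop around
    // CallQueueManager.take effectively does).
    static Thread uninterruptibleWorker(LinkedBlockingQueue<Runnable> queue) {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    queue.take().run(); // parks here, like the stack above
                } catch (InterruptedException ignored) {
                    // Interrupt swallowed: the thread loops and parks again,
                    // so it never terminates in response to stopWriter().
                }
            }
        }, "fake-ipc-handler");
        t.setDaemon(true);
        t.start();
        return t;
    }

    // Mimics the ReplicaInPipeline#stopWriter pattern: interrupt the
    // writer, then join with a timeout.
    static boolean stopWriter(Thread writer, long timeoutMs)
            throws InterruptedException {
        writer.interrupt();
        writer.join(timeoutMs);
        // false corresponds to "Join on writer thread ... timed out"
        return !writer.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        Thread worker = uninterruptibleWorker(new LinkedBlockingQueue<>());
        boolean stopped = stopWriter(worker, 500);
        System.out.println("writer stopped: " + stopped);
    }
}
```

With a DataXceiver-style writer, the interrupt propagates out of the I/O call and the thread exits, so the join succeeds; with an interrupt-swallowing thread like the one above, stopWriter can only ever time out.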