[ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229144#comment-15229144 ]
Wei-Chiu Chuang commented on HDFS-8496:
---------------------------------------

Linking HDFS-10267, which found and fixed a bug in HDFS-8496.

> Calling stopWriter() with FSDatasetImpl lock held may block other threads
> -------------------------------------------------------------------------
>
>                 Key: HDFS-8496
>                 URL: https://issues.apache.org/jira/browse/HDFS-8496
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: zhouyingchao
>            Assignee: Colin Patrick McCabe
>             Fix For: 2.8.0
>
>         Attachments: HDFS-8496-001.patch, HDFS-8496.002.patch, HDFS-8496.003.patch, HDFS-8496.004.patch
>
>
> On a DN of an HDFS 2.6 cluster, we noticed that some DataXceiver threads and heartbeat threads were blocked on the FSDatasetImpl lock for quite a while. Looking at the stacks, we found that a call to stopWriter() made with the FSDatasetImpl lock held was blocking everything else.
> The following heartbeat stack shows, as an example, how threads are blocked by the FSDatasetImpl lock:
> {code}
> java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
>         - waiting to lock <0x00000007701badc0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getAvailable(FsVolumeImpl.java:191)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
>         - locked <0x0000000770465dc0> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> The thread holding the FSDatasetImpl lock is simply waiting in stopWriter() for another thread to exit. Its stack is:
> {code}
> java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Thread.join(Thread.java:1194)
>         - locked <0x00000007636953b8> (a org.apache.hadoop.util.Daemon)
>         at org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverCheck(FsDatasetImpl.java:982)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverClose(FsDatasetImpl.java:1026)
>         - locked <0x00000007701badc0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:624)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
>         at java.lang.Thread.run(Thread.java:662)
> {code}
> In this case quite a few other workloads were deployed on the same DN, so the local file system and disks were very busy; we suspect that is why stopWriter() took so long.
> Either way, it is not reasonable to call stopWriter() with the FSDatasetImpl lock held. HDFS-7999 already changed createTemporary() to call stopWriter() without the FSDatasetImpl lock; we believe we should do the same in the other three methods: recoverClose(), recoverAppend() and recoverRbw().
> I'll try to finish a patch for this today.
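
For context, here is a minimal sketch of the lock-scoping pattern HDFS-7999 applied to createTemporary(), which the description proposes to reuse in the recover* methods: signal the old writer while holding the dataset monitor, but join it outside that monitor so heartbeat and DataXceiver threads are not stuck waiting on the lock during the join. This is an illustration under stated assumptions, not the actual FsDatasetImpl change; the names RecoverSketch, ReplicaStub and lookupReplica() are stand-ins for the real FsDatasetImpl/ReplicaInPipeline APIs.

{code}
// Illustrative sketch only: stop an old writer thread without holding the
// dataset monitor across the (potentially long) Thread.join().
public class RecoverSketch {
  // Plays the role of the FsDatasetImpl monitor in this sketch.
  private final Object datasetLock = new Object();

  ReplicaStub recoverCheckSketch(String blockId) throws InterruptedException {
    while (true) {
      Thread oldWriter;
      ReplicaStub replica;
      synchronized (datasetLock) {
        // Fast, in-memory work only while the lock is held.
        replica = lookupReplica(blockId);
        oldWriter = replica.getWriterThread();
        if (oldWriter == null) {
          return replica;          // no writer to stop; state is consistent
        }
        oldWriter.interrupt();     // ask the writer to stop before we wait
      }
      // Join OUTSIDE the monitor: this is the wait that previously blocked
      // every thread contending for the FsDatasetImpl lock.
      oldWriter.join();
      // Loop and re-check under the lock, since the replica's state may have
      // changed (e.g. a new writer attached) while the lock was released.
    }
  }

  // --- stand-ins for the real replica bookkeeping -----------------------
  static class ReplicaStub {
    private volatile Thread writer;          // set by the write pipeline in real code
    Thread getWriterThread() { return writer; }
  }

  private ReplicaStub lookupReplica(String blockId) {
    return new ReplicaStub();                // the real code consults the volume map
  }
}
{code}

The re-check loop matters because another writer can attach to the replica while the monitor is released, so the state has to be validated again under the lock before returning.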