[ https://issues.apache.org/jira/browse/HDFS-7060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391501#comment-14391501 ]
Colin Patrick McCabe edited comment on HDFS-7060 at 4/1/15 9:31 PM:
--------------------------------------------------------------------

While I agree that making the DataNode heartbeat lockless might be a good idea in theory, it is not a bugfix. It is an optimization or improvement.

In contrast, the issue with {{FsDatasetImpl#createTemporary}} holding the lock for a long period is a bug, and it needs to be fixed. Even if the heartbeat is made lockless, HDFS-7999 will still cause many problems such as extremely high write and read latency, unresponsive datanodes, and so on. Making the heartbeat lockless does not fix HDFS-7999, although it might alleviate a few of the symptoms.

Edit: I see that Haohui explained why we might be able to remove some of these locks. I'll try to take a look later.


was (Author: cmccabe):
While I agree that making the DataNode heartbeat lockless might be a good idea in theory, it is not a bugfix. It is an optimization or improvement.

In contrast, the issue with {{FsDatasetImpl#createTemporary}} holding the lock for a long period is a bug, and it needs to be fixed. Even if the heartbeat is made lockless, HDFS-7999 will still cause many problems such as extremely high write and read latency, unresponsive datanodes, and so on. Making the heartbeat lockless does not fix HDFS-7999, although it might alleviate a few of the symptoms.

The patch here seems to remove a bunch of synchronized blocks, but I don't see any explanation for how we will avoid race conditions associated with concurrent access that these synchronized blocks were protecting against. This needs to be much better thought-out before we can consider it.

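To make the race-condition concern concrete, here is a minimal sketch of one case where a synchronized block can be dropped safely: a single counter that writers update atomically and a heartbeat-style reader reads without any monitor. The class {{VolumeUsageSketch}} and its methods are hypothetical names for illustration only; this is not the HDFS-7060 patch and not the real {{FsVolumeImpl}}. The argument only holds because the state is one independent value. Any invariant that spans several fields would still need a lock or a separate correctness argument.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for a per-volume usage counter; not the real FsVolumeImpl.
class VolumeUsageSketch {
    // Updated atomically by writers, so readers do not need a monitor.
    private final AtomicLong dfsUsed = new AtomicLong();

    // Writer path: record that `delta` more bytes are in use.
    void addBytes(long delta) {
        dfsUsed.addAndGet(delta);
    }

    // Reader path: safe to call from a heartbeat-style thread without any lock.
    long getDfsUsed() {
        return dfsUsed.get();
    }

    // Starts `threads` writers that each add 1 byte `perThread` times,
    // waits for them, and returns the final counter value.
    static long concurrentTotal(int threads, int perThread) {
        VolumeUsageSketch volume = new VolumeUsageSketch();
        Thread[] writers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            writers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    volume.addBytes(1);
                }
            });
            writers[i].start();
        }
        for (Thread t : writers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining writers", e);
            }
        }
        return volume.getDfsUsed();
    }

    public static void main(String[] args) {
        System.out.println("total = " + concurrentTotal(4, 10_000));
    }
}
```

With a plain {{long}} field and no synchronization, the same experiment could lose updates, which is the kind of risk the removed synchronized blocks would need to be checked against.
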
> Contentions of the monitor of FsDatasetImpl block DN's heartbeat
> ----------------------------------------------------------------
>
>                 Key: HDFS-7060
>                 URL: https://issues.apache.org/jira/browse/HDFS-7060
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Haohui Mai
>            Assignee: Xinwei Qin
>            Priority: Critical
>         Attachments: HDFS-7060-002.patch, HDFS-7060.000.patch, HDFS-7060.001.patch
>
>
> We're seeing the heartbeat is blocked by the monitor of {{FsDatasetImpl}} when the DN is under heavy load of writes:
> {noformat}
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:115)
>         - waiting to lock <0x0000000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:91)
>         - locked <0x0000000780612fd8> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:563)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:668)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:827)
>         at java.lang.Thread.run(Thread.java:744)
>
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:743)
>         - waiting to lock <0x0000000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:744)
>
>    java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.createFileExclusively(Native Method)
>         at java.io.File.createNewFile(File.java:1006)
>         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createTmpFile(DatanodeUtil.java:59)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:244)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:195)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:753)
>         - locked <0x0000000780304fb8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:169)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:621)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:124)
>         at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:232)
>         at java.lang.Thread.run(Thread.java:744)
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
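
The contention pattern in the thread dump can be reproduced in miniature. The sketch below is a simplified model, not HDFS code: {{MonitorContentionSketch}}, {{createRbwLike}} and {{getDfsUsedLike}} are hypothetical names, a latch stands in for the slow file creation, and only one monitor is modeled (the real trace also shows the heartbeat thread holding a separate {{java.lang.Object}} lock). A writer thread takes the shared monitor and stays inside it during the simulated slow I/O; a heartbeat-style reader that needs the same monitor is then observed via {{Thread.getState()}}.

```java
import java.util.concurrent.CountDownLatch;

// Simplified model of the contention in the thread dump; not HDFS code.
class MonitorContentionSketch {
    // Stands in for the FsDatasetImpl object monitor.
    private final Object datasetMonitor = new Object();
    private long dfsUsed = 0;

    // Writer path: holds the monitor for the whole duration of the simulated slow I/O.
    void createRbwLike(CountDownLatch entered, CountDownLatch slowIoDone) {
        synchronized (datasetMonitor) {
            entered.countDown();          // signal that the monitor is now held
            try {
                slowIoDone.await();       // simulated slow file creation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            dfsUsed += 1;
        }
    }

    // Heartbeat path: needs the same monitor just to read a counter.
    long getDfsUsedLike() {
        synchronized (datasetMonitor) {
            return dfsUsed;
        }
    }

    // Returns the last state observed for the heartbeat thread while the writer
    // still holds the monitor (polls for up to five seconds).
    static Thread.State observeHeartbeatState() {
        MonitorContentionSketch dataset = new MonitorContentionSketch();
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch slowIoDone = new CountDownLatch(1);
        Thread writer = new Thread(() -> dataset.createRbwLike(entered, slowIoDone), "writer");
        Thread heartbeat = new Thread(() -> dataset.getDfsUsedLike(), "heartbeat");
        Thread.State observed;
        try {
            writer.start();
            entered.await();              // writer now owns the monitor
            heartbeat.start();
            observed = heartbeat.getState();
            long deadline = System.nanoTime() + 5_000_000_000L;
            while (observed != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(1);
                observed = heartbeat.getState();
            }
            slowIoDone.countDown();       // let the simulated I/O finish
            writer.join();
            heartbeat.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during experiment", e);
        }
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("heartbeat state while writer holds monitor: "
                + observeHeartbeatState());
    }
}
```

The reader does no slow work itself; it is stalled purely because the writer performs its slow step while holding the shared monitor, which is why shortening the time spent inside the lock and making the reader lockless address different parts of the problem.
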