[ https://issues.apache.org/jira/browse/HDFS-14536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Chow updated HDFS-14536:
-----------------------------
    Attachment: HDFS-14536.00.patch

> It's safe to remove synchronized from reconcile scan
> ----------------------------------------------------
>
>                 Key: HDFS-14536
>                 URL: https://issues.apache.org/jira/browse/HDFS-14536
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 2.6.0, 2.7.0
>            Reporter: Sean Chow
>            Priority: Major
>         Attachments: HDFS-14536.00.patch
>
> I found all my requests blocked every 6 hours, and used jstack to dump the threads:
> {code:java}
> "1490900528@qtp-341796579-35" #132086969 daemon prio=5 os_prio=0 tid=0x00007f7ed01c3800 nid=0x1a1c waiting for monitor entry [0x00007f7e9e42b000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:295)
>         - waiting to lock <0x00000002c00410f8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getVolumeInfo(FsDatasetImpl.java:2323)
>
> /// lots of threads BLOCKED like this for about one minute
>
> "java.util.concurrent.ThreadPoolExecutor$Worker@49d8fff9[State = -1, empty queue]" #77 daemon prio=5 os_prio=0 tid=0x00007f7ed8590800 nid=0x49cc runnable [0x00007f7ea0344000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.idToBlockDir(DatanodeUtil.java:120)
>         at org.apache.hadoop.hdfs.server.datanode.ReplicaInfo.getDir(ReplicaInfo.java:143)
>         at org.apache.hadoop.hdfs.server.datanode.ReplicaInfo.<init>(ReplicaInfo.java:95)
>         at org.apache.hadoop.hdfs.server.datanode.FinalizedReplica.<init>(FinalizedReplica.java:60)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getFinalizedBlocks(FsDatasetImpl.java:1510)
>         - locked <0x00000002c00410f8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:437)
>         - locked <0x00000002c00410f8> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
>         at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:404)
>         at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:360)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> This problem is likely related to the large number of blocks on each datanode.
> After reviewing the DirectoryScanner behavior, I think the {{synchronized(dataset)}} lock in {{scan()}} can be removed safely.
> Since {{dataset.getFinalizedBlocks()}} is itself a synchronized method, the memReport is already generated from the volumeMap under the lock. After that, the outer {{synchronized(dataset)}} is needless, yet I have seen the subsequent while loop run for 20 seconds while holding it.
> Yes, other blocks can be added to or deleted from memory or disk in the meantime, but that does not affect computing the differences between the two views, because the later fixing step handles both sides.
>
> {code:java}
> void scan() {
>   clear();
>   Map<String, ScanInfo[]> diskReport = getDiskReport();
>
>   // Hold FSDataset lock to prevent further changes to the block map
>   synchronized(dataset) { // remove synchronized?
>     for (Entry<String, ScanInfo[]> entry : diskReport.entrySet()) {
>       String bpid = entry.getKey();
>       ScanInfo[] blockpoolReport = entry.getValue();
>
>       Stats statsRecord = new Stats(bpid);
>       stats.put(bpid, statsRecord);
>       LinkedList<ScanInfo> diffRecord = new LinkedList<ScanInfo>();
>       diffs.put(bpid, diffRecord);
>
>       statsRecord.totalBlocks = blockpoolReport.length;
>       // getFinalizedBlocks is synchronized
>       List<FinalizedReplica> bl = dataset.getFinalizedBlocks(bpid);
>       FinalizedReplica[] memReport = bl.toArray(new FinalizedReplica[bl.size()]);
>       Arrays.sort(memReport); // Sort based on blockId
>
>       int d = 0; // index for blockpoolReport
>       int m = 0; // index for memReport
>       while (m < memReport.length && d < blockpoolReport.length) {
>         // may take 20s to finish this loop
>       }
>     } // end for
>   } // end synchronized
> }
> {code}
>
> Say block A is added and block B is deleted after {{dataset.getFinalizedBlocks(bpid)}} has been called; that is, both the disk files and the in-memory map have changed. We already hold a snapshot of the finalized blocks taken beforehand, and we only care about the consistency of those blocks.
> If the difference report occasionally contains the deleted block B, the later {{checkAndUpdate}} will handle it in the right way.
> I've compiled this patch and will test it in my environment.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
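[Editor's note] The snapshot-then-diff argument in the issue above can be illustrated with a minimal, self-contained Java sketch. All names here (SnapshotDiff, snapshot, diff) are hypothetical illustrations, not HDFS APIs: the in-memory block set is copied under the lock (like getFinalizedBlocks), the sorted two-pointer diff runs lock-free (like scan's while loop), and changes that race with the scan merely surface as extra diff entries for a later re-check (like checkAndUpdate) to resolve.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SnapshotDiff {
    private final Set<Long> memBlocks = new HashSet<>();

    public synchronized void add(long blockId) { memBlocks.add(blockId); }
    public synchronized void delete(long blockId) { memBlocks.remove(blockId); }

    // Copy the in-memory view under the lock, analogous to getFinalizedBlocks():
    // after this returns, the diff needs no lock because it works on the copy.
    public synchronized long[] snapshot() {
        long[] snap = memBlocks.stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(snap);
        return snap;
    }

    // Runs WITHOUT the lock: sorted two-pointer walk over both reports,
    // collecting block ids that appear in only one of them.
    public static List<Long> diff(long[] memReport, long[] diskReport) {
        List<Long> diffs = new ArrayList<>();
        int m = 0, d = 0;
        while (m < memReport.length && d < diskReport.length) {
            if (memReport[m] == diskReport[d]) { m++; d++; }          // consistent
            else if (memReport[m] < diskReport[d]) diffs.add(memReport[m++]); // memory only
            else diffs.add(diskReport[d++]);                          // disk only
        }
        while (m < memReport.length) diffs.add(memReport[m++]);
        while (d < diskReport.length) diffs.add(diskReport[d++]);
        return diffs;
    }

    public static void main(String[] args) {
        SnapshotDiff ds = new SnapshotDiff();
        ds.add(1); ds.add(2); ds.add(3);
        long[] mem = ds.snapshot();
        // Concurrent change AFTER the snapshot: block 4 added, block 2 deleted.
        ds.add(4); ds.delete(2);
        long[] disk = {1, 3, 4}; // what a disk scan would now observe
        // The stale snapshot makes blocks 2 and 4 show up as differences;
        // a later re-check against the live state resolves both, so the
        // lock-free diff is harmless.
        System.out.println(diff(mem, disk)); // prints [2, 4]
    }
}
```

Because the diff only consumes the immutable snapshot and the disk report, holding the dataset lock during the walk buys nothing except blocking every other dataset operation for the duration of the loop.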