[ https://issues.apache.org/jira/browse/HADOOP-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12490156 ]
eric baldeschwieler commented on HADOOP-1170:
---------------------------------------------
The thing to understand is that we cannot upgrade our cluster to HEAD with
this patch committed. This patch breaks us. We'll try to move forward in the
new issue rather than advocating rolling this back, but this patch did not
address the concerns we raised in this bug, and so we have a problem. I hope we
can avoid this in the future.
I'm not advocating rolling back, because I agree that these checks were not the
appropriate solution to the disk problems they were meant to address.
In case the context isn't clear: we frequently see individual drives go
read-only on our machines. This check was inserted so that the problem could be
detected early, avoiding failed jobs caused by write failures.
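For reference, the probe in question is org.apache.hadoop.util.DiskChecker.checkDir, visible in the stack trace quoted below; it is essentially a per-directory permissions test. A minimal sketch of that kind of check, assuming a plain IOException rather than Hadoop's actual exception type:

    import java.io.File;
    import java.io.IOException;

    // Illustrative sketch of a DiskChecker-style directory probe. A drive
    // that has remounted read-only typically fails the canWrite() test,
    // which is how the problem is caught before a block write fails mid-job.
    class DiskProbe {
        static void checkDir(File dir) throws IOException {
            if (!dir.exists() && !dir.mkdirs())
                throw new IOException("cannot create directory: " + dir);
            if (!dir.isDirectory())
                throw new IOException("not a directory: " + dir);
            if (!dir.canRead())
                throw new IOException("directory is not readable: " + dir);
            if (!dir.canWrite())
                throw new IOException("directory is not writable: " + dir);
        }
    }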
> Very high CPU usage on data nodes because of FSDataset.checkDataDir() on every connect
> ---------------------------------------------------------------------------------------
>
> Key: HADOOP-1170
> URL: https://issues.apache.org/jira/browse/HADOOP-1170
> Project: Hadoop
> Issue Type: Bug
> Components: dfs
> Affects Versions: 0.11.2
> Reporter: Igor Bolotin
> Fix For: 0.13.0
>
> Attachments: 1170-v2.patch, 1170.patch
>
>
> While investigating performance issues in our Hadoop DFS/MapReduce cluster I
> saw very high CPU usage by DataNode processes.
> The stack trace showed the following on most of the data nodes:
> "[EMAIL PROTECTED]" daemon prio=1 tid=0x00002aaacb5b7bd0 nid=0x5940 runnable
> [0x000000004166a000..0x000000004166ac00]
> at java.io.UnixFileSystem.checkAccess(Native Method)
> at java.io.File.canRead(File.java:660)
> at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:34)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:164)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSDir.checkDirTree(FSDataset.java:168)
> at
> org.apache.hadoop.dfs.FSDataset$FSVolume.checkDirs(FSDataset.java:258)
> at
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet.checkDirs(FSDataset.java:339)
> - locked <0x00002aaab6fb8960> (a
> org.apache.hadoop.dfs.FSDataset$FSVolumeSet)
> at org.apache.hadoop.dfs.FSDataset.checkDataDir(FSDataset.java:544)
> at
> org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:535)
> at java.lang.Thread.run(Thread.java:595)
> I understand that it would take a while to check the entire data directory,
> as we have some 180,000 blocks/files in there. But what really bothers me is
> that, from the code, this check is executed for every client connection to
> the DataNode, which also means for every task executed in the cluster. Once I
> commented out the check and restarted the datanodes, performance went up and
> CPU usage dropped to a reasonable level.
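To make the cost concrete: checkDirTree recurses over the whole tree, calling File.canRead() (a checkAccess syscall) on every directory and file, so with ~180,000 blocks each client connection pays for a scan of that entire tree. One obvious mitigation is to decouple the scan from the connection rate by running it at most once per interval. A sketch under that assumption (illustrative only, with invented names; this is not the patch that was committed):

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative throttle: run the expensive directory scan at most once
    // per MIN_INTERVAL_MS instead of on every client connection.
    class ThrottledDirCheck {
        private static final long MIN_INTERVAL_MS = 60_000; // once a minute
        private final AtomicLong lastCheck = new AtomicLong(0);

        void maybeCheckDataDir(Runnable expensiveScan) {
            long now = System.currentTimeMillis();
            long last = lastCheck.get();
            // Only one caller wins the compareAndSet and pays for the scan;
            // concurrent callers within the interval skip it entirely.
            if (now - last >= MIN_INTERVAL_MS && lastCheck.compareAndSet(last, now)) {
                expensiveScan.run();
            }
        }
    }

A fixed interval would still detect a read-only remount within a bounded delay, while the per-connection CPU cost drops to a clock read and a compare.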