[
https://issues.apache.org/jira/browse/HADOOP-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654346#action_12654346
]
Joydeep Sen Sarma commented on HADOOP-4780:
-------------------------------------------
ignore my last comment. i am not sure what's wrong - but i can't replicate that
result.
anyway:
- with du --summarize:
real 0m0.771s
user 0m0.113s
sys 0m0.649s
- with current getDU implementation:
real 0m7.368s
user 0m3.552s
sys 0m3.737s
- with current getDU and not following symlinks (so it makes no difference):
real 0m6.070s
user 0m3.230s
sys 0m3.605s
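for comparison, a rough sketch of what shelling out to du could look like (the duSummarize helper and the ProcessBuilder wiring below are just my illustration, not the actual FileUtil/DistributedCache code):

  import java.io.BufferedReader;
  import java.io.File;
  import java.io.IOException;
  import java.io.InputStreamReader;

  public class DuSketch {
    // hypothetical helper: run "du -sk <dir>" once and convert KB to bytes.
    // sketch only - not the real hadoop implementation.
    static long duSummarize(File dir) throws IOException, InterruptedException {
      Process p = new ProcessBuilder("du", "-sk", dir.getAbsolutePath()).start();
      try (BufferedReader r =
               new BufferedReader(new InputStreamReader(p.getInputStream()))) {
        String line = r.readLine();          // e.g. "123456\t/mnt/d1/..."
        if (p.waitFor() != 0 || line == null) {
          throw new IOException("du failed for " + dir);
        }
        return Long.parseLong(line.split("\\s+")[0]) * 1024L;
      }
    }

    public static void main(String[] args) throws Exception {
      System.out.println(duSummarize(new File(args[0])));
    }
  }

a single external du walks the tree in C rather than crossing into native code from java once per file, which presumably accounts for most of the gap in the timings above.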
but another problem i am beginning to realize is why there are so many
files in the distributed cache to begin with. i had previously measured on a
random node - but looking at a problematic node:
ls -lR /mnt/d1/mapred/local/taskTracker/jobcache|wc -l
206479
wow! doing an ls -lrt:
total 2944
drwxr-xr-x 3 root root 4096 Nov 25 14:31 job_200811251239_0375
drwxr-xr-x 2 root root 4096 Nov 25 14:36 job_200811251239_0395
drwxr-xr-x 3 root root 4096 Nov 25 14:39 job_200811251239_0416
hmmm .. this is many many days old. something is wrong. is there another known
problem with things not getting deleted?
one theory is that once this directory gets too big - there's no opportunity to
clean out the dir (since task spawns fail or get beaten by speculative tasks on
other nodes while doing a getDU()). or is there any other known bug in
hadoop-0.17 with jobcache not getting cleared out? (on the bright side - at
least we know how to fix these nodes even without a software fix)
> Task Tracker burns a lot of cpu in calling getLocalCache
> ---------------------------------------------------------
>
> Key: HADOOP-4780
> URL: https://issues.apache.org/jira/browse/HADOOP-4780
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.19.0
> Reporter: Runping Qi
> Attachments: 4780.patch
>
>
> I noticed that, many times, a task tracker maxes out up to 6 cpus.
> During that time, iostat showed that the majority of it was system cpu.
> That situation can last for quite a long time.
> During that time, I saw a number of threads were in the following state:
> java.lang.Thread.State: RUNNABLE
> at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
> at
> java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
> at java.io.File.exists(File.java:733)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:399)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
> at
> org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:176)
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:140)
> I suspect that getLocalCache is too expensive.
> And calling it for every task initialization seems too wasteful.
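For context, the recursion visible in the trace above is roughly of this shape (a sketch of the pattern, not the verbatim FileUtil source); every entry in the tree costs at least one native call such as getBooleanAttributes0, so on the numbers reported earlier a jobcache tree with ~200k entries takes several seconds per getLocalCache call:

  import java.io.File;

  public class GetDuSketch {
    // roughly the shape of the recursive walk behind the FileUtil.getDU frames
    // in the trace: one or more native File calls per entry, recursing into
    // directories. sketch only; the real method may differ in detail.
    static long getDU(File dir) {
      if (!dir.exists()) {           // File.exists -> getBooleanAttributes0 (native)
        return 0;
      }
      if (!dir.isDirectory()) {
        return dir.length();
      }
      long size = dir.length();
      File[] children = dir.listFiles();
      if (children != null) {
        for (File child : children) {
          size += getDU(child);      // one stack frame per directory level
        }
      }
      return size;
    }

    public static void main(String[] args) {
      System.out.println(getDU(new File(args[0])));
    }
  }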
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.