[ 
https://issues.apache.org/jira/browse/HADOOP-4780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654346#action_12654346
 ] 

Joydeep Sen Sarma commented on HADOOP-4780:
-------------------------------------------

ignore my last comment. i am not sure what's wrong - but i can't replicate that 
result.

anyway:

- with du --summarize:
real    0m0.771s
user    0m0.113s
sys     0m0.649s

- with current getDU implementation:

real    0m7.368s
user    0m3.552s
sys     0m3.737s

- with current getDU, not following symlinks (so symlinks make no difference):

real    0m6.070s
user    0m3.230s
sys     0m3.605s
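
fwiw, the comparison above can be reproduced with something like the sketch below (the jobcache path is just an example, and `--summarize`/`--format` are GNU coreutils options):

```shell
#!/bin/bash
# Sketch: compare one du fork against a per-entry userspace walk.
# The default path is illustrative; pass a real directory as $1.
DIR="${1:-/mnt/d1/mapred/local/taskTracker/jobcache}"

# Single fork of du; the kernel walks the tree:
time du --summarize "$DIR"

# Roughly what FileUtil.getDU() does: stat every entry from userspace
# (the Java recursion adds JVM overhead on top of this):
time find "$DIR" -exec stat --format='%s' {} + > /dev/null
```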

but another problem i am beginning to realize is why there are so many 
files in the distributed cache to begin with. i had previously measured on a 
random node - but looking at a problematic node:

ls -lR /mnt/d1/mapred/local/taskTracker/jobcache|wc -l
206479

wow! doing an ls -lrt:

total 2944
drwxr-xr-x   3 root root 4096 Nov 25 14:31 job_200811251239_0375
drwxr-xr-x   2 root root 4096 Nov 25 14:36 job_200811251239_0395
drwxr-xr-x   3 root root 4096 Nov 25 14:39 job_200811251239_0416

hmmm .. this is many many days old. something is wrong. is there another known 
problem with things not getting deleted?

one theory is that once this directory gets too big - there's no opportunity to 
clean out the dir (since task spawns fail, or get beaten by speculative tasks on 
other nodes, while stuck doing a getDU()). or is there some other known bug in 
hadoop-0.17 with jobcache not getting cleared out? (on the bright side - at 
least we know how to fix these nodes even without a software fix)
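
for the record, the manual fix can be scripted; a rough sketch (the path and the 7-day cutoff are assumptions for illustration - only safe if no live job's directory can be that old):

```shell
#!/bin/bash
# Illustrative cleanup: delete jobcache dirs untouched for over 7 days.
# Path and age cutoff are assumptions; adjust for the cluster, and confirm
# no running job's directory can be that old before using this.
JOBCACHE=/mnt/d1/mapred/local/taskTracker/jobcache
find "$JOBCACHE" -mindepth 1 -maxdepth 1 -type d -mtime +7 \
     -exec rm -rf {} +
```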

> Task Tracker  burns a lot of cpu in calling getLocalCache
> ---------------------------------------------------------
>
>                 Key: HADOOP-4780
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4780
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.19.0
>            Reporter: Runping Qi
>         Attachments: 4780.patch
>
>
> I noticed that many times, a task tracker maxes out up to 6 CPUs.
> During that time, iostat shows the majority of that was system cpu.
> That situation can last quite long.
> During that time, I saw a number of threads were in the following state:
>   java.lang.Thread.State: RUNNABLE
>         at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
>         at 
> java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:228)
>         at java.io.File.exists(File.java:733)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:399)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at org.apache.hadoop.fs.FileUtil.getDU(FileUtil.java:407)
>         at 
> org.apache.hadoop.filecache.DistributedCache.getLocalCache(DistributedCache.java:176)
>         at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:140)
> I suspect that getLocalCache is too expensive.
> And calling it for every task initialization seems very wasteful.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
