As soon as a job completes, its jobcache should be cleared. Check the 
mapred.local.dir setting in your mapred-site.xml and make sure the job cleanup 
step is successful in the web UI. Setting your job's intermediate output 
setting to true will keep the jobcache folder smaller.
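
To confirm which local directories the TaskTracker actually uses, the mapred.local.dir value can be read straight from the configuration. The following is a minimal sketch, not an official Hadoop tool: the function name read_local_dirs and the SAMPLE configuration are made up for illustration, and the real location of mapred-site.xml depends on your installation.

```python
import xml.etree.ElementTree as ET


def read_local_dirs(xml_text, key="mapred.local.dir"):
    """Return the directories listed under `key` in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == key:
            value = prop.findtext("value", default="")
            # The value is a comma-separated list of local directories.
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


# Hypothetical sample configuration, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>mapred.local.dir</name>
    <value>/data1/mapred/local, /data2/mapred/local</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    for directory in read_local_dirs(SAMPLE):
        print(directory)
```

For a real file, read its contents first (or switch to ET.parse on the path) and pass the text to the same function.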



Artem Ervits
Data Analyst
New York Presbyterian Hospital

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
Sent: Thursday, January 10, 2013 07:37 AM
To: user@hadoop.apache.org <user@hadoop.apache.org>
Subject: Re: JobCache directory cleanup

Hi,

On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov 
<itretya...@griddynamics.com> wrote:
Thanks for replies!

Hemanth,
I can see the following exception in the TaskTracker log: 
https://issues.apache.org/jira/browse/MAPREDUCE-5
But I'm not sure whether it is related to this issue.

> Now, when a job completes, the directories under the jobCache must get 
> automatically cleaned up. However it doesn't look like this is happening in 
> your case.

So, if I have no running jobs, the jobcache directory should be empty. Is that correct?


That is correct. I just verified it with my Hadoop 1.0.2 installation.

Thanks
Hemanth



On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala 
<yhema...@thoughtworks.com> wrote:
Hi,

The directory name you have provided is 
/data?/mapred/local/taskTracker/persona/jobcache/. This directory is used by 
the TaskTracker (slave) daemons to localize job files when the tasks are run on 
the slaves.

Hence, I don't think this is related to the parameter 
"mapreduce.jobtracker.retiredjobs.cache.size", which is a parameter related to 
the jobtracker process.

Now, when a job completes, the directories under the jobCache must get 
automatically cleaned up. However it doesn't look like this is happening in 
your case.

Could you please look at the logs of the TaskTracker machine where this has 
filled up, to see if there are any errors that could give clues?
Also, since this is a CDH release, it could be a problem specific to that, and 
reaching out on the CDH mailing lists may also help.

Thanks
hemanth

On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov 
<itretya...@griddynamics.com> wrote:
Hello!

I've found that the jobcache directory has become very large on our cluster, e.g.:

# du -sh /data?/mapred/local/taskTracker/user/jobcache
465G    /data1/mapred/local/taskTracker/user/jobcache
464G    /data2/mapred/local/taskTracker/user/jobcache
454G    /data3/mapred/local/taskTracker/user/jobcache

And it stores information for about 100 jobs:

# ls -1 /data?/mapred/local/taskTracker/persona/jobcache/  | sort | uniq | wc -l
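
The same count can be sketched in Python. One caveat with the shell pipeline above: when ls is given several directories it also prints a header line per directory, so the number is slightly inflated. The function name distinct_job_dirs is made up for illustration, and the pattern in the usage line simply mirrors the path used above.

```python
import glob
import os


def distinct_job_dirs(pattern):
    """Return the sorted distinct entry names found under every directory matching `pattern`."""
    names = set()
    for jobcache in glob.glob(pattern):
        if os.path.isdir(jobcache):
            # The same job id can appear on several disks; the set keeps it once.
            names.update(os.listdir(jobcache))
    return sorted(names)


if __name__ == "__main__":
    jobs = distinct_job_dirs("/data?/mapred/local/taskTracker/persona/jobcache")
    print(len(jobs))
```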

I've found that there is the following parameter:

<property>
  <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
  <value>1000</value>
  <description>The number of retired job status to keep in the cache.
  </description>
</property>

So, if I understand it correctly, it is intended to control the job cache size 
by limiting the number of jobs to keep cache for.

Also, I've seen that some Hadoop users use a cron job to clean up the jobcache: 
http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually 
(http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3c99484d561002100143s4404df98qead8f2cf687a7...@mail.gmail.com%3E)
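
A cron-driven cleanup of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: the function names stale_job_dirs and clean are made up, and "older than N days by directory modification time" is a heuristic I am assuming, not something Hadoop defines. A directory's modification time changes only when entries directly inside it are added or removed, so a long-running job can look stale; keep dry_run=True until the reported list has been checked against the running jobs.

```python
import os
import shutil
import time


def stale_job_dirs(jobcache_dirs, max_age_days, now=None):
    """Return job directories whose modification time is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for jobcache in jobcache_dirs:
        if not os.path.isdir(jobcache):
            continue
        for name in sorted(os.listdir(jobcache)):
            path = os.path.join(jobcache, name)
            if os.path.isdir(path) and os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale


def clean(jobcache_dirs, max_age_days, dry_run=True):
    """Report stale job directories; delete them only when dry_run is False."""
    paths = stale_job_dirs(jobcache_dirs, max_age_days)
    for path in paths:
        if dry_run:
            print("would remove", path)
        else:
            shutil.rmtree(path)
    return paths
```

Called from cron as, for example, clean([...], max_age_days=7), this only prints candidates; passing dry_run=False performs the deletion.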

Are there other approaches to controlling the jobcache size?
What is the recommended way to do it?

Thanks in advance!

P.S. We are using CDH 4.1.1.

--
Best Regards
Ivan Tretyakov

Deployment Engineer
Grid Dynamics
+7 812 640 38 76
Skype: ivan.tretyakov
www.griddynamics.com
itretya...@griddynamics.com




--------------------

This electronic message is intended to be for the use only of the named 
recipient, and may contain information that is confidential or privileged.  If 
you are not the intended recipient, you are hereby notified that any 
disclosure, copying, distribution or use of the contents of this message is 
strictly prohibited.  If you have received this message in error or are not the 
named recipient, please notify us immediately by contacting the sender at the 
electronic mail address noted above, and delete and destroy all copies of this 
message.  Thank you.



