As soon as a job completes, its jobcache should be cleared. Check the mapred.local.dir setting in your mapred-site.xml and make sure the job cleanup step is successful in the web UI. Setting your job's intermediate output compression setting to true will keep the jobcache folder smaller.
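As a rough way to run the check described above, the following Python sketch reads mapred.local.dir from a mapred-site.xml file and lists whatever is still sitting under each jobcache directory. It assumes the layout shown elsewhere in this thread (<local dir>/taskTracker/<user>/jobcache/<entry>) and that mapred.local.dir is a plain comma-separated list; it does not expand variables such as ${hadoop.tmp.dir}. The function names are made up for illustration and are not part of Hadoop.

```python
import glob
import os
import xml.etree.ElementTree as ET


def read_local_dirs(site_xml_path, prop="mapred.local.dir"):
    """Return the directories listed in `prop` (a comma-separated value)."""
    root = ET.parse(site_xml_path).getroot()
    for node in root.iter("property"):
        if (node.findtext("name") or "").strip() == prop:
            value = node.findtext("value") or ""
            return [d.strip() for d in value.split(",") if d.strip()]
    # Property not set in this file (the cluster default would apply).
    return []


def list_jobcache_entries(local_dirs):
    """Return sorted directories matching <local dir>/taskTracker/<user>/jobcache/<entry>."""
    entries = []
    for local_dir in local_dirs:
        # Assumed layout, taken from the paths quoted in this thread.
        pattern = os.path.join(local_dir, "taskTracker", "*", "jobcache", "*")
        entries.extend(p for p in glob.glob(pattern) if os.path.isdir(p))
    return sorted(entries)
```

With no jobs running, list_jobcache_entries(read_local_dirs(path_to_your_mapred_site_xml)) would be expected to come back empty if cleanup is working; anything it returns is a candidate for a closer look in the TaskTracker logs.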

Artem Ervits
Data Analyst
New York Presbyterian Hospital

From: Hemanth Yamijala [mailto:yhema...@thoughtworks.com]
Sent: Thursday, January 10, 2013 07:37 AM
To: user@hadoop.apache.org <user@hadoop.apache.org>
Subject: Re: JobCache directory cleanup

Hi,

On Thu, Jan 10, 2013 at 5:17 PM, Ivan Tretyakov <itretya...@griddynamics.com> wrote:

Thanks for replies!

Hemanth,
I could see following exception in TaskTracker log:
https://issues.apache.org/jira/browse/MAPREDUCE-5
But I'm not sure if it is related to this issue.

> Now, when a job completes, the directories under the jobCache must get
> automatically cleaned up. However it doesn't look like this is happening in
> your case.

So, If I've no running jobs, jobcache directory should be empty. Is it correct?

That is correct. I just verified it with my Hadoop 1.0.2 version

Thanks
Hemanth

On Thu, Jan 10, 2013 at 8:18 AM, Hemanth Yamijala <yhema...@thoughtworks.com> wrote:

Hi,

The directory name you have provided is /data?/mapred/local/taskTracker/persona/jobcache/. This directory is used by the TaskTracker (slave) daemons to localize job files when the tasks are run on the slaves.

Hence, I don't think this is related to the parameter "mapreduce.jobtracker.retiredjobs.cache.size", which is a parameter related to the jobtracker process.

Now, when a job completes, the directories under the jobCache must get automatically cleaned up. However it doesn't look like this is happening in your case.

Could you please look at the logs of the tasktracker machine where this has gotten filled up to see if there are any errors that could give clues ? Also, since this is a CDH release, it could be a problem specific to that - and maybe reaching out on the CDH mailing lists will also help

Thanks
hemanth

On Wed, Jan 9, 2013 at 8:11 PM, Ivan Tretyakov <itretya...@griddynamics.com> wrote:

Hello!

I've found that jobcache directory became very large on our cluster, e.g.:

# du -sh /data?/mapred/local/taskTracker/user/jobcache
465G /data1/mapred/local/taskTracker/user/jobcache
464G /data2/mapred/local/taskTracker/user/jobcache
454G /data3/mapred/local/taskTracker/user/jobcache

And it stores information for about 100 jobs:

# ls -1 /data?/mapred/local/taskTracker/persona/jobcache/ | sort | uniq | wc -l

I've found that there is following parameter:

<property>
  <name>mapreduce.jobtracker.retiredjobs.cache.size</name>
  <value>1000</value>
  <description>The number of retired job status to keep in the cache.
  </description>
</property>

So, if I got it right it intended to control job cache size by limiting number of jobs to store cache for.

Also, I've seen that some hadoop users uses cron approach to cleanup jobcache:
http://grokbase.com/t/hadoop/common-user/102ax9bze1/cleaning-jobcache-manually
(http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3c99484d561002100143s4404df98qead8f2cf687a7...@mail.gmail.com%3E)

Are there other approaches to control jobcache size?
What is more correct way to do it?

Thanks in advance!

P.S. We are using CDH 4.1.1.

--
Best Regards
Ivan Tretyakov
Deployment Engineer
Grid Dynamics
+7 812 640 38 76
Skype: ivan.tretyakov
www.griddynamics.com
itretya...@griddynamics.com

--------------------

This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited.
If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.
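
For the cron approach mentioned in the original question, a minimal Python sketch might look like the one below. It is not what the linked posts do verbatim and not a Hadoop facility; the function names, the age threshold, and the use of a directory's modification time as a staleness signal are all assumptions made for illustration. A directory's mtime only changes when entries directly inside it are added or removed, so a long-running job's directory can look old; the threshold should be comfortably longer than the longest job on the cluster. The sketch defaults to a dry run.

```python
import os
import shutil
import time


def stale_entries(jobcache_dir, max_age_days, now=None):
    """Return subdirectories of `jobcache_dir` whose mtime is older than `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for name in sorted(os.listdir(jobcache_dir)):
        path = os.path.join(jobcache_dir, name)
        # Skip plain files and symlinks; only real directories are considered.
        if os.path.isdir(path) and not os.path.islink(path):
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return stale


def clean_jobcache(jobcache_dir, max_age_days, dry_run=True, now=None):
    """Report stale entries; delete them only when dry_run is False."""
    targets = stale_entries(jobcache_dir, max_age_days, now=now)
    if not dry_run:
        for path in targets:
            shutil.rmtree(path)
    return targets
```

Calling clean_jobcache("/data1/mapred/local/taskTracker/user/jobcache", 7) would only list candidates older than seven days; passing dry_run=False actually removes them, which is best done after confirming that no running job owns any of the listed directories.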