Aside from cleanup, it seems like you are running into the maximum number of subdirectories per directory on ext3 (roughly 32,000).
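If it helps, here is a rough sketch for checking how close a node is to that limit and which log directories look stale. It only reports and does not delete anything. The 31,998 figure comes from ext3's cap of 32,000 links per inode; the function names, the 7-day threshold, and the use of /disk1/userlogs (taken from your error message) are my own assumptions for illustration, not anything Hadoop provides.

```python
import os
import time

# ext3 caps an inode at 32,000 hard links. A directory already uses two
# ("." and its entry in the parent), leaving 31,998 for subdirectories.
EXT3_MAX_SUBDIRS = 31998


def count_subdirs(path):
    """Return the number of immediate subdirectories of path."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def headroom(path, limit=EXT3_MAX_SUBDIRS):
    """Return how many more subdirectories path can hold before hitting limit."""
    return limit - count_subdirs(path)


def stale_subdirs(path, max_age_days, now=None):
    """Return sorted names of immediate subdirectories last modified
    more than max_age_days ago (hypothetical retention threshold)."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    with os.scandir(path) as entries:
        return sorted(
            e.name
            for e in entries
            if e.is_dir(follow_symlinks=False)
            and e.stat(follow_symlinks=False).st_mtime < cutoff
        )


if __name__ == "__main__":
    log_dir = "/disk1/userlogs"  # path from the error message; adjust per node
    if os.path.isdir(log_dir):
        print("subdirectories:", count_subdirs(log_dir))
        print("headroom before ext3 limit:", headroom(log_dir))
        print("older than 7 days:", len(stale_subdirs(log_dir, 7)))
```

Running this on each node before and after a cleanup should show whether the directory count is the thing that correlates with the failures.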
Joep

Sent from my iPhone

On Mar 6, 2012, at 10:22 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:

> Hi,
>
> We had a fun morning trying to figure out why our cluster was failing jobs,
> removing nodes from the cluster etc. The majority of the errors were
> something like:
>
> Error initializing attempt_201203061035_0047_m_000002_0:
> org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access
> `/disk1/userlogs/job_201203061035_0047': No such file or directory
>
>     at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
>     at org.apache.hadoop.util.Shell.run(Shell.java:182)
>     at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>     at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>     at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533)
>     at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524)
>     at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
>     at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
>     at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216)
>     at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352)
>
> Finally we shut down the entire cluster and found that the 'userlogs'
> directory on the failed nodes had 30,000+ directories and the 'live' nodes
> 25,000+. Looking at creation timestamps, it looks like a node falls over
> around the time the 30,000th directory is added.
>
> Many of the directories are weeks old and a few were months old.
>
> Deleting ALL the directories on all the nodes allowed us to bring the
> cluster up and things to run again. (Some users are claiming it is running
> faster now?)
>
> Our question: what is supposed to be cleaning up these directories? How
> often is that process or step taken?
>
> We are running CDH3u3.
>
> Thanks,
>
> Chris