Aside from the cleanup question, it looks like you are hitting ext3's limit
on subdirectories per directory (31,998, since a directory can have at most
32,000 links).
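
A quick way to check how close a node is to that limit is to count the
entries under userlogs. Below is a minimal Python sketch, not anything that
ships with Hadoop: the function names are made up for illustration, and the
constant assumes the filesystem really is ext3.

```python
import os
import time

# Assumption: the filesystem is ext3, which allows at most 32,000 links per
# inode. A directory already uses two (its own entry and "."), so it can
# hold at most 31,998 subdirectories.
EXT3_MAX_SUBDIRS = 31998


def count_subdirs(path):
    """Return the number of immediate subdirectories of path."""
    return sum(1 for e in os.scandir(path) if e.is_dir(follow_symlinks=False))


def stale_subdirs(path, max_age_days, now=None):
    """Return sorted names of immediate subdirectories of path whose
    modification time is more than max_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        e.name
        for e in os.scandir(path)
        if e.is_dir(follow_symlinks=False)
        and e.stat(follow_symlinks=False).st_mtime < cutoff
    )
```

On a node, count_subdirs("/disk1/userlogs") compared against
EXT3_MAX_SUBDIRS shows how much headroom is left, and
stale_subdirs("/disk1/userlogs", 7) lists job directories untouched for more
than a week (the 7 is an arbitrary example). The sketch only reports; it
does not delete anything.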

Joep

On Mar 6, 2012, at 10:22 AM, Chris Curtin <curtin.ch...@gmail.com> wrote:

> Hi,
> 
> We had a fun morning trying to figure out why our cluster was failing jobs,
> removing nodes from the cluster etc. The majority of the errors were
> something like:
> 
> Error initializing attempt_201203061035_0047_m_000002_0:
> org.apache.hadoop.util.Shell$ExitCodeException: chmod: cannot access `/disk1/userlogs/job_201203061035_0047': No such file or directory
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:255)
>         at org.apache.hadoop.util.Shell.run(Shell.java:182)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:375)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:461)
>         at org.apache.hadoop.util.Shell.execCommand(Shell.java:444)
>         at org.apache.hadoop.fs.RawLocalFileSystem.execCommand(RawLocalFileSystem.java:533)
>         at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:524)
>         at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
>         at org.apache.hadoop.mapred.JobLocalizer.initializeJobLogDir(JobLocalizer.java:240)
>         at org.apache.hadoop.mapred.DefaultTaskController.initializeJob(DefaultTaskController.java:216)
>         at org.apache.hadoop.mapred.TaskTracker$4.run(TaskTracker.java:1352)
> 
> Finally we shut down the entire cluster and found that the 'userlogs'
> directory on the failed nodes had 30,000+ directories and on the 'live'
> nodes 25,000+. Judging by creation timestamps, a node falls over at around
> the time the 30,000th directory is added.
> 
> Many of the directories are weeks old and a few are months old.
> 
> Deleting ALL the directories on all the nodes allowed us to bring the
> cluster back up and run jobs again. (Some users are claiming it is running
> faster now?)
> 
> Our question: what is supposed to be cleaning up these directories? How
> often does that process or step run?
> 
> We are running CDH3u3.
> 
> Thanks,
> 
> Chris