Wow. How many subdirectories were there? How many jobs do you run a day?

- Aaron
On Wed, Jan 28, 2009 at 12:13 PM, David J. O'Dell <dod...@videoegg.com> wrote:

> It was failing on all the nodes both new and old.
> The problem was there were too many subdirectories under
> $HADOOP_HOME/logs/userlogs
> The fix was just to delete the subdirs and change this setting from 24
> hours(the default) to 2 hours.
> mapred.userlog.retain.hours
>
> Would have been nice if there was an error message that pointed to this.
>
>
> Aaron Kimball wrote:
> > Hi David,
> >
> > If your tasks are failing on only the new nodes, it's likely that you're
> > missing a library or something on those machines. See this Hadoop tutorial
> > http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about
> > "distributing debug scripts." These will allow you to capture stdout/err and
> > the syslog from tasks that fail.
> >
> > - Aaron
> >
> > On Wed, Jan 28, 2009 at 9:40 AM, Sagar Naik <sn...@attributor.com> wrote:
> >
> >> Pl check which nodes have these failures.
> >>
> >> I guess the new tasktrackers/machines are not configured correctly.
> >> As a result, the map-task will die and the remaining map-tasks will be
> >> sucked onto these machines
> >>
> >> -Sagar
> >>
> >> David J. O'Dell wrote:
> >>
> >>> We've been running 0.18.2 for over a month on an 8 node cluster.
> >>> Last week we added 4 more nodes to the cluster and have experienced 2
> >>> failures to the tasktrackers since then.
> >>> The namenodes are running fine but all jobs submitted will die when
> >>> submitted with this error on the tasktrackers.
> >>>
> >>> 2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction: attempt_200901280756_0012_m_000074_2
> >>> 2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner: attempt_200901280756_0012_m_000074_2 Child Error
> >>> java.io.IOException: Task process exit with nonzero status of 1.
> >>>     at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >>>     at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >>>
> >>> I tried running the tasktrackers in debug mode but the entries above are
> >>> all that show up in the logs.
> >>> As of now my cluster is down.
>
> --
> David O'Dell
> Director, Operations
> e: dod...@videoegg.com
> t: (415) 738-5152
> 180 Townsend St., Third Floor
> San Francisco, CA 94107
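
For anyone hitting the same symptom, here is a minimal Python sketch of the check and cleanup described in the thread: count the subdirectories under $HADOOP_HOME/logs/userlogs and remove the ones older than a cutoff. It is not part of Hadoop and not what David actually ran (he deleted the subdirs by hand and lowered mapred.userlog.retain.hours). The function names `count_subdirs` and `prune_old_subdirs`, the use of directory mtime as the age test, and the dry-run default are all assumptions made for illustration.

```python
import os
import shutil
import time


def count_subdirs(path):
    """Return the number of immediate subdirectories of `path`."""
    with os.scandir(path) as entries:
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def prune_old_subdirs(path, max_age_hours, now=None, dry_run=True):
    """Remove immediate subdirectories last modified more than
    `max_age_hours` ago. Returns the sorted names that were removed,
    or that would be removed when dry_run is True.

    Hypothetical helper: age is judged by directory mtime, which is an
    assumption and may not match how Hadoop itself expires user logs.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    with os.scandir(path) as entries:
        for e in entries:
            if not e.is_dir(follow_symlinks=False):
                continue
            if e.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(e.name)
    if not dry_run:
        for name in stale:
            shutil.rmtree(os.path.join(path, name))
    return sorted(stale)


if __name__ == "__main__":
    # Path taken from the thread; falls back to the current directory
    # when HADOOP_HOME is not set.
    userlogs = os.path.join(os.environ.get("HADOOP_HOME", "."), "logs", "userlogs")
    if os.path.isdir(userlogs):
        print(count_subdirs(userlogs), "subdirectories under", userlogs)
        # 2 hours mirrors the retain value David settled on; dry run only.
        print(len(prune_old_subdirs(userlogs, 2)), "older than 2 hours")
```

Running it as a script only reports counts. Deleting anything requires calling `prune_old_subdirs(..., dry_run=False)` explicitly, and that is best done with the tasktracker stopped so that logs of running attempts are not removed.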