Re: sudden instability in 0.18.2

Aaron Kimball Wed, 28 Jan 2009 17:55:10 -0800

Wow. How many subdirectories were there? How many jobs do you run a day?

- Aaron


On Wed, Jan 28, 2009 at 12:13 PM, David J. O'Dell <dod...@videoegg.com> wrote:

> It was failing on all the nodes, both new and old.
> The problem was that there were too many subdirectories under
> $HADOOP_HOME/logs/userlogs.
> The fix was to delete the subdirs and change this setting from 24
> hours (the default) to 2 hours:
> mapred.userlog.retain.hours
>
> Would have been nice if there was an error message that pointed to this.
>
>
> Aaron Kimball wrote:
> > Hi David,
> >
> > If your tasks are failing on only the new nodes, it's likely that you're
> > missing a library or something on those machines. See this Hadoop tutorial
> > http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about
> > "distributing debug scripts." These will allow you to capture stdout/err and
> > the syslog from tasks that fail.
> >
> > - Aaron
> >
> > On Wed, Jan 28, 2009 at 9:40 AM, Sagar Naik <sn...@attributor.com> wrote:
> >
> >
> >> Please check which nodes have these failures.
> >>
> >> I guess the new tasktrackers/machines are not configured correctly.
> >> As a result, the map tasks on them will die and the remaining map tasks
> >> will be pulled onto these machines.
> >>
> >>
> >> -Sagar
> >>
> >>
> >> David J. O'Dell wrote:
> >>
> >>
> >>> We've been running 0.18.2 for over a month on an 8 node cluster.
> >>> Last week we added 4 more nodes to the cluster and have experienced 2
> >>> failures of the tasktrackers since then.
> >>> The namenodes are running fine, but every job submitted dies
> >>> with this error on the tasktrackers.
> >>>
> >>> 2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker:
> >>> LaunchTaskAction: attempt_200901280756_0012_m_000074_2
> >>> 2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner:
> >>> attempt_200901280756_0012_m_000074_2 Child Error
> >>> java.io.IOException: Task process exit with nonzero status of 1.
> >>>        at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
> >>>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
> >>>
> >>> I tried running the tasktrackers in debug mode but the entries above are
> >>> all that show up in the logs.
> >>> As of now my cluster is down.
> >>>
> >>>
> >>>
> >>>
>
> --
> David O'Dell
> Director, Operations
> e: dod...@videoegg.com
> t:  (415) 738-5152
> 180 Townsend St., Third Floor
> San Francisco, CA 94107
>
>
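
For anyone who hits the same thing: below is a minimal sketch of the manual cleanup described in the quoted message, i.e. removing old subdirectories under $HADOOP_HOME/logs/userlogs. It is not part of Hadoop. The function names (stale_subdirs, prune_userlogs) and the /opt/hadoop fallback path are hypothetical, and it assumes that a subdirectory's modification time is a good enough proxy for its age. It does not change mapred.userlog.retain.hours, which still has to be set in the Hadoop configuration.

```python
import os
import shutil
import time


def stale_subdirs(root, max_age_hours, now=None):
    """Return subdirectories of root whose mtime is older than max_age_hours."""
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    with os.scandir(root) as entries:
        for entry in entries:
            # Only real directories are considered; symlinks are skipped.
            if not entry.is_dir(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)


def prune_userlogs(root, max_age_hours, dry_run=True, now=None):
    """List stale log directories, deleting them unless dry_run is True."""
    stale = stale_subdirs(root, max_age_hours, now=now)
    if not dry_run:
        for path in stale:
            shutil.rmtree(path)
    return stale


if __name__ == "__main__":
    # The fallback install path is an assumption; adjust for your cluster.
    hadoop_home = os.environ.get("HADOOP_HOME", "/opt/hadoop")
    userlogs = os.path.join(hadoop_home, "logs", "userlogs")
    if os.path.isdir(userlogs):
        # Dry run: print what would be removed with a 2 hour cutoff.
        for path in prune_userlogs(userlogs, max_age_hours=2, dry_run=True):
            print(path)
```

Running it with dry_run=True first shows which directories would go before anything is deleted; switching to dry_run=False removes them.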
