sudden instability in 0.18.2

2009-01-28 Thread David J. O'Dell
We've been running 0.18.2 for over a month on an 8 node cluster. Last week we added 4 more nodes to the cluster and have experienced 2 tasktracker failures since then. The namenodes are running fine, but every job submitted dies with this error on the tasktrackers. 2009-01

Re: sudden instability in 0.18.2

2009-01-28 Thread Sagar Naik
Please check which nodes have these failures. I guess the new tasktrackers/machines are not configured correctly. As a result, the map tasks will die and the remaining map tasks will be sucked onto these machines. -Sagar David J. O'Dell wrote: We've been running 0.18.2 for over a month on an 8 n

Re: sudden instability in 0.18.2

2009-01-28 Thread Aaron Kimball
Hi David, If your tasks are failing on only the new nodes, it's likely that you're missing a library or something on those machines. See this Hadoop tutorial http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about "distributing debug scripts." These will allow you to capture stdout/

Re: sudden instability in 0.18.2

2009-01-28 Thread David J. O'Dell
It was failing on all the nodes, both new and old. The problem was that there were too many subdirectories under $HADOOP_HOME/logs/userlogs. The fix was just to delete the subdirs and change this setting from 24 hours (the default) to 2 hours: mapred.userlog.retain.hours. Would have been nice if there was
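
For anyone wanting to check whether they are heading toward the same problem, here is a rough sketch that counts the subdirectories under $HADOOP_HOME/logs/userlogs and lists the ones older than a retention window. It is not part of Hadoop; the function names `list_subdirs` and `stale_subdirs` are made up for illustration, and it assumes directory modification time is a reasonable stand-in for log age. It only reports and deletes nothing.

```python
import os
import time


def list_subdirs(root):
    """Return the paths of the immediate subdirectories of root."""
    with os.scandir(root) as entries:
        return [e.path for e in entries if e.is_dir(follow_symlinks=False)]


def stale_subdirs(root, retain_hours, now=None):
    """Return (sorted) subdirectories last modified more than retain_hours ago.

    Assumption: mtime approximates the age of the task logs inside.
    """
    if now is None:
        now = time.time()
    cutoff = now - retain_hours * 3600
    return sorted(p for p in list_subdirs(root) if os.lstat(p).st_mtime < cutoff)


if __name__ == "__main__":
    # HADOOP_HOME is read from the environment; fall back to the current dir.
    hadoop_home = os.environ.get("HADOOP_HOME", ".")
    userlogs = os.path.join(hadoop_home, "logs", "userlogs")
    if os.path.isdir(userlogs):
        total = len(list_subdirs(userlogs))
        stale = stale_subdirs(userlogs, retain_hours=2)
        print("%d subdirectories, %d older than 2 hours" % (total, len(stale)))
    else:
        print("no such directory: %s" % userlogs)
```

Running this periodically on each tasktracker (from cron, say) would show the count growing well before jobs start to fail.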

Re: sudden instability in 0.18.2

2009-01-28 Thread Aaron Kimball
Wow. How many subdirectories were there? How many jobs do you run a day? - Aaron On Wed, Jan 28, 2009 at 12:13 PM, David J. O'Dell wrote: > It was failing on all the nodes both new and old. > The problem was there were too many subdirectories under > $HADOOP_HOME/logs/userlogs > The fix was just