So many unexpected "Lost task tracker" errors making the job to be killed

Marc Sturlese Mon, 09 May 2011 01:31:02 -0700

Hey there, I have a small cluster running Hadoop 0.20.2. Everything is
fine, but once in a while, when a job with a lot of map tasks is
running, I start getting this error:
Lost task tracker: tracker_cluster1:localhost.localdomain/127.0.0.1:xxxxx
Before the error appears, the task attempt has been running for 7h
(normally it takes 46 sec to complete). Sometimes another task
attempt is launched in parallel, takes 50 sec to complete, and so the
first one gets killed (the second one can even be launched on the same
task tracker and work). But in the end I get so many "Lost task
tracker" errors that the job gets killed.
The job ends up with some of the task trackers blacklisted.
If I kill the "zombie tasks", remove the jobtracker and tasktracker pid
files, remove the userlogs, and stop/start mapred, everything works
fine again, but some days later the error happens again.
Any idea why this happens? Could it somehow be related to having too
many attempt folders in the userlogs (even though there is space left
on the device)?
Thanks in advance.
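
One way to check the userlogs theory is to count the attempt folders on
each task tracker node. Below is a minimal sketch, assuming the attempt
folders sit directly under the userlogs directory and have names starting
with "attempt_". The function names and the warning threshold are
hypothetical and chosen only for illustration; they are not Hadoop
settings. Some filesystems limit how many subdirectories one directory can
hold, so a very high count could matter even with free disk space.

```python
import os

# Assumption: an arbitrary warning level for illustration only.
# It is not a Hadoop setting; check the limits of your own filesystem.
WARN_THRESHOLD = 30000


def count_attempt_dirs(userlogs_dir):
    """Count subdirectories directly under userlogs_dir named attempt_*."""
    with os.scandir(userlogs_dir) as entries:
        return sum(
            1
            for entry in entries
            if entry.is_dir() and entry.name.startswith("attempt_")
        )


def is_near_limit(count, threshold=WARN_THRESHOLD):
    """Return True when the folder count reaches the chosen warning level."""
    return count >= threshold
```

Calling count_attempt_dirs() with the userlogs path of a node (the path
depends on where the Hadoop log directory is configured) right before the
errors start, and again after cleaning up, would show whether the count
tracks the failures.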


--
View this message in context: 
http://lucene.472066.n3.nabble.com/So-many-unexpected-Lost-task-tracker-errors-making-the-job-to-be-killed-Options-tp2917961p2917961.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
