I think I've found my problem.  At some point about a week ago, I must
have tried to start new tasktracker processes on my worker nodes without
killing the ones that were already there.  The new processes died
immediately because their ports were already in use.  The old processes
just kept running in their old roles, happily registering with each new
JobTracker and doing tasks as requested.  The pid files that are
supposed to record the tasktrackers' pids no longer matched those
surviving processes, however, and 'bin/stop-mapred.sh' chooses its
targets from the pid files.  So I could run 'bin/stop-mapred.sh' all day
long without killing them.  I ended up killing them by hand, one node at
a time.
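
For anyone else in this spot, a loop along these lines will do it by
hand (just a sketch -- it assumes passwordless ssh, a conf/slaves file
listing the worker hostnames, and pkill on the workers):

  # Kill every TaskTracker JVM on each worker, no matter what the pid
  # files claim.  The [T] keeps the pattern from matching this ssh
  # command's own shell on the remote side.
  for host in $(cat conf/slaves); do
      echo "killing tasktrackers on $host"
      ssh "$host" "pkill -f 'org.apache.hadoop.mapred.[T]askTracker'"
  done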

These tasktrackers were still running with the *old* config values that
were in force when they started; a tasktracker only reads its config at
startup, so pushing the new values out to the worker nodes had no effect
on them.
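
Once the rogues are gone, the stop / push / start sequence Arun
describes below is the one that actually works -- roughly like this
(again just a sketch, run from the hadoop directory on the master; the
rsync call, the conf/hadoop-site.xml name, and a matching $HADOOP_HOME
layout on the workers are assumptions about the setup):

  # Stop the mapred daemons, push the edited config to every worker's
  # conf dir, then start again so the tasktrackers re-read it.
  bin/stop-mapred.sh
  for host in $(cat conf/slaves); do
      rsync -av conf/hadoop-site.xml "$host:$HADOOP_HOME/conf/"
  done
  bin/start-mapred.sh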

So.  Is there any mechanism for killing 'rogue' tasktrackers?  I'm a
little surprised that they have to be hunted down by pid rather than
being sent a shutdown request over the same channel by which they learn
of new work.

-Joel
 [EMAIL PROTECTED]

On Tue, 2008-09-23 at 14:29 -0700, Arun C Murthy wrote:
> On Sep 23, 2008, at 2:21 PM, Joel Welling wrote:
> 
> > Stopping and restarting the mapred service should push the new .xml
> > file out, should it not?  I've done 'bin/stop-mapred.sh',
> 
> No, you need to run 'bin/stop-mapred.sh', push it out to all the
> machines and then do 'bin/start-mapred.sh'.
> 
> You do see it in your job's config - but that config isn't used by the  
> TaskTrackers. They use the config in their HADOOP_CONF_DIR, which is
> why you'd need to push it to all machines.
> 
> Arun
