I'm trying to use a small cluster to make sure I understand the setup
and have my code running before going to a big cluster. I have two
machines. I've followed the tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
I have been using 0.20.203 -- is this the most stable version of pre-1.0
code?
The cluster seemed fine for some time except for the occasional HDFS
corruption, a known issue. I have run mostly unaltered Mahout code with
success.
However, I am now getting consistent errors with Mahout and Bixo
(I only recently started using the latter). When I start a job from the
master, say a command-line Mahout job, the slave dies pretty quickly. It
looks like spawned threads never complete and eventually kill the slave.
Hadoop may or may not recover, depending on what it is doing.
In any case, when I go to the slave and run ps -e I get a huge list of
"fuser <defunct>" entries, each with its own pid.
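To put a number on that list, something like the following rough sketch could tally the defunct entries per command name. It assumes the default four-column `ps -e` layout (PID, TTY, TIME, CMD); the function names `count_defunct` and `print_defunct_summary` are made up for illustration and are not part of Hadoop or any other tool.

```python
import subprocess
from collections import Counter


def count_defunct(ps_output: str) -> Counter:
    """Count zombie (<defunct>) processes per command name in `ps -e` output."""
    counts = Counter()
    for line in ps_output.splitlines():
        if "<defunct>" not in line:
            continue
        # Assumed layout: PID TTY TIME CMD, where CMD reads e.g. "fuser <defunct>".
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        command = fields[3].replace("<defunct>", "").strip()
        counts[command] += 1
    return counts


def print_defunct_summary() -> None:
    """Run `ps -e` on the local machine and print defunct counts per command."""
    result = subprocess.run(
        ["ps", "-e"], capture_output=True, text=True, check=True
    )
    for command, n in count_defunct(result.stdout).most_common():
        print(f"{n:6d}  {command}")
```

Calling `print_defunct_summary()` on the slave a few times while a job is running would show whether the number of defunct fuser processes keeps growing.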
The datanode log on the slave keeps repeating this message:
pat@occam:~$ tail -f hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
2012-05-03 08:39:39,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
2012-05-03 08:39:40,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
2012-05-03 08:39:41,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
2012-05-03 08:39:42,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
etc....
So far I have removed the slave from the master's config and set
replication to 1, and everything works, just slower.
Any ideas? And should I upgrade to a newer version?