I'm trying to use a small cluster to make sure I understand the setup and have my code running before moving to a big cluster. I have two machines and followed the tutorial here: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ I have been using 0.20.203. Is this the most stable of the pre-1.0 releases?

The cluster seemed fine for some time except for the occasional HDFS corruption, a known issue. I have mostly run unaltered Mahout code, with success.

However, I am now getting consistent errors with Mahout and Bixo (which I only recently started using). When I start a job from the master, say a command-line Mahout job, the slave dies pretty quickly. It looks like spawned threads never complete and kill the slave. Hadoop may or may not recover, depending on what it is doing.

In any case, when I go to the slave and run ps -e, I get a huge list of

   "fuser <defunct>" with a long list of pids.

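A minimal sketch for narrowing this down, assuming a Linux slave with the procps version of ps: the listing can be filtered to defunct (zombie) entries together with their parent PIDs, which should show which process spawned them and never reaped them. The helper name list_zombies is hypothetical and only used for illustration.

```shell
# list_zombies is a hypothetical helper name, not part of Hadoop.
# It reads `ps -eo pid,ppid,stat,comm` style output on stdin, skips the
# header row, and keeps rows whose state column starts with Z (zombie).
list_zombies() {
    awk 'NR > 1 && $3 ~ /^Z/ { print $1, $2, $4 }'
}

# Print "PID PPID COMMAND" for every defunct process on this machine.
ps -eo pid,ppid,stat,comm | list_zombies
```

If every row shares the same PPID, running ps -p with that PID shows which daemon is the parent.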

The datanode log on the slave keeps repeating this message:

   pat@occam:~$ tail -f hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
   2012-05-03 08:39:39,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   2012-05-03 08:39:40,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   2012-05-03 08:39:41,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   2012-05-03 08:39:42,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   etc.

So far I have removed the slave from the master's config and set replication to 1. Everything works that way, just slower.

Any ideas? And should I upgrade to a newer version?


