Problem with cluster

Pat Ferrel Thu, 03 May 2012 10:09:59 -0700

I'm trying to use a small cluster to make sure I understand the setup and have my code running before going to a big cluster. I have two machines. I've followed the tutorial here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/

I have been using 0.20.203 -- is this the most stable version of pre-1.0 code?

The cluster seemed fine for some time except for the occasional HDFS corruption, a known issue. I have run mostly unaltered mahout code with success.

However, I am now getting some consistent errors with mahout and bixo (which I only recently started using). When I start a job from the master, say a command line mahout job, the slave dies pretty quickly. It looks like spawned threads never complete and kill the slave. Hadoop may or may not recover, depending on what it is doing.


In any case, when I go to the slave and run ps -e, I get a huge list of

   "fuser <defunct>" with a long list of pids.

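To see which process is leaving those defunct entries behind, it can help to group them by parent PID. The following is only a minimal sketch, assuming a Linux procps `ps` that accepts `-o pid=,ppid=,stat=,comm=`; the function names `count_zombies_by_parent` and `report` are hypothetical and are not part of Hadoop or mahout.

```python
import subprocess
from collections import Counter


def count_zombies_by_parent(ps_output):
    """Count zombie (state Z) processes per (parent PID, command name).

    Expects text in the column order pid, ppid, stat, comm, with no header.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        # Split into at most 4 fields so a command such as
        # 'fuser <defunct>' stays in one piece.
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        _pid, ppid, stat, comm = fields
        if stat.startswith("Z"):  # Z marks a defunct (zombie) process
            counts[(int(ppid), comm.strip())] += 1
    return counts


def report():
    """Run ps on the local machine and print zombie counts per parent."""
    # Assumption: procps-style ps; the trailing '=' suppresses headers.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=,ppid=,stat=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for (ppid, comm), n in count_zombies_by_parent(out).most_common():
        print(f"parent={ppid} command={comm} zombies={n}")
```

Calling `report()` on the slave prints one line per parent process. The parent PID can then be looked up with `ps -p <ppid>` to see whether the defunct entries belong to the datanode, the tasktracker, or something else.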

The datanode log on the slave keeps repeating this message:

   pat@occam:~$ tail -f hadoop-0.20.203.0/logs/hadoop-pat-datanode-occam.log
   2012-05-03 08:39:39,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   2012-05-03 08:39:40,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   2012-05-03 08:39:41,035 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   2012-05-03 08:39:42,036 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Waiting for threadgroup to exit, active threads is 1
   etc....

So far I have removed the slave from the master's config and set replication to 1, and everything works, just slower.


Any ideas? And should I upgrade to a newer version?
