On Monday 14 June 2010 08:03 AM, David Milne wrote:
Anybody? I am completely stuck here. I have no idea who else I can ask
or where I can go for more information. Is there somewhere specific
where I should be asking about HOD?

Thank you,
Dave

The ringmaster logs should show which node was supposed to run the Namenode. That entry appears above the log lines you've posted. I don't remember the exact text, but I think it reads something like getCommand(). Once you have identified the node, check the hodring logs on that node; something must have gone wrong there.

The return code was 7, which indicates an HDFS failure. See http://hadoop.apache.org/common/docs/r0.20.0/hod_user_guide.html#The+Exit+Codes+For+HOD+Are+Not+Getting+Into+Torque, and check whether you are hitting one of the problems listed there.

HTH,
+vinod


On Thu, Jun 10, 2010 at 2:56 PM, David Milne <d.n.mi...@gmail.com> wrote:
Hi there,

I am trying to get Hadoop on Demand up and running, but am having
problems with the ringmaster not being able to communicate with HDFS.

The output from the hod allocate command ends with this, with full verbosity:

[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
'hdfs' service address.
[2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id
34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
[2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
[2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
[2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
cluster /home/dmilne/hadoop/cluster
[2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
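
For anyone sifting through longer output of this kind, here is a minimal sketch that pulls out the CRITICAL messages and the final return code. The record layout is only inferred from the lines above, and the names RECORD, parse_records and summarize are made up for illustration; they are not part of HOD. Wrapped lines are folded back into the record they belong to, since mail clients and terminals tend to break long records.

```python
import re

# Record header as it appears in the output above (an inferred layout):
# [timestamp] LEVEL/number module:line - message
RECORD = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)/(?P<num>\d+)\s+"
    r"(?P<src>\S+):(?P<line>\d+)\s+-\s+(?P<msg>.*)$"
)


def parse_records(text):
    """Parse log text into dicts, folding wrapped lines into the previous record."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = RECORD.match(line)
        if match:
            records.append(match.groupdict())
        elif records:
            # Not a record header: treat it as the continuation of a wrapped line.
            records[-1]["msg"] += " " + line
    return records


def summarize(text):
    """Return (list of CRITICAL messages, last reported return code or None)."""
    records = parse_records(text)
    critical = [r["msg"] for r in records if r["level"] == "CRITICAL"]
    code = None
    for r in records:
        found = re.search(r"return code: (\d+)", r["msg"])
        if found:
            code = int(found.group(1))
    return critical, code


SAMPLE = """\
[2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve
'hdfs' service address.
[2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate
cluster /home/dmilne/hadoop/cluster
[2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
"""

critical, code = summarize(SAMPLE)
for message in critical:
    print("CRITICAL:", message)
print("return code:", code)
```

The sample is just three records copied from the output above; pointing summarize() at the full client output should work the same way as long as the layout holds.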


I've attached the hodrc file below, but briefly HOD is supposed to
provision an HDFS cluster as well as a Map/Reduce cluster, and seems
to be failing to do so. The ringmaster log looks like this:

[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr
service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
[2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr
service:<hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
[2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found

... and so on, until it gives up
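
To get a feel for how often and for how long the ringmaster polled before giving up, a rough sketch like the following can count the failed lookups and the gap between them. It assumes the log layout shown above; join_wrapped and failed_lookups are illustrative names, not HOD functions.

```python
import re
from datetime import datetime

# Leading "[YYYY-MM-DD HH:MM:SS,mmm]" stamp, as seen in the ringmaster log above.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\]")


def join_wrapped(text):
    """Rebuild one string per record; lines not starting with '[' are continuations."""
    records = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def failed_lookups(text, service="hdfs"):
    """Return the timestamps of records reporting 'addr <service>: not found'."""
    needle = "getServiceAddr addr %s: not found" % service
    stamps = []
    for record in join_wrapped(text):
        match = STAMP.match(record)
        if match and needle in record:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            stamps.append(ts.replace(microsecond=int(match.group(2)) * 1000))
    return stamps


SAMPLE = """\
[2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
[2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
[2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr
addr hdfs: not found
"""

stamps = failed_lookups(SAMPLE)
print("failed lookups:", len(stamps))
if len(stamps) > 1:
    print("seconds between first and last:", (stamps[-1] - stamps[0]).total_seconds())
```

Run against the whole ringmaster log, the first and last timestamps bracket the window in which the Namenode should have registered, which may help narrow down where to look in the hodring log.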

Any ideas why? One red flag is that when running the allocate command,
some of the variables echoed back look dodgy:

--gridservice-hdfs.fs_port 0
--gridservice-hdfs.host localhost
--gridservice-hdfs.info_port 0

These are not what I specified in the hodrc. Are the port numbers just
set to 0 because I am not using an external HDFS, or is this a
problem?
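
A quick way to list exactly which echoed values differ from the hodrc is sketched below. It assumes the hodrc can be read as an ordinary INI file by Python's configparser and that the echoed options always have the form "--section.option value"; parse_echoed and diff_against_hodrc are hypothetical helper names. It only reports the differences and says nothing about whether zeros are expected when HDFS is not external.

```python
import configparser

# The relevant hodrc section and the echoed options, copied from this message.
HODRC = """\
[gridservice-hdfs]
external                        = False
pkgs                            = /opt/hadoop-0.20.2
fs_port                         = 8020
info_port                       = 50070
"""

ECHOED = """\
--gridservice-hdfs.fs_port 0
--gridservice-hdfs.host localhost
--gridservice-hdfs.info_port 0
"""


def parse_echoed(text):
    """Turn '--section.option value' lines into {(section, option): value}."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("--"):
            continue
        name, _, value = line[2:].partition(" ")
        section, _, option = name.partition(".")
        options[(section, option)] = value.strip()
    return options


def diff_against_hodrc(hodrc_text, echoed_text):
    """Return (section, option, configured, echoed) for every value that differs."""
    parser = configparser.RawConfigParser()
    parser.read_string(hodrc_text)
    rows = []
    for (section, option), echoed in sorted(parse_echoed(echoed_text).items()):
        if parser.has_option(section, option):
            configured = parser.get(section, option)
        else:
            configured = None  # not set in the hodrc at all
        if configured != echoed:
            rows.append((section, option, configured, echoed))
    return rows


rows = diff_against_hodrc(HODRC, ECHOED)
for section, option, configured, echoed in rows:
    print("%s.%s: hodrc=%r echoed=%r" % (section, option, configured, echoed))
```

A configured value of None means the option is not set in the hodrc at all, as is the case for host here.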


The software versions involved are:
  - Hadoop 0.20.2
  - Python 2.5.2 (no Twisted)
  - Java 1.6.0_20
  - Torque 2.4.5


The hodrc file looks like this:

[hod]
stream                          = True
java-home                       = /opt/jdk1.6.0_20
cluster                         = debian5
cluster-factor                  = 1.8
xrs-port-range                  = 32768-65536
debug                           = 3
allocate-wait-time              = 3600
temp-dir                        = /scratch/local/dmilne/hod

[ringmaster]
register                        = True
stream                          = False
temp-dir                        = /scratch/local/dmilne/hod
log-dir                         = /scratch/local/dmilne/hod/log
http-port-range                 = 8000-9000
idleness-limit                  = 864000
work-dirs                       = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
xrs-port-range                  = 32768-65536
debug                           = 4

[hodring]
stream                          = False
temp-dir                        = /scratch/local/dmilne/hod
log-dir                         = /scratch/local/dmilne/hod/log
register                        = True
java-home                       = /opt/jdk1.6.0_20
http-port-range                 = 8000-9000
xrs-port-range                  = 32768-65536
debug                           = 4

[resource_manager]
queue                           = express
batch-home                      = /opt/torque-2.4.5
id                              = torque
options                         = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
#env-vars                       = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python

[gridservice-mapred]
external                        = False
pkgs                            = /opt/hadoop-0.20.2
tracker_port                    = 8030
info_port                       = 50080

[gridservice-hdfs]
external                        = False
pkgs                            = /opt/hadoop-0.20.2
fs_port                         = 8020
info_port                       = 50070

Cheers,
Dave

