On Mon, Jun 14, 2010 at 8:37 AM, Amr Awadallah <a...@cloudera.com> wrote:

> Dave,
>
>  Yes, many others are in the same situation. The recommended solution is
> to use either the Fair Scheduler or the Capacity Scheduler. These
> schedulers are much better than HOD since they take data locality into
> consideration (they don't just spin up 20 TaskTracker nodes on machines
> that have nothing to do with your data). They also don't lock down the
> nodes just for you, so as TaskTrackers are freed other jobs can use them
> immediately (as opposed to nobody being able to use them until your entire
> job is done).
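>
>  For example, switching the JobTracker over to the Fair Scheduler in
> Hadoop 0.20 is just a couple of properties in mapred-site.xml (these
> property names are from the 0.20 contrib fair scheduler; the file path
> below is illustrative, and you also need the fairscheduler jar from
> contrib/ on the JobTracker's classpath):
>
>   <property>
>     <name>mapred.jobtracker.taskScheduler</name>
>     <value>org.apache.hadoop.mapred.FairScheduler</value>
>   </property>
>   <property>
>     <name>mapred.fairscheduler.allocation.file</name>
>     <value>/opt/hadoop-0.20.2/conf/fair-scheduler.xml</value>
>   </property>
>
>  A minimal fair-scheduler.xml can start out as just an empty
> <allocations></allocations> element, with per-pool minimums added later.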
>
>  Also, if you are brave and want to try something spanking new, then I
> recommend you reach out to the Mesos guys; they have a scheduler layer
> under Hadoop that is data-locality aware:
>
> http://mesos.berkeley.edu/
>
> -- amr
>
> On Sun, Jun 13, 2010 at 9:21 PM, David Milne <d.n.mi...@gmail.com> wrote:
>
> > Ok, thanks Jeff.
> >
> > This is pretty surprising though. I would have thought many people
> > would be in my position, where they have to use Hadoop on a
> > general-purpose cluster and need it to play nice with a resource
> > manager. What do other people do in this position, if they don't use
> > HOD? Deprecated normally means there is a better alternative.
> >
> > - Dave
> >
> > On Mon, Jun 14, 2010 at 2:39 PM, Jeff Hammerbacher <ham...@cloudera.com> wrote:
> > > Hey Dave,
> > >
> > > I can't speak for the folks at Yahoo!, but from watching the JIRA, I
> > > don't think HOD is actively used or developed anywhere these days.
> > > You're attempting to use a mostly deprecated project, and hence not
> > > receiving any support on the mailing list.
> > >
> > > Thanks,
> > > Jeff
> > >
> > > On Sun, Jun 13, 2010 at 7:33 PM, David Milne <d.n.mi...@gmail.com> wrote:
> > >
> > >> Anybody? I am completely stuck here. I have no idea who else I can ask
> > >> or where I can go for more information. Is there somewhere specific
> > >> where I should be asking about HOD?
> > >>
> > >> Thank you,
> > >> Dave
> > >>
> > >> On Thu, Jun 10, 2010 at 2:56 PM, David Milne <d.n.mi...@gmail.com> wrote:
> > >> > Hi there,
> > >> >
> > >> > I am trying to get Hadoop on Demand up and running, but am having
> > >> > problems with the ringmaster not being able to communicate with HDFS.
> > >> >
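> > >> > For reference, the allocate command I'm running looks roughly like
> > >> > this (the config path and node count here are illustrative):
> > >> >
> > >> >   hod -c ~/hodrc allocate -d /home/dmilne/hadoop/cluster -n 4
> > >> >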
> > >> > The output from the hod allocate command ends with this, with full
> > >> > verbosity:
> > >> >
> > >> > [2010-06-10 14:40:22,650] CRITICAL/50 hadoop:298 - Failed to retrieve 'hdfs' service address.
> > >> > [2010-06-10 14:40:22,654] DEBUG/10 hadoop:631 - Cleaning up cluster id 34029.symphony.cs.waikato.ac.nz, as cluster could not be allocated.
> > >> > [2010-06-10 14:40:22,655] DEBUG/10 hadoop:635 - Calling rm.stop()
> > >> > [2010-06-10 14:40:22,665] DEBUG/10 hadoop:637 - Returning from rm.stop()
> > >> > [2010-06-10 14:40:22,666] CRITICAL/50 hod:401 - Cannot allocate cluster /home/dmilne/hadoop/cluster
> > >> > [2010-06-10 14:40:23,090] DEBUG/10 hod:597 - return code: 7
> > >> >
> > >> >
> > >> > I've attached the hodrc file below, but briefly: HOD is supposed to
> > >> > provision an HDFS cluster as well as a Map/Reduce cluster, and it
> > >> > seems to be failing to do so. The ringmaster log looks like this:
> > >> >
> > >> > [2010-06-10 14:36:05,144] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> > >> > [2010-06-10 14:36:05,145] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> > >> > [2010-06-10 14:36:05,147] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> > >> > [2010-06-10 14:36:06,195] DEBUG/10 ringMaster:479 - getServiceAddr name: hdfs
> > >> > [2010-06-10 14:36:06,197] DEBUG/10 ringMaster:487 - getServiceAddr service: <hodlib.GridServices.hdfs.Hdfs instance at 0x8f97e8>
> > >> > [2010-06-10 14:36:06,198] DEBUG/10 ringMaster:504 - getServiceAddr addr hdfs: not found
> > >> >
> > >> > ... and so on, until it gives up
> > >> >
> > >> > Any ideas why? One red flag is that when running the allocate
> > >> > command, some of the variables echoed back look dodgy:
> > >> >
> > >> > --gridservice-hdfs.fs_port 0
> > >> > --gridservice-hdfs.host localhost
> > >> > --gridservice-hdfs.info_port 0
> > >> >
> > >> > These are not what I specified in the hodrc. Are the port numbers
> > >> > just set to 0 because I am not using an external HDFS, or is this a
> > >> > problem?
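> > >> >
> > >> > For comparison, my understanding is that if I were pointing HOD at an
> > >> > externally managed HDFS, the section would look something like this
> > >> > (the hostname is made up):
> > >> >
> > >> > [gridservice-hdfs]
> > >> > external                        = True
> > >> > host                            = namenode.example.com
> > >> > fs_port                         = 8020
> > >> > info_port                       = 50070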
> > >> >
> > >> >
> > >> > The software versions involved are:
> > >> >  - Hadoop 0.20.2
> > >> >  - Python 2.5.2 (no Twisted)
> > >> >  - Java 1.6.0_20
> > >> >  - Torque 2.4.5
> > >> >
> > >> >
> > >> > The hodrc file looks like this:
> > >> >
> > >> > [hod]
> > >> > stream                          = True
> > >> > java-home                       = /opt/jdk1.6.0_20
> > >> > cluster                         = debian5
> > >> > cluster-factor                  = 1.8
> > >> > xrs-port-range                  = 32768-65536
> > >> > debug                           = 3
> > >> > allocate-wait-time              = 3600
> > >> > temp-dir                        = /scratch/local/dmilne/hod
> > >> >
> > >> > [ringmaster]
> > >> > register                        = True
> > >> > stream                          = False
> > >> > temp-dir                        = /scratch/local/dmilne/hod
> > >> > log-dir                         = /scratch/local/dmilne/hod/log
> > >> > http-port-range                 = 8000-9000
> > >> > idleness-limit                  = 864000
> > >> > work-dirs                       = /scratch/local/dmilne/hod/1,/scratch/local/dmilne/hod/2
> > >> > xrs-port-range                  = 32768-65536
> > >> > debug                           = 4
> > >> >
> > >> > [hodring]
> > >> > stream                          = False
> > >> > temp-dir                        = /scratch/local/dmilne/hod
> > >> > log-dir                         = /scratch/local/dmilne/hod/log
> > >> > register                        = True
> > >> > java-home                       = /opt/jdk1.6.0_20
> > >> > http-port-range                 = 8000-9000
> > >> > xrs-port-range                  = 32768-65536
> > >> > debug                           = 4
> > >> >
> > >> > [resource_manager]
> > >> > queue                           = express
> > >> > batch-home                      = /opt/torque-2.4.5
> > >> > id                              = torque
> > >> > options                         = l:pmem=3812M,W:X="NACCESSPOLICY:SINGLEJOB"
> > >> > #env-vars                       = HOD_PYTHON_HOME=/foo/bar/python-2.5.1/bin/python
> > >> >
> > >> > [gridservice-mapred]
> > >> > external                        = False
> > >> > pkgs                            = /opt/hadoop-0.20.2
> > >> > tracker_port                    = 8030
> > >> > info_port                       = 50080
> > >> >
> > >> > [gridservice-hdfs]
> > >> > external                        = False
> > >> > pkgs                            = /opt/hadoop-0.20.2
> > >> > fs_port                         = 8020
> > >> > info_port                       = 50070
> > >> >
> > >> > Cheers,
> > >> > Dave
> > >> >
> > >>
> > >
> >
>

I have not used it much, but I think HOD is pretty cool. I guess most people
who are looking to spin up, run a job, transfer data off, and spin down are
using EC2. HOD essentially makes private Hadoop clouds on your own hardware,
and many people probably do not have that use case. As schedulers advance and
get better, HOD becomes less attractive, but I can always see a place for it.
