On 12.11.2012 at 14:48, Drew Kitchen wrote:

> Dear List,
> 
> I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it seems to be
> working but with one semi-major glitch. (Why iMacs, you ask...well, they are what I
> inherited from a guy that moved his lab...5 iMacs and various other boxes.)
> 
> I compiled the OGE source locally, and that went great after I tweaked it to find
> darwin-x64 and whatnot. Installation went great, following the wonderful install vids
> that have been posted for GE on Mac OS X. I have qmaster running on dhcp80fff96b, with
> three execution hosts (dhcp80fff96b, dhcp80fff9b6, and dhcp80fff9d0), and an NFS share
> between them (where GE resides). Passwordless ssh is enabled for the GE owner, so the
> boxes should be able to communicate.

Passwordless ssh shouldn't be necessary for the operation of OGE. It *might*
be needed for the installation, but even that can be done without it by
running the installation locally on each machine.


> So, this is where the problems arise: in all.q, the execution host on the master node
> running qmaster throws an E status.
> 
> <cut>
> dhcp80fff96b:~ akitchen$ qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> [email protected]   0/0/2 0.02     darwin-x64    E
> ---------------------------------------------------------------------------------
> [email protected]   0/0/2 0.00     darwin-x64
> ---------------------------------------------------------------------------------
> [email protected]   0/0/2 0.00     darwin-x64
> <cut>
> 
> I can submit jobs and they will be successfully farmed out to the external execution
> hosts, so it would seem that everything is fine and dandy. Meanwhile, the execution
> daemon is working on the master node.
> 
> <cut>
> dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
> 11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
> <cut>
> 
> I've tried just about everything (even rebooting the master node), and nothing seems to
> solve this. I've looked in the spool messages to troubleshoot, and I get a cryptic
> "commlib error".
> 
> <cut>
> 11/07/2012 15:27:47|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 10:43:00|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 10:43:02|  main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
> 11/08/2012 10:43:03|  main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
> 11/08/2012 10:43:03|  main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)

Are ports 6444 and 6445 open in the firewalls on all machines?
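
One way to check this is a plain TCP connect to both ports, run from each
execution host against the master. The following is only a minimal sketch,
assuming the default ports 6444 (qmaster) and 6445 (execd); the helper name
port_is_open is made up for illustration and is not part of Grid Engine.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; not a Grid Engine tool.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, port_is_open("dhcp80fff96b.state.edu", 6444) could be called
from each execution host. A successful connect only shows that something is
listening and that no firewall dropped the packets; it does not prove that
the listener is the Grid Engine daemon, so qping remains the better test
once the port is known to be reachable.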

Do all machines always get the same address?
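
The dhcp... host names suggest the addresses may come from DHCP, in which
case static /etc/hosts entries can go stale after a lease change. As a rough
sketch of how one might compare the two, assuming you collect the current
address of each machine yourself (for instance from ifconfig on that box):
the function names and the 192.0.2.x addresses in the usage note are
placeholders, not values from your cluster.

```python
def parse_hosts(text):
    """Map every name and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table.setdefault(name, address)  # keep the first entry seen
    return table


def stale_entries(hosts_text, current):
    """Return {name: (hosts_address, current_address)} where the two differ.

    `current` maps host names to the addresses observed right now.
    """
    table = parse_hosts(hosts_text)
    return {
        name: (table[name], addr)
        for name, addr in current.items()
        if name in table and table[name] != addr
    }
```

Calling stale_entries(open("/etc/hosts").read(), {"dhcp80fff9b6.state.edu":
"192.0.2.50"}) returns an empty dict when the file still matches, and the
mismatching pairs otherwise.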

-- Reuti


> 11/08/2012 10:43:35|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 10:52:45|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 12:31:14|  main|dhcp80fff96b|I|controlled shutdown 2011.11p1
> 11/08/2012 12:31:14|  main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> <cut>
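
To pick the warnings and errors out of a longer messages file, something
like the sketch below may help. It assumes the pipe-separated layout seen in
the excerpt above; the field names (timestamp, thread, host, level, text)
are my reading of those lines and not an official format description, and
both function names are made up for illustration.

```python
def parse_message_line(line):
    """Split one messages-file line into its five pipe-separated fields.

    Returns None for lines that do not have the expected layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, text = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "text": text,
    }


def problems(lines, levels=("E", "W")):
    """Return the parsed records whose level is one of `levels`."""
    records = (parse_message_line(line) for line in lines)
    return [r for r in records if r is not None and r["level"] in levels]
```

With the file opened as f, problems(f) gives the W and E records in order,
which makes it easier to see what happened just before the queue went into
the E state.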
> 
> Otherwise, everything seems to be running fine. I've scrounged around and found a couple
> Mac Minis that I'd like to add to the mini-cluster, but I'd rather figure this out
> before adding them (and maybe shifting qmaster to one of them).
> 
> Any help would be greatly appreciated!
> 
> Cheers and best,
> Drew
> 
> P.S. Here is some more info for anyone curious....
> 
> 
> dhcp80fff96b:~ akitchen$ hostname
> dhcp80fff96b.state.edu
> 
> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
> Hostname: dhcp80fff96b.state.edu
> Aliases:  ANTH-M014 dhcp80fff96b
> Host Address(es): XXX.XXX.XXX.107
> 
> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
> Hostname: dhcp80fff96b.state.edu
> Aliases:  ANTH-M014 dhcp80fff96b
> Host Address(es): XXX.XXX.XXX.107
> 
> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
> Hostname: dhcp80fff96b.state.edu
> Aliases:  ANTH-M014 dhcp80fff96b
> Host Address(es): XXX.XXX.XXX.107
> 
> dhcp80fff96b:~ akitchen$ cat /etc/hosts
> ##
> # Host Database
> #
> # localhost is used to configure the loopback interface
> # when the system is booting.  Do not change this entry.
> ##
> 127.0.0.1    localhost
> 255.255.255.255    broadcasthost
> ::1             localhost
> fe80::1%lo0    localhost
> XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
> XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
> XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0
> 
> dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
> group_name @allhosts
> hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
>         dhcp80fff9b6.state.edu
> 
> dhcp80fff96b:~ akitchen$ qconf -sel
> dhcp80fff96b.state.edu
> dhcp80fff9b6.state.edu
> dhcp80fff9d0.state.edu
> 
> dhcp80fff96b:~ akitchen$ qconf -ss
> dhcp80fff96b.state.edu
> 
> dhcp80fff96b:~ akitchen$ qconf -sh
> dhcp80fff96b.state.edu
> dhcp80fff9b6.state.edu
> dhcp80fff9d0.state.edu
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

