On 12.11.2012 at 14:48, Drew Kitchen wrote:

> Dear List,
>
> I've installed OGE on a mini-cluster of iMacs running OS X 10.6.8, and it seems to be
> working but with one semi-major glitch. (Why iMacs, you ask...well, they are what I
> inherited from a guy that moved his lab...5 iMacs and various other boxes.)
>
> I compiled the OGE source locally, and that went great after I tweaked it to find
> darwin-x64 and whatnot. Installation went great, following the wonderful install vids
> that have been posted for GE on Mac OS X. I have qmaster running on dhcp80fff96b, with
> three execution hosts (dhcp80fff96b, dhcp80fff9b6, and dhcp80fff9d0), and an NFS share
> between them (where GE resides). Passwordless ssh is enabled for the GE owner, so the
> boxes should be able to communicate.

Passwordless ssh shouldn't be necessary for operating OGE. It *might* be needed during installation, but even that can be avoided by installing locally on each host.

> So, this is where the problems arise: in all.q, the execution host on the master node
> running qmaster throws an E status.
>
> <cut>
> dhcp80fff96b:~ akitchen$ qstat -f
> queuename                      qtype resv/used/tot. load_avg arch          states
> ---------------------------------------------------------------------------------
> [email protected]                       0/0/2          0.02     darwin-x64    E
> ---------------------------------------------------------------------------------
> [email protected]                       0/0/2          0.00     darwin-x64
> ---------------------------------------------------------------------------------
> [email protected]                       0/0/2          0.00     darwin-x64
> <cut>
>
> I can submit jobs and they will be successfully farmed out to the external execution
> hosts, so it would seem that everything is fine and dandy. Meanwhile, the execution
> daemon is working on the master node.
>
> <cut>
> dhcp80fff96b:~ akitchen$ qping dhcp80fff96b.state.edu 6445 execd 1
> 11/09/2012 17:08:25 endpoint dhcp80fff96b.state.edu/execd/1 at port 6445 is up since 89828 seconds
> <cut>
>
> I've tried just about everything (even rebooting the master node), and nothing seems to
> solve this. I've looked in the spool messages to troubleshoot, and I get a cryptic
> "commlib error".
>
> <cut>
> 11/07/2012 15:27:47| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 10:43:00| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 10:43:02| main|dhcp80fff96b|E|commlib error: got read error (closing "dhcp80fff96b.state.edu/qmaster/1")
> 11/08/2012 10:43:03| main|dhcp80fff96b|W|can't register at qmaster "dhcp80fff96b.state.edu": abort qmaster registration due to communication errors
> 11/08/2012 10:43:03| main|dhcp80fff96b|E|commlib error: can't connect to service (Connection refused)

Are ports 6444 and 6445 open in the firewalls? Do all machines always get the same address?

-- Reuti

> 11/08/2012 10:43:35| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 10:52:45| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> 11/08/2012 12:31:14| main|dhcp80fff96b|I|controlled shutdown 2011.11p1
> 11/08/2012 12:31:14| main|dhcp80fff96b|I|starting up OGS/GE 2011.11p1 (darwin-x64)
> <cut>
>
> Otherwise, everything seems to be running fine. I've scrounged around and found a couple
> Mac Minis that I'd like to add to the mini-cluster, but I'd rather figure this out
> before adding them (and maybe shifting qmaster to one of them).
>
> Any help would be greatly appreciated!
>
> Cheers and best,
> Drew
>
> P.S. Here is some more info for anyone curious....
>
> dhcp80fff96b:~ akitchen$ hostname
> dhcp80fff96b.state.edu
>
> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostname
> Hostname: dhcp80fff96b.state.edu
> Aliases: ANTH-M014 dhcp80fff96b
> Host Address(es): XXX.XXX.XXX.107
>
> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyaddr XXX.XXX.XXX.107
> Hostname: dhcp80fff96b.state.edu
> Aliases: ANTH-M014 dhcp80fff96b
> Host Address(es): XXX.XXX.XXX.107
>
> dhcp80fff96b:~ akitchen$ /GridEngine/utilbin/darwin-x64/./gethostbyname dhcp80fff96b.state.edu
> Hostname: dhcp80fff96b.state.edu
> Aliases: ANTH-M014 dhcp80fff96b
> Host Address(es): XXX.XXX.XXX.107
>
> dhcp80fff96b:~ akitchen$ cat /etc/hosts
> ##
> # Host Database
> #
> # localhost is used to configure the loopback interface
> # when the system is booting. Do not change this entry.
> ##
> 127.0.0.1 localhost
> 255.255.255.255 broadcasthost
> ::1 localhost
> fe80::1%lo0 localhost
> XXX.XXX.XXX.107 dhcp80fff96b.state.edu ANTH-M014 dhcp80fff96b
> XXX.XXX.XXX.182 dhcp80fff9b6.state.edu ANTH-M036 dhcp80fff9b6
> XXX.XXX.XXX.208 dhcp80fff9d0.state.edu ANTH-M013 dhcp80fff9d0
>
> dhcp80fff96b:~ akitchen$ qconf -shgrp @allhosts
> group_name @allhosts
> hostlist dhcp80fff96b.state.edu dhcp80fff9d0.state.edu \
>          dhcp80fff9b6.state.edu
>
> dhcp80fff96b:~ akitchen$ qconf -sel
> dhcp80fff96b.state.edu
> dhcp80fff9b6.state.edu
> dhcp80fff9d0.state.edu
>
> dhcp80fff96b:~ akitchen$ qconf -ss
> dhcp80fff96b.state.edu
>
> dhcp80fff96b:~ akitchen$ qconf -sh
> dhcp80fff96b.state.edu
> dhcp80fff9b6.state.edu
> dhcp80fff9d0.state.edu
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
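
The two checks asked about in the reply (are ports 6444 and 6445 reachable, and does each host name resolve to the expected address) could be scripted roughly as below. This is only a minimal sketch, assuming Python 3 is available on the machines. The helper names `port_open`, `resolve`, and `check_hosts` are hypothetical and not part of Grid Engine. The port numbers are the ones mentioned in the thread, with 6444 taken to be the qmaster port and 6445 the execd port.

```python
import socket

# Ports mentioned in the thread; assumed here to be 6444 (qmaster) and 6445 (execd).
SGE_PORTS = (6444, 6445)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def resolve(host):
    """Return the IPv4 address the local resolver gives for host, or None."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def check_hosts(hosts, ports=SGE_PORTS):
    """Map each host to its resolved address and per-port reachability."""
    report = {}
    for host in hosts:
        report[host] = {
            "address": resolve(host),
            "ports": {port: port_open(host, port) for port in ports},
        }
    return report
```

Calling `check_hosts(["dhcp80fff96b.state.edu", "dhcp80fff9b6.state.edu", "dhcp80fff9d0.state.edu"])` from each machine and comparing the results would show whether the resolved addresses match the /etc/hosts entries and whether any connection is refused. A closed port 6444 on a host that does not run qmaster is expected and not a fault by itself.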
